# CreativeX Score Extraction - Deployment Guide

## Overview

This guide covers deploying the CreativeX score extraction system, which:

1. Monitors Box folder 350605024645 for PDF files
2. Extracts CreativeX scores using the LlamaExtract AI agent "Creativex-Extract"
3. Stores results in a PostgreSQL database with the full JSON
4. Removes processed files from Box
5. Sends email notifications

## Local Development Setup

### 1. Add Environment Variables

Add to your `.env` file:

```bash
# Box Folder Configuration (add to existing Box section)
BOX_ROOT_FOLDER_CREATIVEX=350605024645

# CreativeX Configuration
LLAMA_CLOUD_API_KEY=your_llama_cloud_api_key_here
CREATIVEX_AGENT_NAME=Creativex-Extract
```

### 2. Install Python Dependencies

```bash
cd Python-Version
source venv/bin/activate
pip install llama-cloud-services
```

Or install all dependencies:

```bash
pip install -r requirements.txt
```

### 3. Create Database Table

**If starting fresh (full init):**

```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -f database/init.sql
```

**If database already exists (add table only):**

```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
CREATE TABLE IF NOT EXISTS creativex_scores (
    id SERIAL PRIMARY KEY,
    filename VARCHAR(500) NOT NULL,
    box_file_id VARCHAR(255),
    creativex_id VARCHAR(255),
    creativex_url TEXT,
    quality_score VARCHAR(50),
    full_extraction_data JSONB,
    extracted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    status VARCHAR(50) DEFAULT 'active',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_creativex_filename ON creativex_scores(filename);
CREATE INDEX IF NOT EXISTS idx_creativex_box_file ON creativex_scores(box_file_id);
CREATE INDEX IF NOT EXISTS idx_creativex_status ON creativex_scores(status);
"
```

### 4. Verify Table Creation

```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "\d creativex_scores"
```

You should see:

- 10 columns (id, filename, box_file_id, creativex_id, creativex_url, quality_score, full_extraction_data, extracted_at, status, created_at)
- 3 indexes (idx_creativex_filename, idx_creativex_box_file, idx_creativex_status), listed alongside the automatically created primary key index `creativex_scores_pkey`

### 5. Test Locally

```bash
# Run the script manually
python scripts/creativex_scoring_storing.py
```

**Expected behaviors:**

- If no PDFs in Box folder 350605024645: "No PDF files found" email sent
- If PDFs present: Extraction runs, scores stored, files deleted from Box
- If extraction fails: Partial success email with errors

## Production Server Deployment

### Prerequisites

- Server already running Ferrero automation (A1→A2, A5→A6, etc.)
- Git repository backed up to Bitbucket
- SSH access to production server

### Step 1: Update .env on Server

SSH to server and add:

```bash
cd /opt/ferrero-automation/Python-Version
nano .env
```

Add:

```bash
# Box Folder Configuration (add to existing Box section)
BOX_ROOT_FOLDER_CREATIVEX=350605024645

# CreativeX Configuration
LLAMA_CLOUD_API_KEY=your_production_llama_cloud_api_key
CREATIVEX_AGENT_NAME=Creativex-Extract
```

Save and exit (Ctrl+X, Y, Enter).
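Once the file is saved, a quick check that all three settings are present can prevent a confusing failure on the first run. The following is a minimal sketch, assuming the `.env` values have already been loaded into the process environment; `missing_creativex_settings` is a hypothetical helper and is not part of the repository.

```python
import os

# Variable names as listed in this guide.
REQUIRED_VARS = (
    "BOX_ROOT_FOLDER_CREATIVEX",
    "LLAMA_CLOUD_API_KEY",
    "CREATIVEX_AGENT_NAME",
)


def missing_creativex_settings(env=None):
    """Return the names of required variables that are unset or blank.

    `env` defaults to os.environ; pass a plain dict to check other sources.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not str(env.get(name, "")).strip()]
```

Calling `missing_creativex_settings()` returns an empty list when everything is configured; any names it returns are the ones still to be added to `.env`.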
### Step 2: Pull Latest Code

```bash
cd /opt/ferrero-automation/Python-Version
git pull origin main
```

This will include:

- `scripts/creativex_scoring_storing.py`
- Updated `database/init.sql`
- Updated `scripts/shared/database.py`
- Updated `scripts/shared/notifier.py`
- Updated `config/config.yaml`
- Updated `requirements.txt`

### Step 3: Install Dependencies

```bash
cd /opt/ferrero-automation/Python-Version
source venv/bin/activate
pip install llama-cloud-services
```

Or update all:

```bash
pip install -r requirements.txt --upgrade
```

### Step 4: Create Database Table

```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
CREATE TABLE IF NOT EXISTS creativex_scores (
    id SERIAL PRIMARY KEY,
    filename VARCHAR(500) NOT NULL,
    box_file_id VARCHAR(255),
    creativex_id VARCHAR(255),
    creativex_url TEXT,
    quality_score VARCHAR(50),
    full_extraction_data JSONB,
    extracted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    status VARCHAR(50) DEFAULT 'active',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_creativex_filename ON creativex_scores(filename);
CREATE INDEX IF NOT EXISTS idx_creativex_box_file ON creativex_scores(box_file_id);
CREATE INDEX IF NOT EXISTS idx_creativex_status ON creativex_scores(status);
"
```

**Note on Existing Data:** If you already have records in the table from testing, they will have `status = 'active'` by default. This is correct - they are the current versions. When you re-upload the same filename, the system will mark the old record as `superseded` and create a new `active` record automatically.
### Step 5: Verify Installation

```bash
# Check database table
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "SELECT COUNT(*) FROM creativex_scores;"

# Check script exists
ls -lh scripts/creativex_scoring_storing.py

# Make sure it's executable
chmod +x scripts/creativex_scoring_storing.py
```

### Step 6: Test Run

```bash
cd /opt/ferrero-automation/Python-Version
source venv/bin/activate
python scripts/creativex_scoring_storing.py
```

Check logs:

```bash
tail -f logs/creativex_scoring.log
```

### Step 7: Add to Cron (Optional - If Automated)

**Note:** The extraction is run manually for now, so skip this step initially.

If you want to automate later (e.g., every hour):

```bash
crontab -e
```

Add:

```cron
# CreativeX Score Extraction - Every hour
0 * * * * cd /opt/ferrero-automation/Python-Version && venv/bin/python scripts/creativex_scoring_storing.py >> logs/cron_creativex.log 2>&1
```

Save and exit.

## Configuration Details

### Environment Variables

All configuration is centralized in `.env`:

```bash
# Box folder for CreativeX PDFs
BOX_ROOT_FOLDER_CREATIVEX=350605024645

# LlamaCloud API credentials
LLAMA_CLOUD_API_KEY=your_api_key_here

# Agent name in LlamaExtract
CREATIVEX_AGENT_NAME=Creativex-Extract
```

### Box Folder

- **Folder ID:** Configured via `BOX_ROOT_FOLDER_CREATIVEX` (default: 350605024645)
- **Purpose:** Drop PDFs here for CreativeX score extraction
- **Behavior:** Files are automatically deleted after successful processing

### LlamaExtract Agent

- **Agent Name:** Configured via `CREATIVEX_AGENT_NAME` (default: Creativex-Extract)
- **Expected Fields:**
  - `filename`: Original filename from PDF
  - `creativeXId.id`: CreativeX identifier
  - `creativeXId.url`: CreativeX URL
  - `ferreroCreativeQuality.percentage`: Quality score

### Database Storage

- **Table:** `creativex_scores`
- **Quick Access Fields:** filename, creativex_id, creativex_url, quality_score
- **Full JSON:** Stored in `full_extraction_data` JSONB column
- **Purpose:** Future lookups by filename during DAM uploads

### Email Notifications

**Recipients configured in .env:**

- Success: `REPORT_EMAILS`
- Errors: `ERROR_EMAIL`

**Templates:**

1. `creativex_complete` - All files processed successfully
2. `creativex_partial` - Some files failed
3. `creativex_no_files` - No PDFs found (normal if folder empty)

## Usage

### Manual Execution

```bash
cd /opt/ferrero-automation/Python-Version
source venv/bin/activate
python scripts/creativex_scoring_storing.py
```

### Workflow

1. Upload PDFs to Box folder 350605024645
2. Run script (manual or cron)
3. Script downloads each PDF
4. LlamaExtract processes PDF
5. Results stored in database
6. PDF deleted from Box
7. Email notification sent

### Checking Results

**IMPORTANT: Understanding Status Field**

The system uses **soft delete** to preserve history while keeping latest scores easily accessible:

- `status = 'active'` → Latest/current score for this filename
- `status = 'superseded'` → Previous score (history/audit trail)

When you re-upload the same filename with a new score, the old record is marked `superseded` and a new `active` record is created.
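The supersede-then-insert behavior described above can be illustrated in a few lines. This is a minimal sketch only: it uses an in-memory SQLite database as a stand-in for PostgreSQL, a reduced version of the table, and made-up score values, and `store_score` is a hypothetical name rather than the script's actual function.

```python
import sqlite3


def store_score(conn, filename, quality_score):
    """Mark any active row for `filename` as superseded, then insert the new active row."""
    with conn:  # one transaction: both statements commit or roll back together
        conn.execute(
            "UPDATE creativex_scores SET status = 'superseded' "
            "WHERE filename = ? AND status = 'active'",
            (filename,),
        )
        conn.execute(
            "INSERT INTO creativex_scores (filename, quality_score, status) "
            "VALUES (?, ?, 'active')",
            (filename, quality_score),
        )


def demo():
    """Store two scores for the same filename and return (score, status) rows in insert order."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE creativex_scores ("
        "id INTEGER PRIMARY KEY, filename TEXT NOT NULL, "
        "quality_score TEXT, status TEXT DEFAULT 'active')"
    )
    store_score(conn, "yourfile.mp4", "72%")  # illustrative value
    store_score(conn, "yourfile.mp4", "85%")  # re-upload of the same filename
    return conn.execute(
        "SELECT quality_score, status FROM creativex_scores ORDER BY id"
    ).fetchall()
```

Running both statements in one transaction means a failure between them cannot leave a filename with no `active` row.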
**Query for Latest Scores (Most Common):**

```bash
# View recent ACTIVE extractions (latest scores only)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, creativex_id, quality_score, extracted_at
FROM creativex_scores
WHERE status = 'active'
ORDER BY extracted_at DESC
LIMIT 10;
"

# Count total ACTIVE scores (unique filenames with latest scores)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT COUNT(*) as active_scores
FROM creativex_scores
WHERE status = 'active';
"

# Get latest score for specific filename (use this in A2→A3 workflow)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, creativex_id, creativex_url, quality_score, extracted_at
FROM creativex_scores
WHERE filename = 'yourfile.mp4' AND status = 'active';
"
```

**Query for History/Audit (All Versions):**

```bash
# View ALL versions of a file (including superseded)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, quality_score, status, extracted_at
FROM creativex_scores
WHERE filename = 'yourfile.mp4'
ORDER BY extracted_at DESC;
"

# Count total records (including history)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT
    COUNT(*) as total_records,
    COUNT(*) FILTER (WHERE status = 'active') as active_records,
    COUNT(*) FILTER (WHERE status = 'superseded') as superseded_records
FROM creativex_scores;
"

# See score changes over time for a file
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT
    filename,
    quality_score,
    status,
    extracted_at,
    CASE WHEN status = 'active' THEN 'CURRENT' ELSE 'OLD VERSION' END as version_label
FROM creativex_scores
WHERE filename LIKE '%Nutella%'
ORDER BY filename, extracted_at DESC;
"
```

### Viewing Full JSON

```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, full_extraction_data::jsonb
FROM creativex_scores
WHERE filename = 'example.pdf';
"
```

## Future Integration: A2→A3 Workflow

### How to Use in DAM Upload Scripts

The database method `db.get_creativex_score_by_filename(filename)` is ready for use in other scripts.

**IMPORTANT:** The method automatically filters for `status = 'active'` to always return the **latest** score.

**Example usage in a2_to_a3_upload_polling.py:**

```python
# In a2_to_a3_upload_polling.py or similar
filename = "Brand_Country_Language_123456.mp4"

# Lookup CreativeX score (returns ONLY active/latest score)
score_data = db.get_creativex_score_by_filename(filename)

if score_data:
    # Add to DAM metadata
    dam_metadata['FERRERO.FIELD.CREATIVEX_SCORE'] = score_data['quality_score']
    dam_metadata['FERRERO.FIELD.CREATIVEX_URL'] = score_data['creativex_url']
    dam_metadata['FERRERO.FIELD.CREATIVEX_ID'] = score_data['creativex_id']

    # Optional: Access full JSON for additional fields
    full_data = score_data['full_extraction_data']
    dam_metadata['FERRERO.FIELD.CREATIVEX_BRAND'] = full_data['data']['brand']
    dam_metadata['FERRERO.FIELD.CREATIVEX_MARKET'] = full_data['data']['market']

    logger.info("Added CreativeX score {} to DAM metadata".format(
        score_data['quality_score']
    ))
else:
    logger.warning("No CreativeX score found for: {}".format(filename))
```

### Query Logic in get_creativex_score_by_filename()

The method uses this query internally:

```sql
SELECT filename, creativex_id, creativex_url, quality_score,
       box_file_id, full_extraction_data, extracted_at
FROM creativex_scores
WHERE filename = %s AND status = 'active'
ORDER BY extracted_at DESC
LIMIT 1
```

This ensures you **always get the latest score**, even if multiple versions exist in history.
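For readers who want to see how that query maps to a Python return value, here is a minimal sketch. It assumes a DB-API cursor that accepts `%s` placeholders (as psycopg2 does); `lookup_active_score` is a hypothetical stand-in, and the repository's actual method may be written differently.

```python
QUERY = """
    SELECT filename, creativex_id, creativex_url, quality_score,
           box_file_id, full_extraction_data, extracted_at
    FROM creativex_scores
    WHERE filename = %s AND status = 'active'
    ORDER BY extracted_at DESC
    LIMIT 1
"""

# Column order must match the SELECT list above.
COLUMNS = (
    "filename", "creativex_id", "creativex_url", "quality_score",
    "box_file_id", "full_extraction_data", "extracted_at",
)


def lookup_active_score(cursor, filename):
    """Return the active row for `filename` as a dict, or None if there is none."""
    cursor.execute(QUERY, (filename,))  # parameterized: filename is never interpolated
    row = cursor.fetchone()
    return dict(zip(COLUMNS, row)) if row is not None else None
```

Passing the filename as a query parameter, rather than formatting it into the SQL string, keeps filenames containing quotes from breaking the query.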
### Behavior Summary for A2→A3 Integration

| Scenario | What Happens |
|----------|--------------|
| Score exists for filename | Returns latest `active` score |
| Multiple scores exist (history) | Returns only the newest `active` one |
| No score exists | Returns `None` |
| File re-scored (same filename) | Old score marked `superseded`, new score is `active` |

**Key Takeaway:** You never need to worry about duplicates or history in the A2→A3 workflow. The query handles it automatically.

## Troubleshooting

### "llama-cloud-services not installed"

```bash
source venv/bin/activate
pip install llama-cloud-services
```

### "Agent 'Creativex-Extract' not found"

- Verify agent name in LlamaCloud portal
- Check spelling matches exactly: `Creativex-Extract`
- Verify API key is correct

### "No PDF files found"

- This is normal if Box folder 350605024645 is empty
- Upload test PDF to folder and re-run

### "Database connection failed"

```bash
# Check PostgreSQL is running
docker ps | grep ferrero

# Test connection
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "SELECT 1;"
```

### "Email not sending"

- Check SMTP configuration in .env
- Verify Mailgun credentials
- Check logs for detailed error

### Files not deleted from Box

- This is expected for failed extractions
- Only successful extractions delete files
- Failed files remain for manual review/retry

## Rollback Instructions

If you need to roll back:

### Remove Database Table

```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
DROP TABLE IF EXISTS creativex_scores CASCADE;
"
```

### Remove from Cron

```bash
crontab -e
# Delete the CreativeX line, save and exit
```

### Revert Code

```bash
cd /opt/ferrero-automation/Python-Version
# Replace <commit-sha> with the commit that introduced the CreativeX changes
git revert <commit-sha>
git push origin main
```

## Support

- **Logs:** `logs/creativex_scoring.log`
- **Database Queries:** See "Checking Results" section above
- **Email Test:** Check SMTP settings and recipients list
- **LlamaCloud Issues:** Verify API key and agent configuration

## Summary Checklist

**Local Setup:**

- [ ] Add `LLAMA_CLOUD_API_KEY` to .env
- [ ] Install `llama-cloud-services` package
- [ ] Create `creativex_scores` table
- [ ] Test script runs successfully

**Production Deployment:**

- [ ] Git pull latest code
- [ ] Add `LLAMA_CLOUD_API_KEY` to server .env
- [ ] Install dependencies on server
- [ ] Create database table on server
- [ ] Test run on server
- [ ] Verify email notifications
- [ ] (Optional) Add to cron if automating

**Post-Deployment:**

- [ ] Upload test PDF to Box folder 350605024645
- [ ] Run script and verify extraction
- [ ] Check database record created
- [ ] Verify PDF deleted from Box
- [ ] Confirm email notification received