# CreativeX Score Extraction - Deployment Guide

## Overview

This guide covers deploying the CreativeX score extraction system, which:

1. Monitors Box folder 350605024645 for PDF files
2. Extracts CreativeX scores using the LlamaExtract AI agent "Creativex-Extract"
3. Stores results in a PostgreSQL database, including the full extraction JSON
4. Removes processed files from Box
5. Sends email notifications

## Local Development Setup

### 1. Add Environment Variables

Add to your `.env` file:

```bash
# Box Folder Configuration (add to existing Box section)
BOX_ROOT_FOLDER_CREATIVEX=350605024645

# CreativeX Configuration
LLAMA_CLOUD_API_KEY=your_llama_cloud_api_key_here
CREATIVEX_AGENT_NAME=Creativex-Extract
```

### 2. Install Python Dependencies

```bash
cd Python-Version
source venv/bin/activate
pip install llama-cloud-services
```

Or install all dependencies:

```bash
pip install -r requirements.txt
```

### 3. Create Database Table

**If starting fresh (full init):**

```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -f database/init.sql
```

**If database already exists (add table only):**

The schema below matches the production table in Step 4 of the server deployment, including `tracking_id`, which the master-score queries later in this guide rely on.

```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
CREATE TABLE IF NOT EXISTS creativex_scores (
    id SERIAL PRIMARY KEY,
    filename VARCHAR(500) NOT NULL,
    box_file_id VARCHAR(255),
    creativex_id VARCHAR(255),
    creativex_url TEXT,
    quality_score VARCHAR(50),
    full_extraction_data JSONB,
    tracking_id VARCHAR(6),
    extracted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    status VARCHAR(50) DEFAULT 'active',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX IF NOT EXISTS idx_creativex_filename ON creativex_scores(filename);
CREATE INDEX IF NOT EXISTS idx_creativex_box_file ON creativex_scores(box_file_id);
CREATE INDEX IF NOT EXISTS idx_creativex_status ON creativex_scores(status);
CREATE INDEX IF NOT EXISTS idx_creativex_tracking_id ON creativex_scores(tracking_id);
"
```

### 4. Verify Table Creation

```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "\d creativex_scores"
```

You should see:

- 11 columns (id, filename, box_file_id, creativex_id, creativex_url, quality_score, full_extraction_data, tracking_id, extracted_at, status, created_at)
- 4 secondary indexes (idx_creativex_filename, idx_creativex_box_file, idx_creativex_status, idx_creativex_tracking_id), plus the primary key index

### 5. Test Locally

```bash
# Run the script manually
python scripts/creativex_scoring_storing.py
```

**Expected behaviors:**

- If there are no PDFs in Box folder 350605024645: a "No PDF files found" email is sent
- If PDFs are present: extraction runs, scores are stored, and files are deleted from Box
- If extraction fails for some files: a partial-success email is sent listing the errors

## Production Server Deployment

### Prerequisites

- Server already running Ferrero automation (A1→A2, A5→A6, etc.)
- Git repository backed up to Bitbucket
- SSH access to production server

### Step 1: Update .env on Server

SSH to the server and open the file:

```bash
cd /opt/ferrero-automation/Python-Version
nano .env
```

Add:

```bash
# Box Folder Configuration (add to existing Box section)
BOX_ROOT_FOLDER_CREATIVEX=350605024645

# CreativeX Configuration
LLAMA_CLOUD_API_KEY=your_production_llama_cloud_api_key
CREATIVEX_AGENT_NAME=Creativex-Extract
```

Save and exit (Ctrl+X, Y, Enter).
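Before running the script on a new machine, it can help to confirm that the three variables above are actually set. The following is a minimal sketch and not part of the repository: the helper names `missing_creativex_vars` and `looks_like_placeholder` are hypothetical, and it assumes the values have already been loaded into the process environment (how the real script reads `.env` is not shown in this guide).

```python
import os

# Variable names are the ones this guide adds to .env.
REQUIRED_VARS = (
    "BOX_ROOT_FOLDER_CREATIVEX",
    "LLAMA_CLOUD_API_KEY",
    "CREATIVEX_AGENT_NAME",
)


def missing_creativex_vars(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not str(env.get(name, "")).strip()]


def looks_like_placeholder(value):
    """Heuristic (an assumption): the sample values in this guide start with 'your_'."""
    return value.strip().lower().startswith("your_")


if __name__ == "__main__":
    missing = missing_creativex_vars()
    if missing:
        print("Missing in environment: " + ", ".join(missing))
    elif looks_like_placeholder(os.environ["LLAMA_CLOUD_API_KEY"]):
        print("LLAMA_CLOUD_API_KEY still holds the placeholder value")
    else:
        print("CreativeX environment variables look complete")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.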
### Step 2: Pull Latest Code

```bash
cd /opt/ferrero-automation/Python-Version
git pull origin main
```

This will include:

- `scripts/creativex_scoring_storing.py`
- Updated `database/init.sql`
- Updated `scripts/shared/database.py`
- Updated `scripts/shared/notifier.py`
- Updated `config/config.yaml`
- Updated `requirements.txt`

### Step 3: Install Dependencies

```bash
cd /opt/ferrero-automation/Python-Version
source venv/bin/activate
pip install llama-cloud-services
```

Or update all:

```bash
pip install -r requirements.txt --upgrade
```

### Step 4: Create Database Table

```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
CREATE TABLE IF NOT EXISTS creativex_scores (
    id SERIAL PRIMARY KEY,
    filename VARCHAR(500) NOT NULL,
    box_file_id VARCHAR(255),
    creativex_id VARCHAR(255),
    creativex_url TEXT,
    quality_score VARCHAR(50),
    full_extraction_data JSONB,
    tracking_id VARCHAR(6),
    extracted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    status VARCHAR(50) DEFAULT 'active',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX IF NOT EXISTS idx_creativex_filename ON creativex_scores(filename);
CREATE INDEX IF NOT EXISTS idx_creativex_box_file ON creativex_scores(box_file_id);
CREATE INDEX IF NOT EXISTS idx_creativex_status ON creativex_scores(status);
CREATE INDEX IF NOT EXISTS idx_creativex_tracking_id ON creativex_scores(tracking_id);
"
```

**If Table Already Exists (Migration):**

```bash
# Add tracking_id column to existing table (safe to re-run)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
ALTER TABLE creativex_scores ADD COLUMN IF NOT EXISTS tracking_id VARCHAR(6);
CREATE INDEX IF NOT EXISTS idx_creativex_tracking_id ON creativex_scores(tracking_id);
"
```

**Note on Status Values:**

- `active` - Current derivative score (from PDF extraction)
- `superseded` - Old derivative score (version history)
- `master-cx-score` - Master asset score (from A1→A2 DAM metadata, reference only)

### Step 5: Verify Installation

```bash
# Check database table
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "SELECT COUNT(*) FROM creativex_scores;"

# Check script exists
ls -lh scripts/creativex_scoring_storing.py

# Make the script executable
chmod +x scripts/creativex_scoring_storing.py
```

### Step 6: Test Run

```bash
cd /opt/ferrero-automation/Python-Version
source venv/bin/activate
python scripts/creativex_scoring_storing.py
```

Check logs:

```bash
tail -f logs/creativex_scoring.log
```

### Step 7: Add to Cron (Optional - If Automated)

**Note:** The script is run manually for now, so skip this step initially.

If you want to automate later (e.g., every hour):

```bash
crontab -e
```

Add:

```cron
# CreativeX Score Extraction - Every hour
0 * * * * cd /opt/ferrero-automation/Python-Version && venv/bin/python scripts/creativex_scoring_storing.py >> logs/cron_creativex.log 2>&1
```

Save and exit.

## Configuration Details

### Environment Variables

All configuration is centralized in `.env`:

```bash
# Box folder for CreativeX PDFs
BOX_ROOT_FOLDER_CREATIVEX=350605024645

# LlamaCloud API credentials
LLAMA_CLOUD_API_KEY=your_api_key_here

# Agent name in LlamaExtract
CREATIVEX_AGENT_NAME=Creativex-Extract
```

### Box Folder

- **Folder ID:** Configured via `BOX_ROOT_FOLDER_CREATIVEX` (default: 350605024645)
- **Purpose:** Drop PDFs here for CreativeX score extraction
- **Behavior:** Files are automatically deleted after successful processing

### LlamaExtract Agent

- **Agent Name:** Configured via `CREATIVEX_AGENT_NAME` (default: Creativex-Extract)
- **Expected Fields:**
  - `filename`: Original filename from PDF
  - `creativeXId.id`: CreativeX identifier
  - `creativeXId.url`: CreativeX URL
  - `ferreroCreativeQuality.percentage`: Quality score

### Database Storage

- **Table:** `creativex_scores`
- **Quick Access Fields:** filename, creativex_id, creativex_url, quality_score
- **Full JSON:** Stored in `full_extraction_data` JSONB column
- **Purpose:** Future lookups by filename during DAM uploads

### Email Notifications

**Recipients configured in .env:**

- Success: `REPORT_EMAILS`
- Errors: `ERROR_EMAIL`

**Templates:**

1. `creativex_complete` - All files processed successfully
2. `creativex_partial` - Some files failed
3. `creativex_no_files` - No PDFs found (normal if folder empty)

## Usage

### Manual Execution

```bash
cd /opt/ferrero-automation/Python-Version
source venv/bin/activate
python scripts/creativex_scoring_storing.py
```

### Workflow

#### CreativeX PDF Extraction (Manual)

1. Upload PDFs to Box folder 350605024645
2. Run script: `python scripts/creativex_scoring_storing.py`
3. Script downloads each PDF
4. LlamaExtract processes PDF
5. Results stored in database with status='active'
6. PDF deleted from Box
7. Email notification sent

#### Master Asset CreativeX (Automatic)

1. A1→A2 downloads master asset from DAM
2. If master has CreativeX score/URL in metadata:
   - Extracts score and URL
   - Stores in database with status='master-cx-score'
   - Links via tracking_id
3. Used for reference/reporting only (not used in A2→A3 uploads)
4. Logs "No CreativeX data" if master not scored (normal)

### Checking Results

**IMPORTANT: Understanding Status Field**

The system uses **soft delete** to preserve history while keeping the latest scores easily accessible:

- `status = 'active'` → Latest/current derivative score (from PDF extraction)
- `status = 'superseded'` → Previous derivative score (history/audit trail)
- `status = 'master-cx-score'` → Master asset score (from A1→A2, reference only)

**Status Usage:**

- **Derivative scores (PDF extraction):** When you re-upload the same filename with a new score, the old record is marked `superseded` and a new `active` record is created.
- **Master scores (A1→A2):** Stored with `master-cx-score` status and linked via `tracking_id`. Not used for uploads, only for reference/reporting.
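The supersede-then-insert behavior described above can be sketched as follows. This is an illustration under stated assumptions rather than the code in `scripts/shared/database.py`: it uses an in-memory SQLite table with only the columns it needs as a stand-in for the PostgreSQL table, and `store_derivative_score` and `status_history` are hypothetical helper names.

```python
import sqlite3


def store_derivative_score(conn, filename, quality_score):
    """Supersede any active row for filename, then insert the new active row."""
    with conn:  # one transaction: both statements commit, or neither does
        conn.execute(
            "UPDATE creativex_scores SET status = 'superseded' "
            "WHERE filename = ? AND status = 'active'",
            (filename,),
        )
        conn.execute(
            "INSERT INTO creativex_scores (filename, quality_score, status) "
            "VALUES (?, ?, 'active')",
            (filename, quality_score),
        )


def status_history(conn, filename):
    """Return (quality_score, status) pairs for filename, oldest first."""
    rows = conn.execute(
        "SELECT quality_score, status FROM creativex_scores "
        "WHERE filename = ? ORDER BY id",
        (filename,),
    )
    return rows.fetchall()


# In-memory stand-in for the real table (assumption: reduced column set).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE creativex_scores ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "filename TEXT NOT NULL, quality_score TEXT, status TEXT DEFAULT 'active')"
)

store_derivative_score(conn, "yourfile.mp4", "72%")
store_derivative_score(conn, "yourfile.mp4", "85%")
print(status_history(conn, "yourfile.mp4"))
# [('72%', 'superseded'), ('85%', 'active')]
```

Because the update only touches rows whose status is `active`, rows stored as `master-cx-score` are left alone, and running both statements in one transaction avoids a window in which a filename has no active score.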
**Query for Latest Scores (Most Common):**

```bash
# View recent ACTIVE extractions (latest scores only)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, creativex_id, quality_score, extracted_at
FROM creativex_scores
WHERE status = 'active'
ORDER BY extracted_at DESC
LIMIT 10;
"

# Count total ACTIVE scores (unique filenames with latest scores)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT COUNT(*) as active_scores
FROM creativex_scores
WHERE status = 'active';
"

# Get latest score for specific filename (use this in A2→A3 workflow)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, creativex_id, creativex_url, quality_score, extracted_at
FROM creativex_scores
WHERE filename = 'yourfile.mp4' AND status = 'active';
"
```

**Query for Master Scores (Reference/Reporting):**

```bash
# Get master score for specific tracking ID
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, quality_score, tracking_id, creativex_url
FROM creativex_scores
WHERE tracking_id = '7xXgKp' AND status = 'master-cx-score';
"

# View all master scores
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT tracking_id, filename, quality_score, extracted_at
FROM creativex_scores
WHERE status = 'master-cx-score'
ORDER BY extracted_at DESC
LIMIT 10;
"
```

**Query for History/Audit (All Versions):**

```bash
# View ALL versions of a file (including superseded)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, quality_score, status, tracking_id, extracted_at
FROM creativex_scores
WHERE filename = 'yourfile.mp4'
ORDER BY extracted_at DESC;
"

# Count total records by status
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT
    COUNT(*) as total_records,
    COUNT(*) FILTER (WHERE status = 'active') as active_derivative_scores,
    COUNT(*) FILTER (WHERE status = 'superseded') as superseded_records,
    COUNT(*) FILTER (WHERE status = 'master-cx-score') as master_scores
FROM creativex_scores;
"

# See score changes over time for a file
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT
    filename,
    quality_score,
    status,
    extracted_at,
    CASE WHEN status = 'active' THEN 'CURRENT' ELSE 'OLD VERSION' END as version_label
FROM creativex_scores
WHERE filename LIKE '%Nutella%'
ORDER BY filename, extracted_at DESC;
"
```

### Viewing Full JSON

```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, full_extraction_data::jsonb
FROM creativex_scores
WHERE filename = 'example.pdf';
"
```

## Future Integration: A2→A3 Workflow

### How to Use in DAM Upload Scripts

The database method `db.get_creativex_score_by_filename(filename)` is ready for use in other scripts.

**IMPORTANT:** The method automatically filters for `status = 'active'` to always return the **latest** score.
**Example usage in a2_to_a3_upload_polling.py:**

```python
# In a2_to_a3_upload_polling.py or similar
filename = "Brand_Country_Language_123456.mp4"

# Lookup CreativeX score (returns ONLY active/latest score)
score_data = db.get_creativex_score_by_filename(filename)

if score_data:
    # Add to DAM metadata
    dam_metadata['FERRERO.FIELD.CREATIVEX_SCORE'] = score_data['quality_score']
    dam_metadata['FERRERO.FIELD.CREATIVEX_URL'] = score_data['creativex_url']
    dam_metadata['FERRERO.FIELD.CREATIVEX_ID'] = score_data['creativex_id']

    # Optional: Access full JSON for additional fields
    full_data = score_data['full_extraction_data']
    dam_metadata['FERRERO.FIELD.CREATIVEX_BRAND'] = full_data['data']['brand']
    dam_metadata['FERRERO.FIELD.CREATIVEX_MARKET'] = full_data['data']['market']

    logger.info("Added CreativeX score {} to DAM metadata".format(
        score_data['quality_score']
    ))
else:
    logger.warning("No CreativeX score found for: {}".format(filename))
```

### Query Logic in get_creativex_score_by_filename()

The method uses this query internally:

```sql
SELECT filename, creativex_id, creativex_url, quality_score,
       box_file_id, full_extraction_data, extracted_at
FROM creativex_scores
WHERE filename = %s AND status = 'active'
ORDER BY extracted_at DESC
LIMIT 1
```

This ensures you **always get the latest score**, even if multiple versions exist in history.

### Behavior Summary for A2→A3 Integration

| Scenario | What Happens |
|----------|--------------|
| Score exists for filename | Returns latest `active` score |
| Multiple scores exist (history) | Returns only the newest `active` one |
| No score exists | Returns `None` |
| File re-scored (same filename) | Old score marked `superseded`, new score is `active` |

**Key Takeaway:** You never need to worry about duplicates or history in the A2→A3 workflow. The query handles both automatically.
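When wiring the lookup into an upload script, the optional fields inside `full_extraction_data` may not be present for every PDF. The sketch below shows one defensive way to build the metadata; it is not the project's implementation. `build_creativex_metadata` is a hypothetical helper, the `data.brand` and `data.market` paths are taken from the usage example in this guide, and it is an assumption that the record behaves like a dictionary and that the JSONB column may arrive either as a dict or as a JSON string depending on the database driver.

```python
import json


def build_creativex_metadata(score_data):
    """Map a score record (or None) to DAM fields, skipping missing values."""
    if not score_data:
        return {}  # no active score for this filename

    fields = {
        "FERRERO.FIELD.CREATIVEX_SCORE": score_data.get("quality_score"),
        "FERRERO.FIELD.CREATIVEX_URL": score_data.get("creativex_url"),
        "FERRERO.FIELD.CREATIVEX_ID": score_data.get("creativex_id"),
    }

    # Assumption: JSONB may come back decoded (dict) or as raw JSON text.
    full_data = score_data.get("full_extraction_data") or {}
    if isinstance(full_data, str):
        full_data = json.loads(full_data)

    details = full_data.get("data") or {}
    fields["FERRERO.FIELD.CREATIVEX_BRAND"] = details.get("brand")
    fields["FERRERO.FIELD.CREATIVEX_MARKET"] = details.get("market")

    # Drop anything the extraction did not provide.
    return {key: value for key, value in fields.items() if value is not None}


record = {
    "quality_score": "85%",
    "creativex_id": "cx-1",
    "creativex_url": None,
    "full_extraction_data": '{"data": {"brand": "Nutella"}}',
}
print(sorted(build_creativex_metadata(record)))
```

A caller could then merge the result with `dam_metadata.update(...)`, so a missing brand or market leaves the other fields intact instead of raising `KeyError`.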
## Troubleshooting

### "llama-cloud-services not installed"

```bash
source venv/bin/activate
pip install llama-cloud-services
```

### "Agent 'Creativex-Extract' not found"

- Verify agent name in LlamaCloud portal
- Check spelling matches exactly: `Creativex-Extract`
- Verify API key is correct

### "No PDF files found"

- This is normal if Box folder 350605024645 is empty
- Upload a test PDF to the folder and re-run

### "Database connection failed"

```bash
# Check PostgreSQL is running
docker ps | grep ferrero

# Test connection
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "SELECT 1;"
```

### "Email not sending"

- Check SMTP configuration in .env
- Verify Mailgun credentials
- Check logs for detailed error

### Files not deleted from Box

- This is expected for failed extractions
- Only successful extractions delete files
- Failed files remain for manual review/retry

## Rollback Instructions

If you need to roll back:

### Remove Database Table

This permanently deletes all stored scores, including history.

```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
DROP TABLE IF EXISTS creativex_scores CASCADE;
"
```

### Remove from Cron

```bash
crontab -e
# Delete the CreativeX line, save and exit
```

### Revert Code

```bash
cd /opt/ferrero-automation/Python-Version
# Replace <commit-hash> with the commit that introduced the CreativeX changes
git revert <commit-hash>
git push origin main
```

## Support

- **Logs:** `logs/creativex_scoring.log`
- **Database Queries:** See "Checking Results" section above
- **Email Test:** Check SMTP settings and recipients list
- **LlamaCloud Issues:** Verify API key and agent configuration

## Summary Checklist

**Local Setup:**

- [ ] Add `LLAMA_CLOUD_API_KEY` to .env
- [ ] Install `llama-cloud-services` package
- [ ] Create `creativex_scores` table
- [ ] Test script runs successfully

**Production Deployment:**

- [ ] Git pull latest code
- [ ] Add `LLAMA_CLOUD_API_KEY` to server .env
- [ ] Install dependencies on server
- [ ] Create database table on server
- [ ] Test run on server
- [ ] Verify email notifications
- [ ] (Optional) Add to cron if automating

**Post-Deployment:**

- [ ] Upload test PDF to Box folder 350605024645
- [ ] Run script and verify extraction
- [ ] Check database record created
- [ ] Verify PDF deleted from Box
- [ ] Confirm email notification received