CreativeX Score Extraction - Deployment Guide

Overview

This guide covers deploying the CreativeX score extraction system, which:

  1. Monitors Box folder 350605024645 for PDF files
  2. Extracts CreativeX scores using LlamaExtract AI agent "Creativex-Extract"
  3. Stores results in PostgreSQL database with full JSON
  4. Removes processed files from Box
  5. Sends email notifications

Local Development Setup

1. Add Environment Variables

Add to your .env file:

# Box Folder Configuration (add to existing Box section)
BOX_ROOT_FOLDER_CREATIVEX=350605024645

# CreativeX Configuration
LLAMA_CLOUD_API_KEY=your_llama_cloud_api_key_here
CREATIVEX_AGENT_NAME=Creativex-Extract
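
Before running the script, it can help to confirm that the three variables are actually visible to Python. The following is a minimal sketch, not part of the repository: the helper name missing_creativex_vars is hypothetical, and it assumes the .env values have already been loaded into the process environment by whatever loader the project uses.

```python
import os

# Variables named in this guide; the helper below is illustrative only.
REQUIRED_VARS = (
    "BOX_ROOT_FOLDER_CREATIVEX",
    "LLAMA_CLOUD_API_KEY",
    "CREATIVEX_AGENT_NAME",
)


def missing_creativex_vars(env=None):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not str(env.get(name, "")).strip()]


if __name__ == "__main__":
    missing = missing_creativex_vars()
    if missing:
        print("Missing in environment: " + ", ".join(missing))
    else:
        print("All CreativeX variables are set")
```

Passing a plain dict instead of os.environ makes the check easy to exercise without touching the real environment.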

2. Install Python Dependencies

cd Python-Version
source venv/bin/activate
pip install llama-cloud-services

Or install all dependencies:

pip install -r requirements.txt

3. Create Database Table

If starting fresh (full init):

PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -f database/init.sql

If database already exists (add table only):

PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
CREATE TABLE IF NOT EXISTS creativex_scores (
    id SERIAL PRIMARY KEY,
    filename VARCHAR(500) NOT NULL,
    box_file_id VARCHAR(255),
    creativex_id VARCHAR(255),
    creativex_url TEXT,
    quality_score VARCHAR(50),
    full_extraction_data JSONB,
    tracking_id VARCHAR(6),
    extracted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    status VARCHAR(50) DEFAULT 'active',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_creativex_filename ON creativex_scores(filename);
CREATE INDEX IF NOT EXISTS idx_creativex_box_file ON creativex_scores(box_file_id);
CREATE INDEX IF NOT EXISTS idx_creativex_status ON creativex_scores(status);
CREATE INDEX IF NOT EXISTS idx_creativex_tracking_id ON creativex_scores(tracking_id);
"

4. Verify Table Creation

PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "\d creativex_scores"

You should see:

  • 11 columns (id, filename, box_file_id, creativex_id, creativex_url, quality_score, full_extraction_data, tracking_id, extracted_at, status, created_at)
  • 4 indexes (idx_creativex_filename, idx_creativex_box_file, idx_creativex_status, idx_creativex_tracking_id), in addition to the primary key index creativex_scores_pkey

5. Test Locally

# Run the script manually
python scripts/creativex_scoring_storing.py

Expected behaviors:

  • If no PDFs in Box folder 350605024645: "No PDF files found" email sent
  • If PDFs present: Extraction runs, scores stored, files deleted from Box
  • If extraction fails: Partial success email with errors

Production Server Deployment

Prerequisites

  • Server already running Ferrero automation (A1→A2, A5→A6, etc.)
  • Git repository backed up to Bitbucket
  • SSH access to production server

Step 1: Update .env on Server

SSH to the server and open .env:

cd /opt/ferrero-automation/Python-Version
nano .env

Add:

# Box Folder Configuration (add to existing Box section)
BOX_ROOT_FOLDER_CREATIVEX=350605024645

# CreativeX Configuration
LLAMA_CLOUD_API_KEY=your_production_llama_cloud_api_key
CREATIVEX_AGENT_NAME=Creativex-Extract

Save and exit (Ctrl+X, Y, Enter).

Step 2: Pull Latest Code

cd /opt/ferrero-automation/Python-Version
git pull origin main

This will include:

  • scripts/creativex_scoring_storing.py
  • Updated database/init.sql
  • Updated scripts/shared/database.py
  • Updated scripts/shared/notifier.py
  • Updated config/config.yaml
  • Updated requirements.txt

Step 3: Install Dependencies

cd /opt/ferrero-automation/Python-Version
source venv/bin/activate
pip install llama-cloud-services

Or update all:

pip install -r requirements.txt --upgrade

Step 4: Create Database Table

PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
CREATE TABLE IF NOT EXISTS creativex_scores (
    id SERIAL PRIMARY KEY,
    filename VARCHAR(500) NOT NULL,
    box_file_id VARCHAR(255),
    creativex_id VARCHAR(255),
    creativex_url TEXT,
    quality_score VARCHAR(50),
    full_extraction_data JSONB,
    tracking_id VARCHAR(6),
    extracted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    status VARCHAR(50) DEFAULT 'active',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_creativex_filename ON creativex_scores(filename);
CREATE INDEX IF NOT EXISTS idx_creativex_box_file ON creativex_scores(box_file_id);
CREATE INDEX IF NOT EXISTS idx_creativex_status ON creativex_scores(status);
CREATE INDEX IF NOT EXISTS idx_creativex_tracking_id ON creativex_scores(tracking_id);
"

If Table Already Exists (Migration):

# Add tracking_id column to existing table
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
ALTER TABLE creativex_scores ADD COLUMN IF NOT EXISTS tracking_id VARCHAR(6);
CREATE INDEX IF NOT EXISTS idx_creativex_tracking_id ON creativex_scores(tracking_id);
"

Note on Status Values:

  • active - Current derivative score (from PDF extraction)
  • superseded - Old derivative score (version history)
  • master-cx-score - Master asset score (from A1→A2 DAM metadata, reference only)

Step 5: Verify Installation

# Check database table
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "SELECT COUNT(*) FROM creativex_scores;"

# Check script exists
ls -lh scripts/creativex_scoring_storing.py

# Make sure it is executable
chmod +x scripts/creativex_scoring_storing.py

Step 6: Test Run

cd /opt/ferrero-automation/Python-Version
source venv/bin/activate
python scripts/creativex_scoring_storing.py

Check logs:

tail -f logs/creativex_scoring.log

Step 7: Add to Cron (Optional - If Automated)

Note: The script is currently run manually, so skip this step for now.

If you want to automate later (e.g., every hour):

crontab -e

Add:

# CreativeX Score Extraction - Every hour
0 * * * * cd /opt/ferrero-automation/Python-Version && venv/bin/python scripts/creativex_scoring_storing.py >> logs/cron_creativex.log 2>&1

Save and exit.

Configuration Details

Environment Variables

All configuration is centralized in .env:

# Box folder for CreativeX PDFs
BOX_ROOT_FOLDER_CREATIVEX=350605024645

# LlamaCloud API credentials
LLAMA_CLOUD_API_KEY=your_api_key_here

# Agent name in LlamaExtract
CREATIVEX_AGENT_NAME=Creativex-Extract

Box Folder

  • Folder ID: Configured via BOX_ROOT_FOLDER_CREATIVEX (default: 350605024645)
  • Purpose: Drop PDFs here for CreativeX score extraction
  • Behavior: Files are automatically deleted after successful processing

LlamaExtract Agent

  • Agent Name: Configured via CREATIVEX_AGENT_NAME (default: Creativex-Extract)
  • Expected Fields:
    • filename: Original filename from PDF
    • creativeXId.id: CreativeX identifier
    • creativeXId.url: CreativeX URL
    • ferreroCreativeQuality.percentage: Quality score

Database Storage

  • Table: creativex_scores
  • Quick Access Fields: filename, creativex_id, creativex_url, quality_score
  • Full JSON: Stored in full_extraction_data JSONB column
  • Purpose: Future lookups by filename during DAM uploads

Email Notifications

Recipients configured in .env:

  • Success: REPORT_EMAILS
  • Errors: ERROR_EMAIL

Templates:

  1. creativex_complete - All files processed successfully
  2. creativex_partial - Some files failed
  3. creativex_no_files - No PDFs found (normal if folder empty)
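
One way to picture how a run maps to these three templates is the small sketch below. The function choose_template and its counters are hypothetical names, and the rule that any failure selects creativex_partial is an assumption based on the expected behaviors described in this guide, not a copy of the notifier code.

```python
def choose_template(succeeded, failed):
    """Pick an email template name from per-run success and failure counts."""
    if succeeded == 0 and failed == 0:
        # Nothing was found in the Box folder
        return "creativex_no_files"
    if failed > 0:
        # At least one PDF could not be processed
        return "creativex_partial"
    return "creativex_complete"
```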

Usage

Manual Execution

cd /opt/ferrero-automation/Python-Version
source venv/bin/activate
python scripts/creativex_scoring_storing.py

Workflow

CreativeX PDF Extraction (Manual):

  1. Upload PDFs to Box folder 350605024645
  2. Run script: python scripts/creativex_scoring_storing.py
  3. Script downloads each PDF
  4. LlamaExtract processes PDF
  5. Results stored in database with status='active'
  6. PDF deleted from Box
  7. Email notification sent
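
The ordering of steps 3 to 6 matters: a PDF should only be deleted once its score is safely stored. A minimal sketch of that loop is shown below, with the Box, LlamaExtract, and database calls replaced by injected callables. All names here (process_pdfs, extract, store, delete) are hypothetical and do not reflect the actual script's structure.

```python
def process_pdfs(pdf_ids, extract, store, delete):
    """Process each PDF; delete it only after extraction and storage succeed.

    extract(pdf_id) -> result, store(pdf_id, result), and delete(pdf_id)
    are stand-ins for the real Box, LlamaExtract, and database calls.
    """
    succeeded, failed = [], []
    for pdf_id in pdf_ids:
        try:
            store(pdf_id, extract(pdf_id))
        except Exception as exc:
            # Leave the file in place so it can be reviewed and retried
            failed.append((pdf_id, str(exc)))
        else:
            delete(pdf_id)
            succeeded.append(pdf_id)
    return succeeded, failed
```

The two returned lists are enough to decide which notification to send and to list the failed files in the email body.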

Master Asset CreativeX (Automatic):

  1. A1→A2 downloads master asset from DAM
  2. If master has CreativeX score/URL in metadata:
    • Extracts score and URL
    • Stores in database with status='master-cx-score'
    • Links via tracking_id
  3. Used for reference/reporting only (not used in A2→A3 uploads)
  4. Logs "No CreativeX data" if master not scored (normal)

Checking Results

IMPORTANT: Understanding Status Field

The system uses a soft delete: old scores are never removed, only relabelled, so history is preserved while the latest score stays easy to query:

  • status = 'active' → Latest/current derivative score (from PDF extraction)
  • status = 'superseded' → Previous derivative score (history/audit trail)
  • status = 'master-cx-score' → Master asset score (from A1→A2, reference only)

Status Usage:

  • Derivative scores (PDF extraction): When you re-upload the same filename with a new score, the old record is marked superseded and a new active record is created.
  • Master scores (A1→A2): Stored with master-cx-score status and linked via tracking_id. Not used for uploads, only for reference/reporting.
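
The supersede-then-insert behavior for derivative scores can be sketched as follows. This is an illustration under stated assumptions: it uses an in-memory SQLite database with a cut-down table so it runs anywhere, the function store_derivative_score is a hypothetical name, and the score values are made up. The production system uses PostgreSQL, and its actual method in scripts/shared/database.py may differ.

```python
import sqlite3


def store_derivative_score(conn, filename, quality_score):
    """Mark any current active row as superseded, then insert the new score."""
    with conn:  # commit both statements together, or roll both back
        conn.execute(
            "UPDATE creativex_scores SET status = 'superseded' "
            "WHERE filename = ? AND status = 'active'",
            (filename,),
        )
        conn.execute(
            "INSERT INTO creativex_scores (filename, quality_score, status) "
            "VALUES (?, ?, 'active')",
            (filename, quality_score),
        )


# Simplified stand-in for the real table (illustrative columns only)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE creativex_scores ("
    "id INTEGER PRIMARY KEY, filename TEXT NOT NULL, "
    "quality_score TEXT, status TEXT DEFAULT 'active')"
)
store_derivative_score(conn, "yourfile.mp4", "72%")
store_derivative_score(conn, "yourfile.mp4", "85%")
rows = conn.execute(
    "SELECT quality_score, status FROM creativex_scores ORDER BY id"
).fetchall()
print(rows)  # [('72%', 'superseded'), ('85%', 'active')]
```

Because the UPDATE only touches rows with status = 'active', rows stored as master-cx-score are left alone in this sketch.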

Query for Latest Scores (Most Common):

# View recent ACTIVE extractions (latest scores only)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, creativex_id, quality_score, extracted_at
FROM creativex_scores
WHERE status = 'active'
ORDER BY extracted_at DESC
LIMIT 10;
"

# Count total ACTIVE scores (unique filenames with latest scores)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT COUNT(*) as active_scores FROM creativex_scores WHERE status = 'active';
"

# Get latest score for specific filename (use this in A2→A3 workflow)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, creativex_id, creativex_url, quality_score, extracted_at
FROM creativex_scores
WHERE filename = 'yourfile.mp4' AND status = 'active';
"

Query for Master Scores (Reference/Reporting):

# Get master score for specific tracking ID
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, quality_score, tracking_id, creativex_url
FROM creativex_scores
WHERE tracking_id = '7xXgKp' AND status = 'master-cx-score';
"

# View all master scores
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT tracking_id, filename, quality_score, extracted_at
FROM creativex_scores
WHERE status = 'master-cx-score'
ORDER BY extracted_at DESC
LIMIT 10;
"

Query for History/Audit (All Versions):

# View ALL versions of a file (including superseded)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, quality_score, status, tracking_id, extracted_at
FROM creativex_scores
WHERE filename = 'yourfile.mp4'
ORDER BY extracted_at DESC;
"

# Count total records by status
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT
    COUNT(*) as total_records,
    COUNT(*) FILTER (WHERE status = 'active') as active_derivative_scores,
    COUNT(*) FILTER (WHERE status = 'superseded') as superseded_records,
    COUNT(*) FILTER (WHERE status = 'master-cx-score') as master_scores
FROM creativex_scores;
"

# See score changes over time for a file
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT
    filename,
    quality_score,
    status,
    extracted_at,
    CASE
        WHEN status = 'active' THEN 'CURRENT'
        ELSE 'OLD VERSION'
    END as version_label
FROM creativex_scores
WHERE filename LIKE '%Nutella%'
ORDER BY filename, extracted_at DESC;
"

Viewing Full JSON

PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, jsonb_pretty(full_extraction_data)
FROM creativex_scores
WHERE filename = 'example.pdf';
"

Future Integration: A2→A3 Workflow

How to Use in DAM Upload Scripts

The database method db.get_creativex_score_by_filename(filename) is ready for use in other scripts.

IMPORTANT: The method automatically filters for status = 'active' to always return the latest score.

Example usage in a2_to_a3_upload_polling.py:

# In a2_to_a3_upload_polling.py or similar
filename = "Brand_Country_Language_123456.mp4"

# Lookup CreativeX score (returns ONLY active/latest score)
score_data = db.get_creativex_score_by_filename(filename)

if score_data:
    # Add to DAM metadata
    dam_metadata['FERRERO.FIELD.CREATIVEX_SCORE'] = score_data['quality_score']
    dam_metadata['FERRERO.FIELD.CREATIVEX_URL'] = score_data['creativex_url']
    dam_metadata['FERRERO.FIELD.CREATIVEX_ID'] = score_data['creativex_id']

    # Optional: Access full JSON for additional fields
    # (.get() avoids a KeyError when a field is missing from the extraction)
    full_data = score_data['full_extraction_data'] or {}
    extra = full_data.get('data', {})
    dam_metadata['FERRERO.FIELD.CREATIVEX_BRAND'] = extra.get('brand')
    dam_metadata['FERRERO.FIELD.CREATIVEX_MARKET'] = extra.get('market')

    logger.info("Added CreativeX score {} to DAM metadata".format(
        score_data['quality_score']
    ))
else:
    logger.warning("No CreativeX score found for: {}".format(filename))

Query Logic in get_creativex_score_by_filename()

The method uses this query internally:

SELECT filename, creativex_id, creativex_url, quality_score,
       box_file_id, full_extraction_data, extracted_at
FROM creativex_scores
WHERE filename = %s AND status = 'active'
ORDER BY extracted_at DESC
LIMIT 1

This ensures you always get the latest score, even if multiple versions exist in history.

Behavior Summary for A2→A3 Integration

| Scenario | What Happens |
|----------|--------------|
| Score exists for filename | Returns latest active score |
| Multiple scores exist (history) | Returns only the newest active one |
| No score exists | Returns None |
| File re-scored (same filename) | Old score marked superseded, new score is active |

Key Takeaway: You never need to worry about duplicates or history in A2→A3 workflow. The query automatically handles it.

Troubleshooting

"llama-cloud-services not installed"

source venv/bin/activate
pip install llama-cloud-services

"Agent 'Creativex-Extract' not found"

  • Verify agent name in LlamaCloud portal
  • Check spelling matches exactly: Creativex-Extract
  • Verify API key is correct

"No PDF files found"

  • This is normal if Box folder 350605024645 is empty
  • Upload test PDF to folder and re-run

"Database connection failed"

# Check PostgreSQL is running
docker ps | grep ferrero

# Test connection
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "SELECT 1;"

"Email not sending"

  • Check SMTP configuration in .env
  • Verify Mailgun credentials
  • Check logs for detailed error

Files not deleted from Box

  • This is expected for failed extractions
  • Only successful extractions delete files
  • Failed files remain for manual review/retry

Rollback Instructions

If you need to roll back:

Remove Database Table

PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
DROP TABLE IF EXISTS creativex_scores CASCADE;
"

Remove from Cron

crontab -e
# Delete the CreativeX line, save and exit

Revert Code

cd /opt/ferrero-automation/Python-Version
git revert <commit-hash>
git push origin main

Support

  • Logs: logs/creativex_scoring.log
  • Database Queries: See "Checking Results" section above
  • Email Test: Check SMTP settings and recipients list
  • LlamaCloud Issues: Verify API key and agent configuration

Summary Checklist

Local Setup:

  • Add LLAMA_CLOUD_API_KEY to .env
  • Install llama-cloud-services package
  • Create creativex_scores table
  • Test script runs successfully

Production Deployment:

  • Git pull latest code
  • Add LLAMA_CLOUD_API_KEY to server .env
  • Install dependencies on server
  • Create database table on server
  • Test run on server
  • Verify email notifications
  • (Optional) Add to cron if automating

Post-Deployment:

  • Upload test PDF to Box folder 350605024645
  • Run script and verify extraction
  • Check database record created
  • Verify PDF deleted from Box
  • Confirm email notification received