ferrero-opentext/Python-Version/CREATIVEX_DEPLOYMENT.md
DJP 1264eaf9bc Implement soft delete for CreativeX scores with history preservation
Adds upsert logic that marks old records as 'superseded' while creating
new 'active' records, preserving full history for audit/analysis.

Changes:
- Updated store_creativex_score() to check for existing filename
- Old records marked status='superseded' before inserting new 'active' record
- Returns is_update flag to indicate if this was an update vs new insert
- Logs score changes (e.g., "Score: 80.0 -> 85.0")

Documentation updates:
- Added "Understanding Status Field" section with soft delete explanation
- Separated queries into "Latest Scores" vs "History/Audit" sections
- Added A2→A3 integration guide with example code
- Documented query logic and behavior table for future integration
- Added migration notes for existing data

Query patterns for A2→A3:
- status='active' → Latest/current score (use this in workflows)
- status='superseded' → Previous scores (history/audit trail)
- get_creativex_score_by_filename() automatically filters for active

Benefits:
- Easy lookup of latest scores (just filter status='active')
- Full history preserved for tracking score changes over time
- No data loss when files are re-scored
- Clear audit trail of when scores changed

Tested and verified:
- Existing record (80.0) marked as superseded
- New record (85.0) created as active
- Queries correctly return only active record

2025-11-11 16:40:46 -05:00

CreativeX Score Extraction - Deployment Guide

Overview

This guide covers deploying the CreativeX score extraction system, which:

  1. Monitors Box folder 350605024645 for PDF files
  2. Extracts CreativeX scores using LlamaExtract AI agent "Creativex-Extract"
  3. Stores results in a PostgreSQL database, including the full extraction JSON
  4. Deletes successfully processed files from Box
  5. Sends email notifications

Local Development Setup

1. Add Environment Variable

Add to your .env file:

# Box Folder Configuration (add to existing Box section)
BOX_ROOT_FOLDER_CREATIVEX=350605024645

# CreativeX Configuration
LLAMA_CLOUD_API_KEY=your_llama_cloud_api_key_here
CREATIVEX_AGENT_NAME=Creativex-Extract

2. Install Python Dependencies

cd Python-Version
source venv/bin/activate
pip install llama-cloud-services

Or install all dependencies:

pip install -r requirements.txt

3. Create Database Table

If starting fresh (full init):

PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -f database/init.sql

If database already exists (add table only):

PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
CREATE TABLE IF NOT EXISTS creativex_scores (
    id SERIAL PRIMARY KEY,
    filename VARCHAR(500) NOT NULL,
    box_file_id VARCHAR(255),
    creativex_id VARCHAR(255),
    creativex_url TEXT,
    quality_score VARCHAR(50),
    full_extraction_data JSONB,
    extracted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    status VARCHAR(50) DEFAULT 'active',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_creativex_filename ON creativex_scores(filename);
CREATE INDEX IF NOT EXISTS idx_creativex_box_file ON creativex_scores(box_file_id);
CREATE INDEX IF NOT EXISTS idx_creativex_status ON creativex_scores(status);
"

4. Verify Table Creation

PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "\d creativex_scores"

You should see:

  • 10 columns (id, filename, box_file_id, creativex_id, creativex_url, quality_score, full_extraction_data, extracted_at, status, created_at)
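  • 3 indexes (idx_creativex_filename, idx_creativex_box_file, idx_creativex_status), in addition to the primary key index that PostgreSQL creates automatically (creativex_scores_pkey)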

5. Test Locally

# Run the script manually
python scripts/creativex_scoring_storing.py

Expected behaviors:

  • If no PDFs in Box folder 350605024645: "No PDF files found" email sent
  • If PDFs present: Extraction runs, scores stored, files deleted from Box
  • If extraction fails: Partial success email with errors

Production Server Deployment

Prerequisites

  • Server already running Ferrero automation (A1→A2, A5→A6, etc.)
  • Git repository backed up to Bitbucket
  • SSH access to production server

Step 1: Update .env on Server

SSH to server and add:

cd /opt/ferrero-automation/Python-Version
nano .env

Add:

# Box Folder Configuration (add to existing Box section)
BOX_ROOT_FOLDER_CREATIVEX=350605024645

# CreativeX Configuration
LLAMA_CLOUD_API_KEY=your_production_llama_cloud_api_key
CREATIVEX_AGENT_NAME=Creativex-Extract

Save and exit (Ctrl+X, Y, Enter).

Step 2: Pull Latest Code

cd /opt/ferrero-automation/Python-Version
git pull origin main

This will include:

  • scripts/creativex_scoring_storing.py
  • Updated database/init.sql
  • Updated scripts/shared/database.py
  • Updated scripts/shared/notifier.py
  • Updated config/config.yaml
  • Updated requirements.txt

Step 3: Install Dependencies

cd /opt/ferrero-automation/Python-Version
source venv/bin/activate
pip install llama-cloud-services

Or update all:

pip install -r requirements.txt --upgrade

Step 4: Create Database Table

PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
CREATE TABLE IF NOT EXISTS creativex_scores (
    id SERIAL PRIMARY KEY,
    filename VARCHAR(500) NOT NULL,
    box_file_id VARCHAR(255),
    creativex_id VARCHAR(255),
    creativex_url TEXT,
    quality_score VARCHAR(50),
    full_extraction_data JSONB,
    extracted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    status VARCHAR(50) DEFAULT 'active',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_creativex_filename ON creativex_scores(filename);
CREATE INDEX IF NOT EXISTS idx_creativex_box_file ON creativex_scores(box_file_id);
CREATE INDEX IF NOT EXISTS idx_creativex_status ON creativex_scores(status);
"

Note on Existing Data: If you already have records in the table from testing, they will have status = 'active' by default. This is correct - they are the current versions. When you re-upload the same filename, the system will mark the old record as superseded and create a new active record automatically.

Step 5: Verify Installation

# Check database table
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "SELECT COUNT(*) FROM creativex_scores;"

# Check script exists
ls -lh scripts/creativex_scoring_storing.py

# Make it executable
chmod +x scripts/creativex_scoring_storing.py

Step 6: Test Run

cd /opt/ferrero-automation/Python-Version
source venv/bin/activate
python scripts/creativex_scoring_storing.py

Check logs:

tail -f logs/creativex_scoring.log

Step 7: Add to Cron (Optional - If Automated)

Note: The script is run manually for now, so skip this step initially.

If you want to automate later (e.g., every hour):

crontab -e

Add:

# CreativeX Score Extraction - Every hour
0 * * * * cd /opt/ferrero-automation/Python-Version && venv/bin/python scripts/creativex_scoring_storing.py >> logs/cron_creativex.log 2>&1

Save and exit.

Configuration Details

Environment Variables

All configuration is centralized in .env:

# Box folder for CreativeX PDFs
BOX_ROOT_FOLDER_CREATIVEX=350605024645

# LlamaCloud API credentials
LLAMA_CLOUD_API_KEY=your_api_key_here

# Agent name in LlamaExtract
CREATIVEX_AGENT_NAME=Creativex-Extract
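
As a minimal sketch of how a script might read these three settings, the function below applies the defaults this guide lists for the Box folder and the agent name. The function name, the returned dictionary keys, and the decision to fail when the API key is missing are assumptions for illustration, not the behavior of scripts/creativex_scoring_storing.py. It also assumes the .env values have already been loaded into the process environment.

```python
import os

# Defaults taken from this guide; the names below are illustrative only.
DEFAULT_BOX_FOLDER_ID = "350605024645"
DEFAULT_AGENT_NAME = "Creativex-Extract"


def load_creativex_config(env=None):
    """Return the CreativeX settings from an environment-style mapping."""
    env = os.environ if env is None else env
    api_key = env.get("LLAMA_CLOUD_API_KEY", "").strip()
    if not api_key:
        # A credential has no sensible default, so fail early (assumed policy).
        raise RuntimeError("LLAMA_CLOUD_API_KEY is not set")
    return {
        "box_folder_id": env.get("BOX_ROOT_FOLDER_CREATIVEX", DEFAULT_BOX_FOLDER_ID),
        "agent_name": env.get("CREATIVEX_AGENT_NAME", DEFAULT_AGENT_NAME),
        "api_key": api_key,
    }
```

Passing a plain dictionary instead of relying on os.environ makes the function easy to exercise without touching the real environment.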

Box Folder

  • Folder ID: Configured via BOX_ROOT_FOLDER_CREATIVEX (default: 350605024645)
  • Purpose: Drop PDFs here for CreativeX score extraction
  • Behavior: Files are automatically deleted after successful processing

LlamaExtract Agent

  • Agent Name: Configured via CREATIVEX_AGENT_NAME (default: Creativex-Extract)
  • Expected Fields:
    • filename: Original filename from PDF
    • creativeXId.id: CreativeX identifier
    • creativeXId.url: CreativeX URL
    • ferreroCreativeQuality.percentage: Quality score

Database Storage

  • Table: creativex_scores
  • Quick Access Fields: filename, creativex_id, creativex_url, quality_score
  • Full JSON: Stored in full_extraction_data JSONB column
  • Purpose: Future lookups by filename during DAM uploads
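
The relationship between the expected agent fields and the quick-access columns can be sketched as a small mapping function. This assumes the dotted field names (for example creativeXId.id) denote nested keys in the dictionary handed to the function; the function name is hypothetical, and the real script may wrap or name the data differently.

```python
def flatten_extraction(extracted):
    """Map nested agent output onto the quick-access column names."""
    creativex = extracted.get("creativeXId") or {}
    quality = extracted.get("ferreroCreativeQuality") or {}
    percentage = quality.get("percentage")
    return {
        "filename": extracted.get("filename"),
        "creativex_id": creativex.get("id"),
        "creativex_url": creativex.get("url"),
        # quality_score is a VARCHAR column, so the value is kept as text.
        "quality_score": None if percentage is None else str(percentage),
    }
```

Missing fields come back as None rather than raising, which leaves the decision about incomplete extractions to the caller.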

Email Notifications

Recipients configured in .env:

  • Success: REPORT_EMAILS
  • Errors: ERROR_EMAIL

Templates:

  1. creativex_complete - All files processed successfully
  2. creativex_partial - Some files failed
  3. creativex_no_files - No PDFs found (normal if folder empty)
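
A rough sketch of how a run's outcome could map onto the three template names above. The function name and the rule that any failure (including every file failing) selects creativex_partial are assumptions; the guide only states that the partial template covers runs where some files failed.

```python
def choose_template(total_files, failed_files):
    """Pick the notification template name for a finished run."""
    if total_files < 0 or not 0 <= failed_files <= total_files:
        raise ValueError("file counts are inconsistent")
    if total_files == 0:
        return "creativex_no_files"
    if failed_files > 0:
        # Assumed: any failure, even a total one, uses the partial template.
        return "creativex_partial"
    return "creativex_complete"
```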

Usage

Manual Execution

cd /opt/ferrero-automation/Python-Version
source venv/bin/activate
python scripts/creativex_scoring_storing.py

Workflow

  1. Upload PDFs to Box folder 350605024645
  2. Run script (manual or cron)
  3. Script downloads each PDF
  4. LlamaExtract processes PDF
  5. Results stored in database
  6. PDF deleted from Box (successful extractions only)
  7. Email notification sent
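
The per-file part of this workflow can be sketched as a loop in which a PDF is deleted only after extraction and storage both succeed, so failed files stay in Box for review or retry. The extract, store, and delete callables are hypothetical stand-ins for the LlamaExtract, database, and Box operations; this is not the code in scripts/creativex_scoring_storing.py.

```python
def process_pdfs(pdf_files, extract, store, delete):
    """Process (file_id, filename) pairs and report successes and failures.

    extract(file_id) returns the extraction result,
    store(filename, file_id, result) persists it,
    delete(file_id) removes the source file.
    """
    succeeded, failed = [], []
    for file_id, filename in pdf_files:
        try:
            result = extract(file_id)
            store(filename, file_id, result)
            delete(file_id)
        except Exception as exc:
            # A failure at extract or store means delete never runs,
            # so the file remains in place for a later retry.
            failed.append((filename, str(exc)))
            continue
        succeeded.append(filename)
    return succeeded, failed
```

The two returned lists are enough to decide which notification to send and to list per-file errors in it.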

Checking Results

IMPORTANT: Understanding Status Field

The system uses soft delete to preserve history while keeping latest scores easily accessible:

  • status = 'active' → Latest/current score for this filename
  • status = 'superseded' → Previous score (history/audit trail)

When you re-upload the same filename with a new score, the old record is marked superseded and a new active record is created.
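
The supersede-then-insert behavior described above can be illustrated with a small self-contained sketch. It uses Python's built-in sqlite3 module and a cut-down table purely as a stand-in for PostgreSQL, and the function name store_score is hypothetical; it shows the technique, not the actual store_creativex_score() implementation.

```python
import sqlite3


def store_score(conn, filename, quality_score):
    """Supersede any active row for filename, then insert a new active row.

    Returns True if an active row already existed (an update), else False.
    """
    with conn:  # one transaction: both statements commit or roll back together
        cur = conn.execute(
            "UPDATE creativex_scores SET status = 'superseded' "
            "WHERE filename = ? AND status = 'active'",
            (filename,),
        )
        is_update = cur.rowcount > 0
        conn.execute(
            "INSERT INTO creativex_scores (filename, quality_score, status) "
            "VALUES (?, ?, 'active')",
            (filename, quality_score),
        )
    return is_update


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE creativex_scores ("
    "id INTEGER PRIMARY KEY, filename TEXT NOT NULL, "
    "quality_score TEXT, status TEXT DEFAULT 'active')"
)
first = store_score(conn, "yourfile.mp4", "80.0")   # first score for this file
second = store_score(conn, "yourfile.mp4", "85.0")  # re-score of the same file
rows = conn.execute(
    "SELECT quality_score, status FROM creativex_scores ORDER BY id"
).fetchall()
```

Running both statements in a single transaction avoids a window in which a filename has no active row at all.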

Query for Latest Scores (Most Common):

# View recent ACTIVE extractions (latest scores only)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, creativex_id, quality_score, extracted_at
FROM creativex_scores
WHERE status = 'active'
ORDER BY extracted_at DESC
LIMIT 10;
"

# Count total ACTIVE scores (unique filenames with latest scores)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT COUNT(*) as active_scores FROM creativex_scores WHERE status = 'active';
"

# Get latest score for specific filename (use this in A2→A3 workflow)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, creativex_id, creativex_url, quality_score, extracted_at
FROM creativex_scores
WHERE filename = 'yourfile.mp4' AND status = 'active';
"

Query for History/Audit (All Versions):

# View ALL versions of a file (including superseded)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, quality_score, status, extracted_at
FROM creativex_scores
WHERE filename = 'yourfile.mp4'
ORDER BY extracted_at DESC;
"

# Count total records (including history)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT
    COUNT(*) as total_records,
    COUNT(*) FILTER (WHERE status = 'active') as active_records,
    COUNT(*) FILTER (WHERE status = 'superseded') as superseded_records
FROM creativex_scores;
"

# See score changes over time for a file
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT
    filename,
    quality_score,
    status,
    extracted_at,
    CASE
        WHEN status = 'active' THEN 'CURRENT'
        ELSE 'OLD VERSION'
    END as version_label
FROM creativex_scores
WHERE filename LIKE '%Nutella%'
ORDER BY filename, extracted_at DESC;
"
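
Once the history rows for a file have been fetched, the changes between consecutive versions can be summarized in a few lines. This is a hedged sketch: the function name is hypothetical, and it assumes each row is a (quality_score, extracted_at) pair whose timestamps sort correctly as given (for example ISO-formatted strings or datetime objects).

```python
def score_changes(history):
    """Describe score changes between consecutive versions, oldest first."""
    ordered = sorted(history, key=lambda row: row[1])
    changes = []
    for (old, _), (new, when) in zip(ordered, ordered[1:]):
        if old != new:
            changes.append("{}: {} -> {}".format(when, old, new))
    return changes
```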

Viewing Full JSON

PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, jsonb_pretty(full_extraction_data)
FROM creativex_scores
WHERE filename = 'example.pdf';
"

Future Integration: A2→A3 Workflow

How to Use in DAM Upload Scripts

The database method db.get_creativex_score_by_filename(filename) is ready for use in other scripts.

IMPORTANT: The method automatically filters for status = 'active' to always return the latest score.

Example usage in a2_to_a3_upload_polling.py:

# In a2_to_a3_upload_polling.py or similar
filename = "Brand_Country_Language_123456.mp4"

# Lookup CreativeX score (returns ONLY active/latest score)
score_data = db.get_creativex_score_by_filename(filename)

if score_data:
    # Add to DAM metadata
    dam_metadata['FERRERO.FIELD.CREATIVEX_SCORE'] = score_data['quality_score']
    dam_metadata['FERRERO.FIELD.CREATIVEX_URL'] = score_data['creativex_url']
    dam_metadata['FERRERO.FIELD.CREATIVEX_ID'] = score_data['creativex_id']

    # Optional: Access full JSON for additional fields (these keys may be absent)
    full_data = score_data['full_extraction_data'] or {}
    extra = full_data.get('data', {})
    if 'brand' in extra:
        dam_metadata['FERRERO.FIELD.CREATIVEX_BRAND'] = extra['brand']
    if 'market' in extra:
        dam_metadata['FERRERO.FIELD.CREATIVEX_MARKET'] = extra['market']

    logger.info("Added CreativeX score {} to DAM metadata".format(
        score_data['quality_score']
    ))
else:
    logger.warning("No CreativeX score found for: {}".format(filename))

Query Logic in get_creativex_score_by_filename()

The method uses this query internally:

SELECT filename, creativex_id, creativex_url, quality_score,
       box_file_id, full_extraction_data, extracted_at
FROM creativex_scores
WHERE filename = %s AND status = 'active'
ORDER BY extracted_at DESC
LIMIT 1

This ensures you always get the latest score, even if multiple versions exist in history.
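
To illustrate the lookup behavior, here is a small sketch that runs the same filter against an in-memory table. It uses sqlite3 as a stand-in for PostgreSQL, a reduced set of columns, made-up sample rows, and a hypothetical function name; it is not the code of get_creativex_score_by_filename().

```python
import sqlite3


def get_active_score(conn, filename):
    """Return the active score row for filename as a dict, or None."""
    row = conn.execute(
        "SELECT filename, quality_score, extracted_at "
        "FROM creativex_scores "
        "WHERE filename = ? AND status = 'active' "
        "ORDER BY extracted_at DESC LIMIT 1",
        (filename,),
    ).fetchone()
    if row is None:
        return None
    return {"filename": row[0], "quality_score": row[1], "extracted_at": row[2]}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE creativex_scores ("
    "filename TEXT, quality_score TEXT, status TEXT, extracted_at TEXT)"
)
# Sample data for illustration only.
conn.executemany(
    "INSERT INTO creativex_scores VALUES (?, ?, ?, ?)",
    [
        ("yourfile.mp4", "80.0", "superseded", "2025-11-10 09:00:00"),
        ("yourfile.mp4", "85.0", "active", "2025-11-11 09:00:00"),
    ],
)
current = get_active_score(conn, "yourfile.mp4")
missing = get_active_score(conn, "unknown.mp4")
```

The superseded row is ignored by the status filter, and a filename with no rows yields None, so callers only need a single truthiness check.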

Behavior Summary for A2→A3 Integration

| Scenario | What Happens |
|---|---|
| Score exists for filename | Returns latest active score |
| Multiple scores exist (history) | Returns only the newest active one |
| No score exists | Returns None |
| File re-scored (same filename) | Old score marked superseded, new score is active |

Key Takeaway: You never need to handle duplicates or history yourself in the A2→A3 workflow. The query filters for the active record automatically.

Troubleshooting

"llama-cloud-services not installed"

source venv/bin/activate
pip install llama-cloud-services

"Agent 'Creativex-Extract' not found"

  • Verify agent name in LlamaCloud portal
  • Check spelling matches exactly: Creativex-Extract
  • Verify API key is correct

"No PDF files found"

  • This is normal if Box folder 350605024645 is empty
  • Upload test PDF to folder and re-run

"Database connection failed"

# Check PostgreSQL is running
docker ps | grep ferrero

# Test connection
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "SELECT 1;"

"Email not sending"

  • Check SMTP configuration in .env
  • Verify Mailgun credentials
  • Check logs for detailed error

Files not deleted from Box

  • This is expected for failed extractions
  • Only successful extractions delete files
  • Failed files remain for manual review/retry

Rollback Instructions

If you need to roll back:

Remove Database Table

PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
DROP TABLE IF EXISTS creativex_scores CASCADE;
"

Remove from Cron

crontab -e
# Delete the CreativeX line, save and exit

Revert Code

cd /opt/ferrero-automation/Python-Version
git revert <commit-hash>
git push origin main

Support

  • Logs: logs/creativex_scoring.log
  • Database Queries: See "Checking Results" section above
  • Email Test: Check SMTP settings and recipients list
  • LlamaCloud Issues: Verify API key and agent configuration

Summary Checklist

Local Setup:

  • Add BOX_ROOT_FOLDER_CREATIVEX, LLAMA_CLOUD_API_KEY, and CREATIVEX_AGENT_NAME to .env
  • Install llama-cloud-services package
  • Create creativex_scores table
  • Test script runs successfully

Production Deployment:

  • Git pull latest code
  • Add BOX_ROOT_FOLDER_CREATIVEX, LLAMA_CLOUD_API_KEY, and CREATIVEX_AGENT_NAME to server .env
  • Install dependencies on server
  • Create database table on server
  • Test run on server
  • Verify email notifications
  • (Optional) Add to cron if automating

Post-Deployment:

  • Upload test PDF to Box folder 350605024645
  • Run script and verify extraction
  • Check database record created
  • Verify PDF deleted from Box
  • Confirm email notification received