ferrero-opentext/Python-Version/CREATIVEX_DEPLOYMENT.md
DJP 1264eaf9bc Implement soft delete for CreativeX scores with history preservation
Adds upsert logic that marks old records as 'superseded' while creating
new 'active' records, preserving full history for audit/analysis.

Changes:
- Updated store_creativex_score() to check for existing filename
- Old records marked status='superseded' before inserting new 'active' record
- Returns is_update flag to indicate if this was an update vs new insert
- Logs score changes (e.g., "Score: 80.0 -> 85.0")

Documentation updates:
- Added "Understanding Status Field" section with soft delete explanation
- Separated queries into "Latest Scores" vs "History/Audit" sections
- Added A2→A3 integration guide with example code
- Documented query logic and behavior table for future integration
- Added migration notes for existing data

Query patterns for A2→A3:
- status='active' → Latest/current score (use this in workflows)
- status='superseded' → Previous scores (history/audit trail)
- get_creativex_score_by_filename() automatically filters for active

Benefits:
- Easy lookup of latest scores (just filter status='active')
- Full history preserved for tracking score changes over time
- No data loss when files are re-scored
- Clear audit trail of when scores changed

Tested and verified:
- Existing record (80.0) marked as superseded
- New record (85.0) created as active
- Queries correctly return only active record

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-11 16:40:46 -05:00


# CreativeX Score Extraction - Deployment Guide
## Overview
This guide covers deploying the CreativeX score extraction system, which:
1. Monitors Box folder 350605024645 for PDF files
2. Extracts CreativeX scores using LlamaExtract AI agent "Creativex-Extract"
3. Stores results in PostgreSQL database with full JSON
4. Removes processed files from Box
5. Sends email notifications
## Local Development Setup
### 1. Add Environment Variable
Add to your `.env` file:
```bash
# Box Folder Configuration (add to existing Box section)
BOX_ROOT_FOLDER_CREATIVEX=350605024645
# CreativeX Configuration
LLAMA_CLOUD_API_KEY=your_llama_cloud_api_key_here
CREATIVEX_AGENT_NAME=Creativex-Extract
```
### 2. Install Python Dependencies
```bash
cd Python-Version
source venv/bin/activate
pip install llama-cloud-services
```
Or install all dependencies:
```bash
pip install -r requirements.txt
```
### 3. Create Database Table
**If starting fresh (full init):**
```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -f database/init.sql
```
**If database already exists (add table only):**
```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
CREATE TABLE IF NOT EXISTS creativex_scores (
    id SERIAL PRIMARY KEY,
    filename VARCHAR(500) NOT NULL,
    box_file_id VARCHAR(255),
    creativex_id VARCHAR(255),
    creativex_url TEXT,
    quality_score VARCHAR(50),
    full_extraction_data JSONB,
    extracted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    status VARCHAR(50) DEFAULT 'active',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_creativex_filename ON creativex_scores(filename);
CREATE INDEX IF NOT EXISTS idx_creativex_box_file ON creativex_scores(box_file_id);
CREATE INDEX IF NOT EXISTS idx_creativex_status ON creativex_scores(status);
"
```
### 4. Verify Table Creation
```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "\d creativex_scores"
```
You should see:
- 10 columns (id, filename, box_file_id, creativex_id, creativex_url, quality_score, full_extraction_data, extracted_at, status, created_at)
- 3 secondary indexes (idx_creativex_filename, idx_creativex_box_file, idx_creativex_status), plus the primary-key index that PostgreSQL creates automatically
### 5. Test Locally
```bash
# Run the script manually
python scripts/creativex_scoring_storing.py
```
**Expected behaviors:**
- If no PDFs in Box folder 350605024645: "No PDF files found" email sent
- If PDFs present: Extraction runs, scores stored, files deleted from Box
- If extraction fails: Partial success email with errors
## Production Server Deployment
### Prerequisites
- Server already running Ferrero automation (A1→A2, A5→A6, etc.)
- Git repository backed up to Bitbucket
- SSH access to production server
### Step 1: Update .env on Server
SSH to the server and open `.env` for editing:
```bash
cd /opt/ferrero-automation/Python-Version
nano .env
```
Add:
```bash
# Box Folder Configuration (add to existing Box section)
BOX_ROOT_FOLDER_CREATIVEX=350605024645
# CreativeX Configuration
LLAMA_CLOUD_API_KEY=your_production_llama_cloud_api_key
CREATIVEX_AGENT_NAME=Creativex-Extract
```
Save and exit (Ctrl+X, Y, Enter).
### Step 2: Pull Latest Code
```bash
cd /opt/ferrero-automation/Python-Version
git pull origin main
```
This will include:
- `scripts/creativex_scoring_storing.py`
- Updated `database/init.sql`
- Updated `scripts/shared/database.py`
- Updated `scripts/shared/notifier.py`
- Updated `config/config.yaml`
- Updated `requirements.txt`
### Step 3: Install Dependencies
```bash
cd /opt/ferrero-automation/Python-Version
source venv/bin/activate
pip install llama-cloud-services
```
Or update all:
```bash
pip install -r requirements.txt --upgrade
```
### Step 4: Create Database Table
```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
CREATE TABLE IF NOT EXISTS creativex_scores (
    id SERIAL PRIMARY KEY,
    filename VARCHAR(500) NOT NULL,
    box_file_id VARCHAR(255),
    creativex_id VARCHAR(255),
    creativex_url TEXT,
    quality_score VARCHAR(50),
    full_extraction_data JSONB,
    extracted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    status VARCHAR(50) DEFAULT 'active',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_creativex_filename ON creativex_scores(filename);
CREATE INDEX IF NOT EXISTS idx_creativex_box_file ON creativex_scores(box_file_id);
CREATE INDEX IF NOT EXISTS idx_creativex_status ON creativex_scores(status);
"
```
**Note on Existing Data:** If you already have records in the table from testing, they will have `status = 'active'` by default. This is correct - they are the current versions. When you re-upload the same filename, the system will mark the old record as `superseded` and create a new `active` record automatically.
### Step 5: Verify Installation
```bash
# Check database table
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "SELECT COUNT(*) FROM creativex_scores;"
# Check script exists
ls -lh scripts/creativex_scoring_storing.py
# Make sure it's executable
chmod +x scripts/creativex_scoring_storing.py
```
### Step 6: Test Run
```bash
cd /opt/ferrero-automation/Python-Version
source venv/bin/activate
python scripts/creativex_scoring_storing.py
```
Check logs:
```bash
tail -f logs/creativex_scoring.log
```
### Step 7: Add to Cron (Optional - If Automated)
**Note:** The script is currently run manually, so skip this step for now.
If you want to automate later (e.g., every hour):
```bash
crontab -e
```
Add:
```cron
# CreativeX Score Extraction - Every hour
0 * * * * cd /opt/ferrero-automation/Python-Version && venv/bin/python scripts/creativex_scoring_storing.py >> logs/cron_creativex.log 2>&1
```
Save and exit.
## Configuration Details
### Environment Variables
All configuration is centralized in `.env`:
```bash
# Box folder for CreativeX PDFs
BOX_ROOT_FOLDER_CREATIVEX=350605024645
# LlamaCloud API credentials
LLAMA_CLOUD_API_KEY=your_api_key_here
# Agent name in LlamaExtract
CREATIVEX_AGENT_NAME=Creativex-Extract
```
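A minimal sketch of how a script might read these settings, assuming they are already present in the process environment (for example, loaded from `.env` beforehand). `CreativeXConfig` and `load_creativex_config` are hypothetical names used for illustration, not functions from the repository. The defaults mirror the values listed in this guide, and failing when the API key is missing is an assumed policy.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CreativeXConfig:
    box_folder_id: str
    llama_api_key: str
    agent_name: str


def load_creativex_config(env=None):
    """Build the config from environment variables (hypothetical helper).

    Raises RuntimeError if the API key is missing, since extraction
    cannot run without it.
    """
    env = os.environ if env is None else env
    api_key = env.get("LLAMA_CLOUD_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("LLAMA_CLOUD_API_KEY is not set")
    return CreativeXConfig(
        box_folder_id=env.get("BOX_ROOT_FOLDER_CREATIVEX", "350605024645"),
        llama_api_key=api_key,
        agent_name=env.get("CREATIVEX_AGENT_NAME", "Creativex-Extract"),
    )
```

Passing a plain dict as `env` makes the helper easy to exercise in tests without touching the real environment.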
### Box Folder
- **Folder ID:** Configured via `BOX_ROOT_FOLDER_CREATIVEX` (default: 350605024645)
- **Purpose:** Drop PDFs here for CreativeX score extraction
- **Behavior:** Files are automatically deleted after successful processing
### LlamaExtract Agent
- **Agent Name:** Configured via `CREATIVEX_AGENT_NAME` (default: Creativex-Extract)
- **Expected Fields:**
  - `filename`: Original filename from PDF
  - `creativeXId.id`: CreativeX identifier
  - `creativeXId.url`: CreativeX URL
  - `ferreroCreativeQuality.percentage`: Quality score
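
The nested field paths above map onto the flat quick-access columns roughly as follows. This is a sketch only: `flatten_extraction` is a hypothetical helper, and it assumes the extraction result has already been converted to a plain dict whose keys match the paths listed above.

```python
def flatten_extraction(data):
    """Pull the quick-access values out of a nested extraction dict.

    Missing keys yield None instead of raising, so a partial extraction
    can still be stored alongside the full JSON.
    """
    creativex = data.get("creativeXId") or {}
    quality = data.get("ferreroCreativeQuality") or {}
    score = quality.get("percentage")
    return {
        "filename": data.get("filename"),
        "creativex_id": creativex.get("id"),
        "creativex_url": creativex.get("url"),
        # quality_score is a VARCHAR column, so keep the value as text
        "quality_score": None if score is None else str(score),
    }
```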
### Database Storage
- **Table:** `creativex_scores`
- **Quick Access Fields:** filename, creativex_id, creativex_url, quality_score
- **Full JSON:** Stored in `full_extraction_data` JSONB column
- **Purpose:** Future lookups by filename during DAM uploads
### Email Notifications
**Recipients configured in .env:**
- Success: `REPORT_EMAILS`
- Errors: `ERROR_EMAIL`
**Templates:**
1. `creativex_complete` - All files processed successfully
2. `creativex_partial` - Some files failed
3. `creativex_no_files` - No PDFs found (normal if folder empty)
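
One way the template choice could be expressed, as a sketch. `select_email_template` is a hypothetical function, and treating a run in which every file failed as `creativex_partial` is an assumption based on the template descriptions above.

```python
def select_email_template(succeeded, failed):
    """Return the notification template name for a run's outcome.

    `succeeded` and `failed` are counts of processed files.
    """
    if succeeded == 0 and failed == 0:
        return "creativex_no_files"
    if failed > 0:
        return "creativex_partial"
    return "creativex_complete"
```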
## Usage
### Manual Execution
```bash
cd /opt/ferrero-automation/Python-Version
source venv/bin/activate
python scripts/creativex_scoring_storing.py
```
### Workflow
1. Upload PDFs to Box folder 350605024645
2. Run script (manual or cron)
3. Script downloads each PDF
4. LlamaExtract processes PDF
5. Results stored in database
6. PDF deleted from Box
7. Email notification sent
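
The steps above can be sketched as a single loop. This is an illustration under stated assumptions, not the code in `scripts/creativex_scoring_storing.py`: `run_once` and all six callables are hypothetical stand-ins for the Box, LlamaExtract, database, and email operations.

```python
def run_once(list_pdfs, download, extract, store, delete, notify):
    """Process every PDF once; each argument is a caller-supplied callable."""
    succeeded, failed = [], []
    for pdf in list_pdfs():
        try:
            content = download(pdf)
            result = extract(content)
            store(pdf, result)
            # Delete only after the score is safely stored
            delete(pdf)
        except Exception as exc:
            # The file stays in Box so it can be reviewed and retried
            failed.append((pdf, str(exc)))
        else:
            succeeded.append(pdf)
    notify(succeeded, failed)
    return succeeded, failed
```

Keeping the delete inside the `try` block means a file is removed from Box only when every earlier step for that file succeeded.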
### Checking Results
**IMPORTANT: Understanding Status Field**
The system uses **soft delete** to preserve history while keeping latest scores easily accessible:
- `status = 'active'` → Latest/current score for this filename
- `status = 'superseded'` → Previous score (history/audit trail)
When you re-upload the same filename with a new score, the old record is marked `superseded` and a new `active` record is created.
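
A minimal, self-contained sketch of this supersede-then-insert pattern. It uses an in-memory SQLite database as a stand-in so it can run anywhere. The production table lives in PostgreSQL, where the placeholder style and driver differ. `store_score` is a hypothetical simplification, not the repository's `store_creativex_score()`, and the table here contains only the columns needed for the demonstration.

```python
import sqlite3


def store_score(conn, filename, quality_score):
    """Supersede any active row for `filename`, then insert a new active row.

    Returns True if an earlier active row existed (an update), else False.
    """
    with conn:  # commit both statements together, or roll back on error
        cur = conn.execute(
            "UPDATE creativex_scores SET status = 'superseded' "
            "WHERE filename = ? AND status = 'active'",
            (filename,),
        )
        is_update = cur.rowcount > 0
        conn.execute(
            "INSERT INTO creativex_scores (filename, quality_score, status) "
            "VALUES (?, ?, 'active')",
            (filename, quality_score),
        )
    return is_update


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE creativex_scores ("
    "id INTEGER PRIMARY KEY, filename TEXT NOT NULL, "
    "quality_score TEXT, status TEXT DEFAULT 'active')"
)
first = store_score(conn, "yourfile.mp4", "80.0")   # new insert
second = store_score(conn, "yourfile.mp4", "85.0")  # re-score of the same file
history = conn.execute(
    "SELECT quality_score, status FROM creativex_scores ORDER BY id"
).fetchall()
```

Running the update and the insert in one transaction avoids a window in which a filename has no `active` row.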
**Query for Latest Scores (Most Common):**
```bash
# View recent ACTIVE extractions (latest scores only)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, creativex_id, quality_score, extracted_at
FROM creativex_scores
WHERE status = 'active'
ORDER BY extracted_at DESC
LIMIT 10;
"
# Count total ACTIVE scores (unique filenames with latest scores)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT COUNT(*) as active_scores FROM creativex_scores WHERE status = 'active';
"
# Get latest score for specific filename (use this in A2→A3 workflow)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, creativex_id, creativex_url, quality_score, extracted_at
FROM creativex_scores
WHERE filename = 'yourfile.mp4' AND status = 'active';
"
```
**Query for History/Audit (All Versions):**
```bash
# View ALL versions of a file (including superseded)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, quality_score, status, extracted_at
FROM creativex_scores
WHERE filename = 'yourfile.mp4'
ORDER BY extracted_at DESC;
"
# Count total records (including history)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT
    COUNT(*) as total_records,
    COUNT(*) FILTER (WHERE status = 'active') as active_records,
    COUNT(*) FILTER (WHERE status = 'superseded') as superseded_records
FROM creativex_scores;
"
# See score changes over time for a file
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT
    filename,
    quality_score,
    status,
    extracted_at,
    CASE
        WHEN status = 'active' THEN 'CURRENT'
        ELSE 'OLD VERSION'
    END as version_label
FROM creativex_scores
WHERE filename LIKE '%Nutella%'
ORDER BY filename, extracted_at DESC;
"
```
### Viewing Full JSON
```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, jsonb_pretty(full_extraction_data)
FROM creativex_scores
WHERE filename = 'example.pdf';
"
```
## Future Integration: A2→A3 Workflow
### How to Use in DAM Upload Scripts
The database method `db.get_creativex_score_by_filename(filename)` is ready for use in other scripts.
**IMPORTANT:** The method automatically filters for `status = 'active'` to always return the **latest** score.
**Example usage in a2_to_a3_upload_polling.py:**
```python
# In a2_to_a3_upload_polling.py or similar.
# `db`, `dam_metadata`, and `logger` are assumed to exist in the calling script.
filename = "Brand_Country_Language_123456.mp4"

# Look up the CreativeX score (returns ONLY the active/latest score)
score_data = db.get_creativex_score_by_filename(filename)
if score_data:
    # Add to DAM metadata
    dam_metadata['FERRERO.FIELD.CREATIVEX_SCORE'] = score_data['quality_score']
    dam_metadata['FERRERO.FIELD.CREATIVEX_URL'] = score_data['creativex_url']
    dam_metadata['FERRERO.FIELD.CREATIVEX_ID'] = score_data['creativex_id']

    # Optional: access the full JSON for additional fields
    full_data = score_data['full_extraction_data']
    dam_metadata['FERRERO.FIELD.CREATIVEX_BRAND'] = full_data['data']['brand']
    dam_metadata['FERRERO.FIELD.CREATIVEX_MARKET'] = full_data['data']['market']

    logger.info("Added CreativeX score {} to DAM metadata".format(
        score_data['quality_score']
    ))
else:
    logger.warning("No CreativeX score found for: {}".format(filename))
```
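If some extractions lack the optional fields, direct indexing such as `full_data['data']['brand']` would raise `KeyError`. A more defensive variant might look like the sketch below. `build_creativex_metadata` is a hypothetical helper, and it assumes the score row is returned as a dict with `full_extraction_data` already decoded from JSONB into a Python dict.

```python
def build_creativex_metadata(score_data):
    """Map a score row (dict) to DAM metadata fields, skipping missing values."""
    if not score_data:
        return {}
    full = (score_data.get("full_extraction_data") or {}).get("data") or {}
    candidates = {
        "FERRERO.FIELD.CREATIVEX_SCORE": score_data.get("quality_score"),
        "FERRERO.FIELD.CREATIVEX_URL": score_data.get("creativex_url"),
        "FERRERO.FIELD.CREATIVEX_ID": score_data.get("creativex_id"),
        "FERRERO.FIELD.CREATIVEX_BRAND": full.get("brand"),
        "FERRERO.FIELD.CREATIVEX_MARKET": full.get("market"),
    }
    # Drop fields with no value so the DAM payload never carries None
    return {key: value for key, value in candidates.items() if value is not None}
```

The result can then be merged into the existing metadata with `dam_metadata.update(build_creativex_metadata(score_data))`.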
### Query Logic in get_creativex_score_by_filename()
The method uses this query internally:
```sql
SELECT filename, creativex_id, creativex_url, quality_score,
box_file_id, full_extraction_data, extracted_at
FROM creativex_scores
WHERE filename = %s AND status = 'active'
ORDER BY extracted_at DESC
LIMIT 1
```
This ensures you **always get the latest score**, even if multiple versions exist in history.
### Behavior Summary for A2→A3 Integration
| Scenario | What Happens |
|----------|--------------|
| Score exists for filename | Returns latest `active` score |
| Multiple scores exist (history) | Returns only the newest `active` one |
| No score exists | Returns `None` |
| File re-scored (same filename) | Old score marked `superseded`, new score is `active` |
**Key Takeaway:** You never need to handle duplicates or history in the A2→A3 workflow. The query filters for the `active` record automatically.
## Troubleshooting
### "llama-cloud-services not installed"
```bash
source venv/bin/activate
pip install llama-cloud-services
```
### "Agent 'Creativex-Extract' not found"
- Verify agent name in LlamaCloud portal
- Check spelling matches exactly: `Creativex-Extract`
- Verify API key is correct
### "No PDF files found"
- This is normal if Box folder 350605024645 is empty
- Upload test PDF to folder and re-run
### "Database connection failed"
```bash
# Check PostgreSQL is running
docker ps | grep ferrero
# Test connection
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "SELECT 1;"
```
### "Email not sending"
- Check SMTP configuration in .env
- Verify Mailgun credentials
- Check logs for detailed error
### Files not deleted from Box
- This is expected for failed extractions
- Only successful extractions delete files
- Failed files remain for manual review/retry
## Rollback Instructions
If you need to rollback:
### Remove Database Table
```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
DROP TABLE IF EXISTS creativex_scores CASCADE;
"
```
### Remove from Cron
```bash
crontab -e
# Delete the CreativeX line, save and exit
```
### Revert Code
```bash
cd /opt/ferrero-automation/Python-Version
git revert <commit-hash>
git push origin main
```
## Support
- **Logs:** `logs/creativex_scoring.log`
- **Database Queries:** See "Checking Results" section above
- **Email Test:** Check SMTP settings and recipients list
- **LlamaCloud Issues:** Verify API key and agent configuration
## Summary Checklist
**Local Setup:**
- [ ] Add `LLAMA_CLOUD_API_KEY` to .env
- [ ] Install `llama-cloud-services` package
- [ ] Create `creativex_scores` table
- [ ] Test script runs successfully
**Production Deployment:**
- [ ] Git pull latest code
- [ ] Add `LLAMA_CLOUD_API_KEY` to server .env
- [ ] Install dependencies on server
- [ ] Create database table on server
- [ ] Test run on server
- [ ] Verify email notifications
- [ ] (Optional) Add to cron if automating
**Post-Deployment:**
- [ ] Upload test PDF to Box folder 350605024645
- [ ] Run script and verify extraction
- [ ] Check database record created
- [ ] Verify PDF deleted from Box
- [ ] Confirm email notification received