# CreativeX Score Extraction - Deployment Guide
## Overview
This guide covers deploying the CreativeX score extraction system, which:
1. Monitors Box folder 350605024645 for PDF files
2. Extracts CreativeX scores using LlamaExtract AI agent "Creativex-Extract"
3. Stores results in PostgreSQL database with full JSON
4. Removes processed files from Box
5. Sends email notifications
## Local Development Setup
### 1. Add Environment Variable
Add to your `.env` file:
```bash
# Box Folder Configuration (add to existing Box section)
BOX_ROOT_FOLDER_CREATIVEX=350605024645
# CreativeX Configuration
LLAMA_CLOUD_API_KEY=your_llama_cloud_api_key_here
CREATIVEX_AGENT_NAME=Creativex-Extract
```
### 2. Install Python Dependencies
```bash
cd Python-Version
source venv/bin/activate
pip install llama-cloud-services
```
Or install all dependencies:
```bash
pip install -r requirements.txt
```
### 3. Create Database Table
**If starting fresh (full init):**
```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -f database/init.sql
```
**If database already exists (add table only):**
```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
CREATE TABLE IF NOT EXISTS creativex_scores (
id SERIAL PRIMARY KEY,
filename VARCHAR(500) NOT NULL,
box_file_id VARCHAR(255),
creativex_id VARCHAR(255),
creativex_url TEXT,
quality_score VARCHAR(50),
full_extraction_data JSONB,
tracking_id VARCHAR(6),
extracted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
status VARCHAR(50) DEFAULT 'active',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_creativex_filename ON creativex_scores(filename);
CREATE INDEX IF NOT EXISTS idx_creativex_box_file ON creativex_scores(box_file_id);
CREATE INDEX IF NOT EXISTS idx_creativex_status ON creativex_scores(status);
CREATE INDEX IF NOT EXISTS idx_creativex_tracking_id ON creativex_scores(tracking_id);
"
```
### 4. Verify Table Creation
```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "\d creativex_scores"
```
You should see:
- 11 columns (id, filename, box_file_id, creativex_id, creativex_url, quality_score, full_extraction_data, tracking_id, extracted_at, status, created_at)
- 4 named indexes (idx_creativex_filename, idx_creativex_box_file, idx_creativex_status, idx_creativex_tracking_id), in addition to the primary key index
### 5. Test Locally
```bash
# Run the script manually
python scripts/creativex_scoring_storing.py
```
**Expected behaviors:**
- If Box folder 350605024645 contains no PDFs: a "No PDF files found" email is sent
- If PDFs are present: extraction runs, scores are stored, and the files are deleted from Box
- If some extractions fail: a partial-success email is sent listing the errors
## Production Server Deployment
### Prerequisites
- Server already running Ferrero automation (A1→A2, A5→A6, etc.)
- Git repository backed up to Bitbucket
- SSH access to production server
### Step 1: Update .env on Server
SSH to the server and open `.env`:
```bash
cd /opt/ferrero-automation/Python-Version
nano .env
```
Add:
```bash
# Box Folder Configuration (add to existing Box section)
BOX_ROOT_FOLDER_CREATIVEX=350605024645
# CreativeX Configuration
LLAMA_CLOUD_API_KEY=your_production_llama_cloud_api_key
CREATIVEX_AGENT_NAME=Creativex-Extract
```
Save and exit (Ctrl+X, Y, Enter).
### Step 2: Pull Latest Code
```bash
cd /opt/ferrero-automation/Python-Version
git pull origin main
```
This will include:
- `scripts/creativex_scoring_storing.py`
- Updated `database/init.sql`
- Updated `scripts/shared/database.py`
- Updated `scripts/shared/notifier.py`
- Updated `config/config.yaml`
- Updated `requirements.txt`
### Step 3: Install Dependencies
```bash
cd /opt/ferrero-automation/Python-Version
source venv/bin/activate
pip install llama-cloud-services
```
Or update all:
```bash
pip install -r requirements.txt --upgrade
```
### Step 4: Create Database Table
```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
CREATE TABLE IF NOT EXISTS creativex_scores (
id SERIAL PRIMARY KEY,
filename VARCHAR(500) NOT NULL,
box_file_id VARCHAR(255),
creativex_id VARCHAR(255),
creativex_url TEXT,
quality_score VARCHAR(50),
full_extraction_data JSONB,
tracking_id VARCHAR(6),
extracted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
status VARCHAR(50) DEFAULT 'active',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_creativex_filename ON creativex_scores(filename);
CREATE INDEX IF NOT EXISTS idx_creativex_box_file ON creativex_scores(box_file_id);
CREATE INDEX IF NOT EXISTS idx_creativex_status ON creativex_scores(status);
CREATE INDEX IF NOT EXISTS idx_creativex_tracking_id ON creativex_scores(tracking_id);
"
```
**If Table Already Exists (Migration):**
```bash
# Add tracking_id column to existing table
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
ALTER TABLE creativex_scores ADD COLUMN IF NOT EXISTS tracking_id VARCHAR(6);
CREATE INDEX IF NOT EXISTS idx_creativex_tracking_id ON creativex_scores(tracking_id);
"
```
**Note on Status Values:**
- `active` - Current derivative score (from PDF extraction)
- `superseded` - Old derivative score (version history)
- `master-cx-score` - Master asset score (from A1→A2 DAM metadata, reference only)
### Step 5: Verify Installation
```bash
# Check database table
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "SELECT COUNT(*) FROM creativex_scores;"
# Check script exists
ls -lh scripts/creativex_scoring_storing.py
# Make sure it's executable
chmod +x scripts/creativex_scoring_storing.py
```
### Step 6: Test Run
```bash
cd /opt/ferrero-automation/Python-Version
source venv/bin/activate
python scripts/creativex_scoring_storing.py
```
Check logs:
```bash
tail -f logs/creativex_scoring.log
```
### Step 7: Add to Cron (Optional - If Automated)
**Note:** The script is currently run manually, so skip this step for the initial deployment.
To automate it later (e.g., every hour):
```bash
crontab -e
```
Add:
```cron
# CreativeX Score Extraction - Every hour
0 * * * * cd /opt/ferrero-automation/Python-Version && venv/bin/python scripts/creativex_scoring_storing.py >> logs/cron_creativex.log 2>&1
```
Save and exit.
## Configuration Details
### Environment Variables
All configuration is centralized in `.env`:
```bash
# Box folder for CreativeX PDFs
BOX_ROOT_FOLDER_CREATIVEX=350605024645
# LlamaCloud API credentials
LLAMA_CLOUD_API_KEY=your_api_key_here
# Agent name in LlamaExtract
CREATIVEX_AGENT_NAME=Creativex-Extract
```
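For orientation, a script might read these settings roughly as follows. This is a minimal sketch using only the standard library; the `load_creativex_config` helper is hypothetical, and falling back to the documented defaults is an assumption rather than a description of the project's actual loader.

```python
import os


def load_creativex_config(env=None):
    """Read CreativeX settings from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    api_key = env.get("LLAMA_CLOUD_API_KEY")
    if not api_key:
        # There is no sensible default for a credential, so fail early.
        raise RuntimeError("LLAMA_CLOUD_API_KEY is not set")
    return {
        # Defaults mirror the values documented in this guide.
        "box_folder_id": env.get("BOX_ROOT_FOLDER_CREATIVEX", "350605024645"),
        "api_key": api_key,
        "agent_name": env.get("CREATIVEX_AGENT_NAME", "Creativex-Extract"),
    }
```

Passing a plain dict as `env` makes the helper easy to exercise without touching the real environment.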
### Box Folder
- **Folder ID:** Configured via `BOX_ROOT_FOLDER_CREATIVEX` (default: 350605024645)
- **Purpose:** Drop PDFs here for CreativeX score extraction
- **Behavior:** Files are automatically deleted after successful processing
### LlamaExtract Agent
- **Agent Name:** Configured via `CREATIVEX_AGENT_NAME` (default: Creativex-Extract)
- **Expected Fields:**
  - `filename`: Original filename from PDF
  - `creativeXId.id`: CreativeX identifier
  - `creativeXId.url`: CreativeX URL
  - `ferreroCreativeQuality.percentage`: Quality score
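The field paths above could be mapped onto the quick-access database columns along these lines. This is a sketch only: it assumes the extraction result arrives as a nested dict shaped like those paths, and `flatten_extraction` is a hypothetical name, not a function from the project.

```python
def flatten_extraction(data):
    """Map a nested extraction result onto the quick-access columns."""
    cx_id = data.get("creativeXId") or {}
    quality = data.get("ferreroCreativeQuality") or {}
    percentage = quality.get("percentage")
    return {
        "filename": data.get("filename"),
        "creativex_id": cx_id.get("id"),
        "creativex_url": cx_id.get("url"),
        # quality_score is a VARCHAR column, so keep the value as text.
        "quality_score": None if percentage is None else str(percentage),
    }
```

Missing fields come back as `None` instead of raising, which leaves the decision about incomplete extractions to the caller.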
### Database Storage
- **Table:** `creativex_scores`
- **Quick Access Fields:** filename, creativex_id, creativex_url, quality_score
- **Full JSON:** Stored in `full_extraction_data` JSONB column
- **Purpose:** Future lookups by filename during DAM uploads
### Email Notifications
**Recipients configured in .env:**
- Success: `REPORT_EMAILS`
- Errors: `ERROR_EMAIL`
**Templates:**
1. `creativex_complete` - All files processed successfully
2. `creativex_partial` - Some files failed
3. `creativex_no_files` - No PDFs found (normal if folder empty)
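One plausible way to pick between the three templates is to count successes and failures for a run. The sketch below is illustrative; `choose_template` is a hypothetical helper, and treating a run where every file failed as `creativex_partial` is an assumption.

```python
def choose_template(succeeded, failed):
    """Return the notification template name for a run's result counts."""
    if succeeded + failed == 0:
        return "creativex_no_files"
    if failed:
        # Assumption: any failure, including all files failing, is "partial".
        return "creativex_partial"
    return "creativex_complete"
```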
## Usage
### Manual Execution
```bash
cd /opt/ferrero-automation/Python-Version
source venv/bin/activate
python scripts/creativex_scoring_storing.py
```
### Workflow
#### CreativeX PDF Extraction (Manual)
1. Upload PDFs to Box folder 350605024645
2. Run script: `python scripts/creativex_scoring_storing.py`
3. Script downloads each PDF
4. LlamaExtract processes PDF
5. Results stored in database with status='active'
6. PDF deleted from Box
7. Email notification sent
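The per-file part of this workflow can be sketched as a loop in which a PDF is deleted only after extraction and storage both succeed. The `process_pdfs` function and its injected `extract`, `store`, and `delete` callables are hypothetical stand-ins for the Box, LlamaExtract, and database calls, not the script's real interfaces.

```python
def process_pdfs(pdfs, extract, store, delete):
    """Process (file_id, filename) pairs; return (succeeded, failed) lists."""
    succeeded, failed = [], []
    for file_id, filename in pdfs:
        try:
            result = extract(file_id, filename)
            store(filename, file_id, result)
            # Only reached when extraction and storage both succeeded.
            delete(file_id)
            succeeded.append(filename)
        except Exception as exc:
            # Failed files are left in place for manual review or retry.
            failed.append((filename, str(exc)))
    return succeeded, failed
```

The two returned lists are enough to decide which notification to send afterwards.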
#### Master Asset CreativeX (Automatic)
1. A1→A2 downloads master asset from DAM
2. If master has CreativeX score/URL in metadata:
   - Extracts score and URL
   - Stores in database with status='master-cx-score'
   - Links via tracking_id
3. Used for reference/reporting only (not used in A2→A3 uploads)
4. Logs "No CreativeX data" if master not scored (normal)
### Checking Results
**IMPORTANT: Understanding Status Field**
The system uses **soft delete** to preserve history while keeping latest scores easily accessible:
- `status = 'active'` → Latest/current derivative score (from PDF extraction)
- `status = 'superseded'` → Previous derivative score (history/audit trail)
- `status = 'master-cx-score'` → Master asset score (from A1→A2, reference only)
**Status Usage:**
- **Derivative scores (PDF extraction):** When you re-upload the same filename with a new score, the old record is marked `superseded` and a new `active` record is created.
- **Master scores (A1→A2):** Stored with `master-cx-score` status and linked via `tracking_id`. Not used for uploads, only for reference/reporting.
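The supersede-then-insert behavior for derivative scores can be illustrated with a small self-contained example. To keep it runnable without a server, it uses Python's built-in `sqlite3` in place of PostgreSQL and a cut-down table; `store_score` is a hypothetical function, not the project's actual database method.

```python
import sqlite3


def store_score(conn, filename, quality_score):
    """Mark any active row for the filename superseded, then insert a new one."""
    with conn:  # one transaction: both statements commit or neither does
        conn.execute(
            "UPDATE creativex_scores SET status = 'superseded' "
            "WHERE filename = ? AND status = 'active'",
            (filename,),
        )
        conn.execute(
            "INSERT INTO creativex_scores (filename, quality_score, status) "
            "VALUES (?, ?, 'active')",
            (filename, quality_score),
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE creativex_scores ("
    "id INTEGER PRIMARY KEY, filename TEXT NOT NULL, "
    "quality_score TEXT, status TEXT DEFAULT 'active')"
)
store_score(conn, "yourfile.mp4", "72")
store_score(conn, "yourfile.mp4", "85")
rows = conn.execute(
    "SELECT quality_score, status FROM creativex_scores ORDER BY id"
).fetchall()
print(rows)  # [('72', 'superseded'), ('85', 'active')]
```

Running both statements in one transaction avoids a window in which a filename has either two `active` rows or none.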
**Query for Latest Scores (Most Common):**
```bash
# View recent ACTIVE extractions (latest scores only)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, creativex_id, quality_score, extracted_at
FROM creativex_scores
WHERE status = 'active'
ORDER BY extracted_at DESC
LIMIT 10;
"
# Count total ACTIVE scores (unique filenames with latest scores)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT COUNT(*) as active_scores FROM creativex_scores WHERE status = 'active';
"
# Get latest score for specific filename (use this in A2→A3 workflow)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, creativex_id, creativex_url, quality_score, extracted_at
FROM creativex_scores
WHERE filename = 'yourfile.mp4' AND status = 'active';
"
```
**Query for Master Scores (Reference/Reporting):**
```bash
# Get master score for specific tracking ID
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, quality_score, tracking_id, creativex_url
FROM creativex_scores
WHERE tracking_id = '7xXgKp' AND status = 'master-cx-score';
"
# View all master scores
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT tracking_id, filename, quality_score, extracted_at
FROM creativex_scores
WHERE status = 'master-cx-score'
ORDER BY extracted_at DESC
LIMIT 10;
"
```
**Query for History/Audit (All Versions):**
```bash
# View ALL versions of a file (including superseded)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, quality_score, status, tracking_id, extracted_at
FROM creativex_scores
WHERE filename = 'yourfile.mp4'
ORDER BY extracted_at DESC;
"
# Count total records by status
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT
COUNT(*) as total_records,
COUNT(*) FILTER (WHERE status = 'active') as active_derivative_scores,
COUNT(*) FILTER (WHERE status = 'superseded') as superseded_records,
COUNT(*) FILTER (WHERE status = 'master-cx-score') as master_scores
FROM creativex_scores;
"
# See score changes over time for a file
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT
filename,
quality_score,
status,
extracted_at,
CASE
WHEN status = 'active' THEN 'CURRENT'
ELSE 'OLD VERSION'
END as version_label
FROM creativex_scores
WHERE filename LIKE '%Nutella%'
ORDER BY filename, extracted_at DESC;
"
```
### Viewing Full JSON
```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, jsonb_pretty(full_extraction_data)
FROM creativex_scores
WHERE filename = 'example.pdf';
"
```
## Future Integration: A2→A3 Workflow
### How to Use in DAM Upload Scripts
The database method `db.get_creativex_score_by_filename(filename)` is ready for use in other scripts.
**IMPORTANT:** The method automatically filters for `status = 'active'` to always return the **latest** score.
**Example usage in a2_to_a3_upload_polling.py:**
```python
# In a2_to_a3_upload_polling.py or similar
# (db, dam_metadata, and logger are assumed to be defined by the surrounding script)
filename = "Brand_Country_Language_123456.mp4"
# Lookup CreativeX score (returns ONLY active/latest score)
score_data = db.get_creativex_score_by_filename(filename)
if score_data:
    # Add to DAM metadata
    dam_metadata['FERRERO.FIELD.CREATIVEX_SCORE'] = score_data['quality_score']
    dam_metadata['FERRERO.FIELD.CREATIVEX_URL'] = score_data['creativex_url']
    dam_metadata['FERRERO.FIELD.CREATIVEX_ID'] = score_data['creativex_id']
    # Optional: Access full JSON for additional fields
    full_data = score_data['full_extraction_data']
    dam_metadata['FERRERO.FIELD.CREATIVEX_BRAND'] = full_data['data']['brand']
    dam_metadata['FERRERO.FIELD.CREATIVEX_MARKET'] = full_data['data']['market']
    logger.info("Added CreativeX score {} to DAM metadata".format(
        score_data['quality_score']
    ))
else:
    logger.warning("No CreativeX score found for: {}".format(filename))
```
### Query Logic in get_creativex_score_by_filename()
The method uses this query internally:
```sql
SELECT filename, creativex_id, creativex_url, quality_score,
box_file_id, full_extraction_data, extracted_at
FROM creativex_scores
WHERE filename = %s AND status = 'active'
ORDER BY extracted_at DESC
LIMIT 1
```
This ensures you **always get the latest score**, even if multiple versions exist in history.
### Behavior Summary for A2→A3 Integration
| Scenario | What Happens |
|----------|--------------|
| Score exists for filename | Returns latest `active` score |
| Multiple scores exist (history) | Returns only the newest `active` one |
| No score exists | Returns `None` |
| File re-scored (same filename) | Old score marked `superseded`, new score is `active` |
**Key Takeaway:** You never need to handle duplicates or history in the A2→A3 workflow. The query filters them out automatically.
## Troubleshooting
### "llama-cloud-services not installed"
```bash
source venv/bin/activate
pip install llama-cloud-services
```
### "Agent 'Creativex-Extract' not found"
- Verify agent name in LlamaCloud portal
- Check spelling matches exactly: `Creativex-Extract`
- Verify API key is correct
### "No PDF files found"
- This is normal if Box folder 350605024645 is empty
- Upload test PDF to folder and re-run
### "Database connection failed"
```bash
# Check PostgreSQL is running
docker ps | grep ferrero
# Test connection
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "SELECT 1;"
```
### "Email not sending"
- Check SMTP configuration in .env
- Verify Mailgun credentials
- Check logs for detailed error
### Files not deleted from Box
- This is expected for failed extractions
- Only successful extractions delete files
- Failed files remain for manual review/retry
## Rollback Instructions
If you need to roll back:
### Remove Database Table
```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
DROP TABLE IF EXISTS creativex_scores CASCADE;
"
```
### Remove from Cron
```bash
crontab -e
# Delete the CreativeX line, save and exit
```
### Revert Code
```bash
cd /opt/ferrero-automation/Python-Version
git revert <commit-hash>
git push origin main
```
## Support
- **Logs:** `logs/creativex_scoring.log`
- **Database Queries:** See "Checking Results" section above
- **Email Test:** Check SMTP settings and recipients list
- **LlamaCloud Issues:** Verify API key and agent configuration
## Summary Checklist
**Local Setup:**
- [ ] Add `LLAMA_CLOUD_API_KEY` to .env
- [ ] Install `llama-cloud-services` package
- [ ] Create `creativex_scores` table
- [ ] Test script runs successfully
**Production Deployment:**
- [ ] Git pull latest code
- [ ] Add `LLAMA_CLOUD_API_KEY` to server .env
- [ ] Install dependencies on server
- [ ] Create database table on server
- [ ] Test run on server
- [ ] Verify email notifications
- [ ] (Optional) Add to cron if automating
**Post-Deployment:**
- [ ] Upload test PDF to Box folder 350605024645
- [ ] Run script and verify extraction
- [ ] Check database record created
- [ ] Verify PDF deleted from Box
- [ ] Confirm email notification received