# CreativeX Score Extraction - Deployment Guide

## Overview

This guide covers deploying the CreativeX score extraction system, which:
1. Monitors Box folder 350605024645 for PDF files
2. Extracts CreativeX scores using the LlamaExtract AI agent "Creativex-Extract"
3. Stores results in a PostgreSQL database, including the full extraction JSON
4. Removes processed files from Box
5. Sends email notifications

## Local Development Setup

### 1. Add Environment Variables

Add to your `.env` file:

```bash
# Box Folder Configuration (add to existing Box section)
BOX_ROOT_FOLDER_CREATIVEX=350605024645

# CreativeX Configuration
LLAMA_CLOUD_API_KEY=your_llama_cloud_api_key_here
CREATIVEX_AGENT_NAME=Creativex-Extract
```

### 2. Install Python Dependencies

```bash
cd Python-Version
source venv/bin/activate
pip install llama-cloud-services
```

Or install all dependencies:

```bash
pip install -r requirements.txt
```

### 3. Create Database Table

**If starting fresh (full init):**
```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -f database/init.sql
```

**If database already exists (add table only):**
```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
CREATE TABLE IF NOT EXISTS creativex_scores (
    id SERIAL PRIMARY KEY,
    filename VARCHAR(500) NOT NULL,
    box_file_id VARCHAR(255),
    creativex_id VARCHAR(255),
    creativex_url TEXT,
    quality_score VARCHAR(50),
    full_extraction_data JSONB,
    extracted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    status VARCHAR(50) DEFAULT 'active',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_creativex_filename ON creativex_scores(filename);
CREATE INDEX IF NOT EXISTS idx_creativex_box_file ON creativex_scores(box_file_id);
CREATE INDEX IF NOT EXISTS idx_creativex_status ON creativex_scores(status);
"
```

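### 4. Verify Table Creation

```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "\d creativex_scores"
```

You should see:
- 10 columns (id, filename, box_file_id, creativex_id, creativex_url, quality_score, full_extraction_data, extracted_at, status, created_at)
- 3 indexes (idx_creativex_filename, idx_creativex_box_file, idx_creativex_status)

These counts match the `CREATE TABLE` statement in step 3. The production schema (Step 4 of Production Server Deployment) also defines a `tracking_id` column and an `idx_creativex_tracking_id` index, so expect 11 columns and 4 indexes if your table includes them.
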
### 5. Test Locally

```bash
# Run the script manually
python scripts/creativex_scoring_storing.py
```

**Expected behaviors:**
- If there are no PDFs in Box folder 350605024645: a "No PDF files found" email is sent
- If PDFs are present: extraction runs, scores are stored, and the files are deleted from Box
- If extraction fails: a partial-success email is sent, listing the errors

## Production Server Deployment

### Prerequisites
- The server is already running the Ferrero automation (A1→A2, A5→A6, etc.)
- The Git repository is backed up to Bitbucket
- You have SSH access to the production server

### Step 1: Update .env on Server

SSH to the server and open `.env`:

```bash
cd /opt/ferrero-automation/Python-Version
nano .env
```

Add:
```bash
# Box Folder Configuration (add to existing Box section)
BOX_ROOT_FOLDER_CREATIVEX=350605024645

# CreativeX Configuration
LLAMA_CLOUD_API_KEY=your_production_llama_cloud_api_key
CREATIVEX_AGENT_NAME=Creativex-Extract
```

Save and exit (Ctrl+X, Y, Enter).

### Step 2: Pull Latest Code

```bash
cd /opt/ferrero-automation/Python-Version
git pull origin main
```

This pull includes:
- `scripts/creativex_scoring_storing.py`
- Updated `database/init.sql`
- Updated `scripts/shared/database.py`
- Updated `scripts/shared/notifier.py`
- Updated `config/config.yaml`
- Updated `requirements.txt`

### Step 3: Install Dependencies

```bash
cd /opt/ferrero-automation/Python-Version
source venv/bin/activate
pip install llama-cloud-services
```

Or update all:
```bash
pip install -r requirements.txt --upgrade
```

### Step 4: Create Database Table

```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
CREATE TABLE IF NOT EXISTS creativex_scores (
    id SERIAL PRIMARY KEY,
    filename VARCHAR(500) NOT NULL,
    box_file_id VARCHAR(255),
    creativex_id VARCHAR(255),
    creativex_url TEXT,
    quality_score VARCHAR(50),
    full_extraction_data JSONB,
    tracking_id VARCHAR(6),
    extracted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    status VARCHAR(50) DEFAULT 'active',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_creativex_filename ON creativex_scores(filename);
CREATE INDEX IF NOT EXISTS idx_creativex_box_file ON creativex_scores(box_file_id);
CREATE INDEX IF NOT EXISTS idx_creativex_status ON creativex_scores(status);
CREATE INDEX IF NOT EXISTS idx_creativex_tracking_id ON creativex_scores(tracking_id);
"
```

**If Table Already Exists (Migration):**
```bash
# Add tracking_id column to existing table (safe to re-run)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
ALTER TABLE creativex_scores ADD COLUMN IF NOT EXISTS tracking_id VARCHAR(6);
CREATE INDEX IF NOT EXISTS idx_creativex_tracking_id ON creativex_scores(tracking_id);
"
```

**Note on Status Values:**
- `active` - Current derivative score (from PDF extraction)
- `superseded` - Old derivative score (version history)
- `master-cx-score` - Master asset score (from A1→A2 DAM metadata, reference only)

### Step 5: Verify Installation

```bash
# Check database table
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "SELECT COUNT(*) FROM creativex_scores;"

# Check script exists
ls -lh scripts/creativex_scoring_storing.py

# Make it executable
chmod +x scripts/creativex_scoring_storing.py
```

### Step 6: Test Run

```bash
cd /opt/ferrero-automation/Python-Version
source venv/bin/activate
python scripts/creativex_scoring_storing.py
```

Check logs:
```bash
tail -f logs/creativex_scoring.log
```

### Step 7: Add to Cron (Optional - If Automated)

**Note:** The extraction is run manually for now, so skip this step initially.

If you want to automate later (e.g., every hour):

```bash
crontab -e
```

Add:
```cron
# CreativeX Score Extraction - Every hour
0 * * * * cd /opt/ferrero-automation/Python-Version && venv/bin/python scripts/creativex_scoring_storing.py >> logs/cron_creativex.log 2>&1
```

Save and exit.

## Configuration Details

### Environment Variables

All configuration is centralized in `.env`:

```bash
# Box folder for CreativeX PDFs
BOX_ROOT_FOLDER_CREATIVEX=350605024645

# LlamaCloud API credentials
LLAMA_CLOUD_API_KEY=your_api_key_here

# Agent name in LlamaExtract
CREATIVEX_AGENT_NAME=Creativex-Extract
```

### Box Folder
- **Folder ID:** Configured via `BOX_ROOT_FOLDER_CREATIVEX` (default: 350605024645)
- **Purpose:** Drop PDFs here for CreativeX score extraction
- **Behavior:** Files are automatically deleted after successful processing

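As a minimal sketch of how a script might read these settings, the loader below falls back to the defaults documented in this guide for the folder ID and agent name. The `CreativeXConfig` class and `load_creativex_config` function are hypothetical names used for illustration, and treating a missing `LLAMA_CLOUD_API_KEY` as a hard error is an assumption, not confirmed behavior of `creativex_scoring_storing.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Defaults taken from this guide.
DEFAULT_BOX_FOLDER_ID = "350605024645"
DEFAULT_AGENT_NAME = "Creativex-Extract"


@dataclass(frozen=True)
class CreativeXConfig:
    """Hypothetical container for the three CreativeX settings."""
    box_folder_id: str
    llama_api_key: str
    agent_name: str


def load_creativex_config(env: Optional[Mapping[str, str]] = None) -> CreativeXConfig:
    """Read settings from a mapping (os.environ by default).

    Assumption: the API key has no usable default, so a missing key fails fast.
    """
    env = os.environ if env is None else env
    api_key = env.get("LLAMA_CLOUD_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("LLAMA_CLOUD_API_KEY is not set")
    return CreativeXConfig(
        box_folder_id=env.get("BOX_ROOT_FOLDER_CREATIVEX", DEFAULT_BOX_FOLDER_ID),
        llama_api_key=api_key,
        agent_name=env.get("CREATIVEX_AGENT_NAME", DEFAULT_AGENT_NAME),
    )
```

Passing the environment in as a mapping keeps the loader easy to test without touching the real process environment.
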
### LlamaExtract Agent
- **Agent Name:** Configured via `CREATIVEX_AGENT_NAME` (default: Creativex-Extract)
- **Expected Fields:**
  - `filename`: Original filename from PDF
  - `creativeXId.id`: CreativeX identifier
  - `creativeXId.url`: CreativeX URL
  - `ferreroCreativeQuality.percentage`: Quality score

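The dotted field names above map naturally onto the quick-access columns of `creativex_scores`. The following sketch shows one way that mapping could look. The helper names `_dig` and `to_score_row` are hypothetical, and it assumes the fields sit at the top level of the dictionary passed in; if the agent output nests them (the A2→A3 example later in this guide reads `full_extraction_data['data']`), pass that inner dictionary instead.

```python
from typing import Any, Dict, Optional


def _dig(data: Dict[str, Any], dotted_path: str) -> Optional[Any]:
    """Follow a dotted path such as 'creativeXId.id'; return None if any part is missing."""
    current: Any = data
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def to_score_row(extraction: Dict[str, Any]) -> Dict[str, Any]:
    """Map an extraction result onto creativex_scores columns (hypothetical helper)."""
    score = _dig(extraction, "ferreroCreativeQuality.percentage")
    return {
        "filename": _dig(extraction, "filename"),
        "creativex_id": _dig(extraction, "creativeXId.id"),
        "creativex_url": _dig(extraction, "creativeXId.url"),
        # quality_score is a VARCHAR column, so store the value as text.
        "quality_score": None if score is None else str(score),
        # Keep the complete result for the JSONB column.
        "full_extraction_data": extraction,
    }
```

Returning `None` for missing fields, rather than raising, lets a partially extracted PDF still be stored and reviewed later.
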
### Database Storage
- **Table:** `creativex_scores`
- **Quick Access Fields:** filename, creativex_id, creativex_url, quality_score
- **Full JSON:** Stored in `full_extraction_data` JSONB column
- **Purpose:** Future lookups by filename during DAM uploads

### Email Notifications

**Recipients configured in .env:**
- Success: `REPORT_EMAILS`
- Errors: `ERROR_EMAIL`

**Templates:**
1. `creativex_complete` - All files processed successfully
2. `creativex_partial` - Some files failed
3. `creativex_no_files` - No PDFs found (normal if folder empty)

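The choice between the three templates can be sketched as a small function of the run's file counts. `choose_template` is a hypothetical name, and mapping a run where every file failed to `creativex_partial` is an assumption, since the guide lists no separate all-failed template.

```python
def choose_template(total_files: int, failed_files: int) -> str:
    """Pick a notification template name from the counts of a single run."""
    if total_files < 0 or failed_files < 0 or failed_files > total_files:
        raise ValueError("counts must satisfy 0 <= failed_files <= total_files")
    if total_files == 0:
        return "creativex_no_files"
    if failed_files > 0:
        # Assumption: a run where every file failed also uses the partial template.
        return "creativex_partial"
    return "creativex_complete"
```
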
## Usage

### Manual Execution

```bash
cd /opt/ferrero-automation/Python-Version
source venv/bin/activate
python scripts/creativex_scoring_storing.py
```

### Workflow

#### CreativeX PDF Extraction (Manual)
1. Upload PDFs to Box folder 350605024645
2. Run script: `python scripts/creativex_scoring_storing.py`
3. Script downloads each PDF
4. LlamaExtract processes PDF
5. Results stored in database with status='active'
6. PDF deleted from Box
7. Email notification sent

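The per-file part of these steps can be sketched as a loop in which the Box delete happens only after download, extraction, and storage have all succeeded. This is an illustration rather than the script's actual code: `process_folder`, `RunSummary`, and the five injected callables are hypothetical stand-ins for the real Box, LlamaExtract, and database calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class RunSummary:
    """Outcome of one run: filenames processed, plus error text per failed filename."""
    processed: List[str] = field(default_factory=list)
    failed: Dict[str, str] = field(default_factory=dict)


def process_folder(
    list_pdfs: Callable[[], Iterable[Tuple[str, str]]],  # yields (box_file_id, filename)
    download: Callable[[str], bytes],
    extract: Callable[[bytes], dict],
    store: Callable[[str, str, dict], None],
    delete: Callable[[str], None],
) -> RunSummary:
    summary = RunSummary()
    for file_id, filename in list_pdfs():
        try:
            pdf_bytes = download(file_id)
            result = extract(pdf_bytes)
            store(filename, file_id, result)
            # Reached only when every earlier step succeeded, so failed
            # files stay in Box for manual review or retry.
            delete(file_id)
            summary.processed.append(filename)
        except Exception as exc:
            # One bad PDF should not stop the rest of the batch.
            summary.failed[filename] = str(exc)
    return summary
```

The resulting summary carries enough information to decide which notification to send once the loop finishes.
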
#### Master Asset CreativeX (Automatic)
1. A1→A2 downloads master asset from DAM
2. If master has CreativeX score/URL in metadata:
   - Extracts score and URL
   - Stores in database with status='master-cx-score'
   - Links via tracking_id
3. Used for reference/reporting only (not used in A2→A3 uploads)
4. Logs "No CreativeX data" if master not scored (normal)

### Checking Results

**IMPORTANT: Understanding Status Field**

The system uses **soft delete** to preserve history while keeping the latest scores easily accessible:
- `status = 'active'` → Latest/current derivative score (from PDF extraction)
- `status = 'superseded'` → Previous derivative score (history/audit trail)
- `status = 'master-cx-score'` → Master asset score (from A1→A2, reference only)

**Status Usage:**
- **Derivative scores (PDF extraction):** When you re-upload the same filename with a new score, the old record is marked `superseded` and a new `active` record is created.
- **Master scores (A1→A2):** Stored with `master-cx-score` status and linked via `tracking_id`. Not used for uploads, only for reference/reporting.

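The supersede-then-insert behavior for derivative scores can be sketched as two statements in one transaction. To stay self-contained, the example uses Python's built-in `sqlite3` with a trimmed-down table instead of the production PostgreSQL schema; `store_derivative_score` is a hypothetical helper, not the method in `scripts/shared/database.py`.

```python
import sqlite3


def store_derivative_score(conn: sqlite3.Connection, filename: str, quality_score: str) -> None:
    """Mark any current 'active' row for the filename as superseded, then add the new one."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "UPDATE creativex_scores SET status = 'superseded' "
            "WHERE filename = ? AND status = 'active'",
            (filename,),
        )
        conn.execute(
            "INSERT INTO creativex_scores (filename, quality_score, status) "
            "VALUES (?, ?, 'active')",
            (filename, quality_score),
        )


# In-memory stand-in for the real table (assumption: only the columns needed here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE creativex_scores ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "filename TEXT NOT NULL, "
    "quality_score TEXT, "
    "status TEXT DEFAULT 'active')"
)

store_derivative_score(conn, "yourfile.mp4", "72%")
store_derivative_score(conn, "yourfile.mp4", "85%")

rows = conn.execute(
    "SELECT quality_score, status FROM creativex_scores ORDER BY id"
).fetchall()
```

Because the `UPDATE` filters on `status = 'active'`, rows stored as `master-cx-score` are left untouched, and each filename ends up with at most one `active` row.
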
**Query for Latest Scores (Most Common):**
```bash
# View recent ACTIVE extractions (latest scores only)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, creativex_id, quality_score, extracted_at
FROM creativex_scores
WHERE status = 'active'
ORDER BY extracted_at DESC
LIMIT 10;
"

# Count total ACTIVE scores (unique filenames with latest scores)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT COUNT(*) as active_scores FROM creativex_scores WHERE status = 'active';
"

# Get latest score for specific filename (use this in A2→A3 workflow)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, creativex_id, creativex_url, quality_score, extracted_at
FROM creativex_scores
WHERE filename = 'yourfile.mp4' AND status = 'active';
"
```

**Query for Master Scores (Reference/Reporting):**
```bash
# Get master score for specific tracking ID
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, quality_score, tracking_id, creativex_url
FROM creativex_scores
WHERE tracking_id = '7xXgKp' AND status = 'master-cx-score';
"

# View all master scores
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT tracking_id, filename, quality_score, extracted_at
FROM creativex_scores
WHERE status = 'master-cx-score'
ORDER BY extracted_at DESC
LIMIT 10;
"
```

**Query for History/Audit (All Versions):**
```bash
# View ALL versions of a file (including superseded)
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, quality_score, status, tracking_id, extracted_at
FROM creativex_scores
WHERE filename = 'yourfile.mp4'
ORDER BY extracted_at DESC;
"

# Count total records by status
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT
    COUNT(*) as total_records,
    COUNT(*) FILTER (WHERE status = 'active') as active_derivative_scores,
    COUNT(*) FILTER (WHERE status = 'superseded') as superseded_records,
    COUNT(*) FILTER (WHERE status = 'master-cx-score') as master_scores
FROM creativex_scores;
"

# See score changes over time for a file
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT
    filename,
    quality_score,
    status,
    extracted_at,
    CASE
        WHEN status = 'active' THEN 'CURRENT'
        ELSE 'OLD VERSION'
    END as version_label
FROM creativex_scores
WHERE filename LIKE '%Nutella%'
ORDER BY filename, extracted_at DESC;
"
```

### Viewing Full JSON

```bash
# The column is already JSONB, so no cast is needed; jsonb_pretty() indents the output
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
SELECT filename, jsonb_pretty(full_extraction_data)
FROM creativex_scores
WHERE filename = 'example.pdf';
"
```

## Future Integration: A2→A3 Workflow

### How to Use in DAM Upload Scripts

The database method `db.get_creativex_score_by_filename(filename)` is ready for use in other scripts.

**IMPORTANT:** The method automatically filters for `status = 'active'` to always return the **latest** score.

**Example usage in `a2_to_a3_upload_polling.py`:**

```python
# In a2_to_a3_upload_polling.py or similar.
# `db`, `dam_metadata`, and `logger` are provided by the surrounding script.
filename = "Brand_Country_Language_123456.mp4"

# Lookup CreativeX score (returns ONLY active/latest score)
score_data = db.get_creativex_score_by_filename(filename)

if score_data:
    # Add to DAM metadata
    dam_metadata['FERRERO.FIELD.CREATIVEX_SCORE'] = score_data['quality_score']
    dam_metadata['FERRERO.FIELD.CREATIVEX_URL'] = score_data['creativex_url']
    dam_metadata['FERRERO.FIELD.CREATIVEX_ID'] = score_data['creativex_id']

    # Optional: Access full JSON for additional fields.
    # .get() avoids a KeyError when a PDF lacks these fields.
    full_data = score_data['full_extraction_data'] or {}
    extracted = full_data.get('data', {})
    if extracted.get('brand') is not None:
        dam_metadata['FERRERO.FIELD.CREATIVEX_BRAND'] = extracted['brand']
    if extracted.get('market') is not None:
        dam_metadata['FERRERO.FIELD.CREATIVEX_MARKET'] = extracted['market']

    logger.info("Added CreativeX score {} to DAM metadata".format(
        score_data['quality_score']
    ))
else:
    logger.warning("No CreativeX score found for: {}".format(filename))
```

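If several upload scripts need the same mapping, it may help to factor it into one function that tolerates a missing score and empty fields. The sketch below is one possible shape; `creativex_dam_fields` is a hypothetical name, and skipping empty values rather than writing them to the DAM is an assumption about the desired behavior.

```python
from typing import Any, Dict, Optional


def creativex_dam_fields(score_data: Optional[Dict[str, Any]]) -> Dict[str, str]:
    """Build the CreativeX DAM fields from a score record, or {} when none exists."""
    if not score_data:
        return {}
    candidates = {
        "FERRERO.FIELD.CREATIVEX_SCORE": score_data.get("quality_score"),
        "FERRERO.FIELD.CREATIVEX_URL": score_data.get("creativex_url"),
        "FERRERO.FIELD.CREATIVEX_ID": score_data.get("creativex_id"),
    }
    # Assumption: fields with no value are omitted instead of being sent empty.
    return {key: str(value) for key, value in candidates.items() if value not in (None, "")}
```

A caller could then merge the result into its metadata with `dam_metadata.update(creativex_dam_fields(score_data))`, which is a no-op when no score was found.
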
### Query Logic in get_creativex_score_by_filename()

The method uses this query internally:
```sql
SELECT filename, creativex_id, creativex_url, quality_score,
       box_file_id, full_extraction_data, extracted_at
FROM creativex_scores
WHERE filename = %s AND status = 'active'
ORDER BY extracted_at DESC
LIMIT 1
```

This ensures you **always get the latest score**, even if multiple versions exist in history.

### Behavior Summary for A2→A3 Integration

| Scenario | What Happens |
|----------|--------------|
| Score exists for filename | Returns latest `active` score |
| Multiple scores exist (history) | Returns only the newest `active` one |
| No score exists | Returns `None` |
| File re-scored (same filename) | Old score marked `superseded`, new score is `active` |

**Key Takeaway:** You never need to worry about duplicates or history in the A2→A3 workflow. The query handles both automatically.

## Troubleshooting

### "llama-cloud-services not installed"
```bash
source venv/bin/activate
pip install llama-cloud-services
```

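A quick preflight check can surface this problem, and a missing API key, before the script runs. The sketch below is illustrative only: `preflight_problems` is a hypothetical helper, and it assumes the `llama-cloud-services` package is imported under the module name `llama_cloud_services`.

```python
import importlib.util
import os
from typing import List, Mapping, Optional


def preflight_problems(
    env: Optional[Mapping[str, str]] = None,
    module_name: str = "llama_cloud_services",  # assumed import name of llama-cloud-services
) -> List[str]:
    """Return human-readable problems; an empty list means the basics look fine."""
    env = os.environ if env is None else env
    problems: List[str] = []
    # find_spec() returns None for a top-level module that is not installed.
    if importlib.util.find_spec(module_name) is None:
        problems.append(
            "{} is not importable; run: pip install llama-cloud-services".format(module_name)
        )
    if not env.get("LLAMA_CLOUD_API_KEY"):
        problems.append("LLAMA_CLOUD_API_KEY is not set")
    return problems
```

Run it inside the activated virtual environment so the check sees the same packages as the script.
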
### "Agent 'Creativex-Extract' not found"
- Verify the agent name in the LlamaCloud portal
- Check that the spelling matches exactly: `Creativex-Extract`
- Verify the API key is correct

### "No PDF files found"
- This is normal if Box folder 350605024645 is empty
- Upload a test PDF to the folder and re-run

### "Database connection failed"
```bash
# Check PostgreSQL is running
docker ps | grep ferrero

# Test connection
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "SELECT 1;"
```

### "Email not sending"
- Check the SMTP configuration in `.env`
- Verify Mailgun credentials
- Check the logs for a detailed error

### Files not deleted from Box
- This is expected for failed extractions
- Only successful extractions delete files
- Failed files remain for manual review/retry

## Rollback Instructions

If you need to roll back:

### Remove Database Table
This permanently deletes all stored scores, including history.
```bash
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
DROP TABLE IF EXISTS creativex_scores CASCADE;
"
```

### Remove from Cron
```bash
crontab -e
# Delete the CreativeX line, save and exit
```

### Revert Code
```bash
cd /opt/ferrero-automation/Python-Version
git revert <commit-hash>
git push origin main
```

## Support

- **Logs:** `logs/creativex_scoring.log`
- **Database Queries:** See "Checking Results" section above
- **Email Test:** Check SMTP settings and recipients list
- **LlamaCloud Issues:** Verify API key and agent configuration

## Summary Checklist

**Local Setup:**
- [ ] Add `LLAMA_CLOUD_API_KEY` to .env
- [ ] Install `llama-cloud-services` package
- [ ] Create `creativex_scores` table
- [ ] Test script runs successfully

**Production Deployment:**
- [ ] Git pull latest code
- [ ] Add `LLAMA_CLOUD_API_KEY` to server .env
- [ ] Install dependencies on server
- [ ] Create database table on server
- [ ] Test run on server
- [ ] Verify email notifications
- [ ] (Optional) Add to cron if automating

**Post-Deployment:**
- [ ] Upload test PDF to Box folder 350605024645
- [ ] Run script and verify extraction
- [ ] Check database record created
- [ ] Verify PDF deleted from Box
- [ ] Confirm email notification received