Add CreativeX score extraction and storage system
Implements new workflow to extract CreativeX quality scores from PDFs using LlamaExtract AI and store results in PostgreSQL database.

Components added:
- creativex_scoring_storing.py: Main script to process PDFs from Box
- creativex_scores table: Database table with JSONB for full JSON storage
- Database methods: store_creativex_score() and get_creativex_score_by_filename()
- Email templates: creativex_complete, creativex_partial, creativex_no_files
- Configuration: creativex section in config.yaml
- CREATIVEX_DEPLOYMENT.md: Complete deployment and usage guide

Features:
- Monitors Box folder 350605024645 for PDFs
- Extracts scores using LlamaExtract agent "Creativex-Extract"
- Stores 4 key fields (filename, ID, URL, score) + full JSON
- Deletes processed PDFs from Box after successful extraction
- Sends email notifications for success/partial/no-files scenarios
- Manual execution (python scripts/creativex_scoring_storing.py)

Database schema:
- Table: creativex_scores with 10 columns
- Indexes on filename, box_file_id, status for fast lookups
- JSONB column stores complete extraction for future flexibility

Future integration ready: db.get_creativex_score_by_filename() available for DAM upload workflows to attach CreativeX metadata during asset processing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
parent a9c3ff6503
commit b6b9d7337a
7 changed files with 1041 additions and 1 deletions
398
Python-Version/CREATIVEX_DEPLOYMENT.md
Normal file
|
|
@ -0,0 +1,398 @@
# CreativeX Score Extraction - Deployment Guide

## Overview

This guide covers deploying the CreativeX score extraction system, which:

1. Monitors Box folder 350605024645 for PDF files
2. Extracts CreativeX scores using the LlamaExtract AI agent "Creativex-Extract"
3. Stores results in a PostgreSQL database, including the full extraction JSON
4. Removes processed files from Box after successful extraction
5. Sends email notifications

## Local Development Setup
|
||||
|
||||
### 1. Add Environment Variable
|
||||
|
||||
Add to your `.env` file:
|
||||
|
||||
```bash
|
||||
# CreativeX Configuration
|
||||
LLAMA_CLOUD_API_KEY=your_llama_cloud_api_key_here
|
||||
```
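
The `creativex` section of `config/config.yaml` refers to this key as `${LLAMA_CLOUD_API_KEY}`. How `shared.config_loader` resolves that placeholder is not shown in this commit, so the following is only a minimal sketch of one way such placeholders could be expanded; the function name `expand_placeholders` is hypothetical.

```python
import os
import re

# Matches ${NAME} where NAME looks like an environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_placeholders(value, env=None):
    """Replace every ${NAME} in `value` with env[NAME].

    Hypothetical helper: the real config loader may behave differently.
    Raises KeyError if a referenced variable is not set, so a missing
    API key fails early instead of being passed on as a literal string.
    """
    env = os.environ if env is None else env

    def _substitute(match):
        name = match.group(1)
        if name not in env:
            raise KeyError("Environment variable {} is not set".format(name))
        return env[name]

    return _PLACEHOLDER.sub(_substitute, value)
```

Failing loudly on an unset variable is a design choice in this sketch: it surfaces a missing `.env` entry before any call to LlamaCloud is attempted.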
|
||||
|
||||
### 2. Install Python Dependencies
|
||||
|
||||
```bash
|
||||
cd Python-Version
|
||||
source venv/bin/activate
|
||||
pip install llama-cloud-services
|
||||
```
|
||||
|
||||
Or install all dependencies:
|
||||
|
||||
```bash
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
### 3. Create Database Table
|
||||
|
||||
**If starting fresh (full init):**
|
||||
```bash
|
||||
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -f database/init.sql
|
||||
```
|
||||
|
||||
**If database already exists (add table only):**
|
||||
```bash
|
||||
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
|
||||
CREATE TABLE IF NOT EXISTS creativex_scores (
|
||||
id SERIAL PRIMARY KEY,
|
||||
filename VARCHAR(500) NOT NULL,
|
||||
box_file_id VARCHAR(255),
|
||||
creativex_id VARCHAR(255),
|
||||
creativex_url TEXT,
|
||||
quality_score VARCHAR(50),
|
||||
full_extraction_data JSONB,
|
||||
extracted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
|
||||
status VARCHAR(50) DEFAULT 'active',
|
||||
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_creativex_filename ON creativex_scores(filename);
|
||||
CREATE INDEX IF NOT EXISTS idx_creativex_box_file ON creativex_scores(box_file_id);
|
||||
CREATE INDEX IF NOT EXISTS idx_creativex_status ON creativex_scores(status);
|
||||
"
|
||||
```
|
||||
|
||||
### 4. Verify Table Creation
|
||||
|
||||
```bash
|
||||
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "\d creativex_scores"
|
||||
```
|
||||
|
||||
You should see:
|
||||
- 10 columns (id, filename, box_file_id, creativex_id, creativex_url, quality_score, full_extraction_data, extracted_at, status, created_at)
|
||||
- 3 indexes (idx_creativex_filename, idx_creativex_box_file, idx_creativex_status)
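
If you prefer a scripted check over reading the `\d` output, the comparison can be sketched as below. It assumes you have already fetched the column names yourself (for example from `information_schema.columns`); the helper name `missing_columns` is hypothetical and not part of this commit.

```python
# The 10 columns listed above for creativex_scores.
EXPECTED_COLUMNS = {
    "id", "filename", "box_file_id", "creativex_id", "creativex_url",
    "quality_score", "full_extraction_data", "extracted_at", "status",
    "created_at",
}


def missing_columns(actual_columns):
    """Return the expected column names absent from `actual_columns`, sorted."""
    return sorted(EXPECTED_COLUMNS - set(actual_columns))
```

An empty list means every expected column is present; extra columns are ignored by this sketch.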
|
||||
|
||||
### 5. Test Locally
|
||||
|
||||
```bash
|
||||
# Run the script manually
|
||||
python scripts/creativex_scoring_storing.py
|
||||
```
|
||||
|
||||
**Expected behaviors:**
|
||||
- If no PDFs in Box folder 350605024645: "No PDF files found" email sent
|
||||
- If PDFs present: Extraction runs, scores stored, files deleted from Box
|
||||
- If extraction fails: Partial success email with errors
|
||||
|
||||
## Production Server Deployment
|
||||
|
||||
### Prerequisites
|
||||
- Server already running Ferrero automation (A1→A2, A5→A6, etc.)
|
||||
- Git repository backed up to Bitbucket
|
||||
- SSH access to production server
|
||||
|
||||
### Step 1: Update .env on Server
|
||||
|
||||
SSH to server and add:
|
||||
|
||||
```bash
|
||||
cd /opt/ferrero-automation/Python-Version
|
||||
nano .env
|
||||
```
|
||||
|
||||
Add:
|
||||
```bash
|
||||
# CreativeX Configuration
|
||||
LLAMA_CLOUD_API_KEY=your_production_llama_cloud_api_key
|
||||
```
|
||||
|
||||
Save and exit (Ctrl+X, Y, Enter).
|
||||
|
||||
### Step 2: Pull Latest Code
|
||||
|
||||
```bash
|
||||
cd /opt/ferrero-automation/Python-Version
|
||||
git pull origin main
|
||||
```
|
||||
|
||||
This will include:
|
||||
- `scripts/creativex_scoring_storing.py`
|
||||
- Updated `database/init.sql`
|
||||
- Updated `scripts/shared/database.py`
|
||||
- Updated `scripts/shared/notifier.py`
|
||||
- Updated `config/config.yaml`
|
||||
- Updated `requirements.txt`
|
||||
|
||||
### Step 3: Install Dependencies
|
||||
|
||||
```bash
|
||||
cd /opt/ferrero-automation/Python-Version
|
||||
source venv/bin/activate
|
||||
pip install llama-cloud-services
|
||||
```
|
||||
|
||||
Or update all:
|
||||
```bash
|
||||
pip install -r requirements.txt --upgrade
|
||||
```
|
||||
|
||||
### Step 4: Create Database Table
|
||||
|
||||
```bash
|
||||
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
|
||||
CREATE TABLE IF NOT EXISTS creativex_scores (
|
||||
id SERIAL PRIMARY KEY,
|
||||
filename VARCHAR(500) NOT NULL,
|
||||
box_file_id VARCHAR(255),
|
||||
creativex_id VARCHAR(255),
|
||||
creativex_url TEXT,
|
||||
quality_score VARCHAR(50),
|
||||
full_extraction_data JSONB,
|
||||
extracted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
|
||||
status VARCHAR(50) DEFAULT 'active',
|
||||
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_creativex_filename ON creativex_scores(filename);
|
||||
CREATE INDEX IF NOT EXISTS idx_creativex_box_file ON creativex_scores(box_file_id);
|
||||
CREATE INDEX IF NOT EXISTS idx_creativex_status ON creativex_scores(status);
|
||||
"
|
||||
```
|
||||
|
||||
### Step 5: Verify Installation
|
||||
|
||||
```bash
|
||||
# Check database table
|
||||
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "SELECT COUNT(*) FROM creativex_scores;"
|
||||
|
||||
# Check script exists
|
||||
ls -lh scripts/creativex_scoring_storing.py
|
||||
|
||||
# Make sure it's executable (chmod sets the bit rather than checking it)
|
||||
chmod +x scripts/creativex_scoring_storing.py
|
||||
```
|
||||
|
||||
### Step 6: Test Run
|
||||
|
||||
```bash
|
||||
cd /opt/ferrero-automation/Python-Version
|
||||
source venv/bin/activate
|
||||
python scripts/creativex_scoring_storing.py
|
||||
```
|
||||
|
||||
Check logs:
|
||||
```bash
|
||||
tail -f logs/creativex_scoring.log
|
||||
```
|
||||
|
||||
### Step 7: Add to Cron (Optional - If Automated)
|
||||
|
||||
**Note:** This workflow is currently run manually, so skip this step for now.
|
||||
|
||||
If you want to automate later (e.g., every hour):
|
||||
|
||||
```bash
|
||||
crontab -e
|
||||
```
|
||||
|
||||
Add:
|
||||
```cron
|
||||
# CreativeX Score Extraction - Every hour
|
||||
0 * * * * cd /opt/ferrero-automation/Python-Version && venv/bin/python scripts/creativex_scoring_storing.py >> logs/cron_creativex.log 2>&1
|
||||
```
|
||||
|
||||
Save and exit.
|
||||
|
||||
## Configuration Details
|
||||
|
||||
### Box Folder
|
||||
- **Folder ID:** 350605024645
|
||||
- **Purpose:** Drop PDFs here for CreativeX score extraction
|
||||
- **Behavior:** Files are automatically deleted after successful processing
|
||||
|
||||
### LlamaExtract Agent
|
||||
- **Agent Name:** Creativex-Extract
|
||||
- **Expected Fields:**
|
||||
- `filename`: Original filename from PDF
|
||||
- `creativeXId.id`: CreativeX identifier
|
||||
- `creativeXId.url`: CreativeX URL
|
||||
- `ferreroCreativeQuality.percentage`: Quality score
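
The dotted names above describe nested keys in the agent's `data` payload. As a rough illustration of how such a path can be read defensively, here is a small sketch; the helper `get_path` and all sample values are hypothetical, and the real script performs the equivalent lookups inline.

```python
def get_path(data, dotted_path, default=""):
    """Walk nested dicts along a dotted path such as 'creativeXId.id'.

    Returns `default` as soon as a level is missing or is not a dict.
    """
    current = data
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical payload shaped like the expected fields listed above.
sample = {
    "filename": "example.pdf",
    "creativeXId": {"id": "12345", "url": "https://example.invalid/12345"},
    "ferreroCreativeQuality": {"percentage": "85%"},
}

print(get_path(sample, "creativeXId.id"))
print(get_path(sample, "ferreroCreativeQuality.percentage"))
```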
|
||||
|
||||
### Database Storage
|
||||
- **Table:** `creativex_scores`
|
||||
- **Quick Access Fields:** filename, creativex_id, creativex_url, quality_score
|
||||
- **Full JSON:** Stored in `full_extraction_data` JSONB column
|
||||
- **Purpose:** Future lookups by filename during DAM uploads
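
Because `quality_score` is stored as `VARCHAR(50)`, consumers that need a number have to convert it themselves. The exact text the agent returns is not documented here, so the sketch below simply assumes values such as `"85%"` or `"72.5 %"`; the helper name `score_to_float` is hypothetical.

```python
import re


def score_to_float(raw):
    """Extract the first number from a stored score string, or return None.

    Assumes formats like '85', '85%' or '72.5 %'; anything without a
    number (empty string, 'n/a', None) yields None.
    """
    if raw is None:
        return None
    match = re.search(r"-?\d+(?:\.\d+)?", str(raw))
    return float(match.group(0)) if match else None
```

Returning `None` rather than `0.0` for unparseable values keeps "no score" distinguishable from a genuine score of zero.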
|
||||
|
||||
### Email Notifications
|
||||
|
||||
**Recipients configured in .env:**
|
||||
- Success: `REPORT_EMAILS`
|
||||
- Errors: `ERROR_EMAIL`
|
||||
|
||||
**Templates:**
|
||||
1. `creativex_complete` - All files processed successfully
|
||||
2. `creativex_partial` - Some files failed
|
||||
3. `creativex_no_files` - No PDFs found (normal if folder empty)
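
The choice between these templates follows the file counts of a run. The rule used by the script can be sketched as a small function; the name `choose_template` is hypothetical, since the script makes this decision inline.

```python
def choose_template(total_files, failed_count):
    """Pick the email template name for a run.

    - no PDFs found            -> creativex_no_files
    - every file succeeded     -> creativex_complete
    - at least one file failed -> creativex_partial
    """
    if total_files == 0:
        return "creativex_no_files"
    if failed_count == 0:
        return "creativex_complete"
    return "creativex_partial"
```

Note that a run in which every file fails also maps to `creativex_partial` under this rule.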
|
||||
|
||||
## Usage
|
||||
|
||||
### Manual Execution
|
||||
|
||||
```bash
|
||||
cd /opt/ferrero-automation/Python-Version
|
||||
source venv/bin/activate
|
||||
python scripts/creativex_scoring_storing.py
|
||||
```
|
||||
|
||||
### Workflow
|
||||
|
||||
1. Upload PDFs to Box folder 350605024645
|
||||
2. Run script (manual or cron)
|
||||
3. Script downloads each PDF
|
||||
4. LlamaExtract processes PDF
|
||||
5. Results stored in database
|
||||
6. PDF deleted from Box
|
||||
7. Email notification sent
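
The ordering of steps 3 to 6 matters: a PDF is deleted from Box only after its score has been stored. A minimal sketch of that ordering, using stand-in callables instead of the real Box, LlamaExtract and database clients (all names below are hypothetical):

```python
def process_one(file_id, download, extract, store, delete):
    """Run steps 3-6 for one PDF; return True only if the score was stored."""
    local_path = download(file_id)      # step 3: download the PDF
    extraction = extract(local_path)    # step 4: run the extraction
    if extraction is None:
        return False                    # extraction failed: PDF stays in Box
    if not store(extraction):           # step 5: store in the database
        return False                    # storage failed: PDF stays in Box
    delete(file_id)                     # step 6: delete only after storage
    return True


# In-memory stand-ins to demonstrate the ordering.
stored, deleted = [], []


def fake_store(extraction):
    stored.append(extraction)
    return True


ok = process_one(
    "111",
    download=lambda fid: "/tmp/{}.pdf".format(fid),
    extract=lambda path: {"data": {"filename": "example.pdf"}},
    store=fake_store,
    delete=deleted.append,
)
failed = process_one(
    "222",
    download=lambda fid: "/tmp/{}.pdf".format(fid),
    extract=lambda path: None,          # simulate a failed extraction
    store=fake_store,
    delete=deleted.append,
)
```

After these two calls only file `111` has been deleted; file `222` is left in place for review or retry.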
|
||||
|
||||
### Checking Results
|
||||
|
||||
```bash
|
||||
# View recent extractions
|
||||
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
|
||||
SELECT filename, creativex_id, quality_score, extracted_at
|
||||
FROM creativex_scores
|
||||
ORDER BY extracted_at DESC
|
||||
LIMIT 10;
|
||||
"
|
||||
|
||||
# Count total scores
|
||||
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
|
||||
SELECT COUNT(*) as total_scores FROM creativex_scores WHERE status = 'active';
|
||||
"
|
||||
|
||||
# View specific file
|
||||
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
|
||||
SELECT * FROM creativex_scores WHERE filename LIKE '%yourfile%';
|
||||
"
|
||||
```
|
||||
|
||||
### Viewing Full JSON
|
||||
|
||||
```bash
|
||||
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
|
||||
SELECT filename, jsonb_pretty(full_extraction_data)
|
||||
FROM creativex_scores
|
||||
WHERE filename = 'example.pdf';
|
||||
"
|
||||
```
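
Once a row has been loaded in Python, the same JSON can be printed in readable form. This is a small sketch: the top-level keys mirror the dictionary the script builds before storing, while the values and the helper name `pretty_extraction` are hypothetical.

```python
import json


def pretty_extraction(full_extraction_data):
    """Render the stored extraction dict as indented, key-sorted JSON."""
    # default=str keeps the call from failing on non-JSON types such as datetimes.
    return json.dumps(full_extraction_data, indent=2, sort_keys=True, default=str)


# Hypothetical example shaped like the stored extraction result.
sample = {
    "run_id": "run-123",
    "extraction_agent_id": "agent-456",
    "data": {"filename": "example.pdf"},
    "extraction_metadata": {},
}

print(pretty_extraction(sample))
```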
|
||||
|
||||
## Future Integration
|
||||
|
||||
The database method `db.get_creativex_score_by_filename(filename)` is ready for use in other scripts.
|
||||
|
||||
**Example usage in future DAM upload workflow:**
|
||||
|
||||
```python
# In a2_to_a3_upload_polling.py or similar
filename = "Brand_Country_Language_123456.mp4"

# Lookup CreativeX score
score_data = db.get_creativex_score_by_filename(filename)

if score_data:
    # Add to DAM metadata
    dam_metadata['FERRERO.FIELD.CREATIVEX_SCORE'] = score_data['quality_score']
    dam_metadata['FERRERO.FIELD.CREATIVEX_URL'] = score_data['creativex_url']
    dam_metadata['FERRERO.FIELD.CREATIVEX_ID'] = score_data['creativex_id']
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### "llama-cloud-services not installed"
|
||||
```bash
|
||||
source venv/bin/activate
|
||||
pip install llama-cloud-services
|
||||
```
|
||||
|
||||
### "Agent 'Creativex-Extract' not found"
|
||||
- Verify agent name in LlamaCloud portal
|
||||
- Check spelling matches exactly: `Creativex-Extract`
|
||||
- Verify API key is correct
|
||||
|
||||
### "No PDF files found"
|
||||
- This is normal if Box folder 350605024645 is empty
|
||||
- Upload test PDF to folder and re-run
|
||||
|
||||
### "Database connection failed"
|
||||
```bash
|
||||
# Check PostgreSQL is running
|
||||
docker ps | grep ferrero
|
||||
|
||||
# Test connection
|
||||
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "SELECT 1;"
|
||||
```
|
||||
|
||||
### "Email not sending"
|
||||
- Check SMTP configuration in .env
|
||||
- Verify Mailgun credentials
|
||||
- Check logs for detailed error
|
||||
|
||||
### Files not deleted from Box
|
||||
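- This is expected for failed extractions; only successful extractions delete files
- Failed files remain in the folder for manual review or retry
- A file can also remain if the Box delete call itself fails after a successful extraction; the script logs this as a warning and still counts the file as processed, so re-running would store a second row for it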
|
||||
|
||||
## Rollback Instructions
|
||||
|
||||
If you need to rollback:
|
||||
|
||||
### Remove Database Table
|
||||
```bash
|
||||
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "
|
||||
DROP TABLE IF EXISTS creativex_scores CASCADE;
|
||||
"
|
||||
```
|
||||
|
||||
### Remove from Cron
|
||||
```bash
|
||||
crontab -e
|
||||
# Delete the CreativeX line, save and exit
|
||||
```
|
||||
|
||||
### Revert Code
|
||||
```bash
|
||||
cd /opt/ferrero-automation/Python-Version
|
||||
git revert <commit-hash>
|
||||
git push origin main
|
||||
```
|
||||
|
||||
## Support
|
||||
|
||||
- **Logs:** `logs/creativex_scoring.log`
|
||||
- **Database Queries:** See "Checking Results" section above
|
||||
- **Email Test:** Check SMTP settings and recipients list
|
||||
- **LlamaCloud Issues:** Verify API key and agent configuration
|
||||
|
||||
## Summary Checklist
|
||||
|
||||
**Local Setup:**
|
||||
- [ ] Add `LLAMA_CLOUD_API_KEY` to .env
|
||||
- [ ] Install `llama-cloud-services` package
|
||||
- [ ] Create `creativex_scores` table
|
||||
- [ ] Test script runs successfully
|
||||
|
||||
**Production Deployment:**
|
||||
- [ ] Git pull latest code
|
||||
- [ ] Add `LLAMA_CLOUD_API_KEY` to server .env
|
||||
- [ ] Install dependencies on server
|
||||
- [ ] Create database table on server
|
||||
- [ ] Test run on server
|
||||
- [ ] Verify email notifications
|
||||
- [ ] (Optional) Add to cron if automating
|
||||
|
||||
**Post-Deployment:**
|
||||
- [ ] Upload test PDF to Box folder 350605024645
|
||||
- [ ] Run script and verify extraction
|
||||
- [ ] Check database record created
|
||||
- [ ] Verify PDF deleted from Box
|
||||
- [ ] Confirm email notification received
|
||||
|
|
@ -95,6 +95,12 @@ notifications:
|
|||
fields:
|
||||
mappings_file: config/field_mappings.yaml
|
||||
|
||||
# CreativeX Configuration
|
||||
creativex:
|
||||
llama_api_key: ${LLAMA_CLOUD_API_KEY}
|
||||
agent_name: Creativex-Extract
|
||||
box_folder_id: "350605024645"
|
||||
|
||||
# Logging Configuration
|
||||
logging:
|
||||
level: INFO
|
||||
|
|
|
|||
|
|
@ -172,6 +172,35 @@ CREATE TABLE IF NOT EXISTS campaign_status (
|
|||
|
||||
\echo 'Table campaign_status created'
|
||||
|
||||
-- ============================================================================
|
||||
-- Table: creativex_scores
|
||||
-- Purpose: Stores CreativeX quality scores extracted from PDFs via LlamaExtract
|
||||
-- ============================================================================
|
||||
|
||||
CREATE TABLE IF NOT EXISTS creativex_scores (
|
||||
-- Primary Key
|
||||
id SERIAL PRIMARY KEY,
|
||||
|
||||
-- File Information
|
||||
filename VARCHAR(500) NOT NULL,
|
||||
box_file_id VARCHAR(255),
|
||||
|
||||
-- CreativeX Data (parsed fields for quick access)
|
||||
creativex_id VARCHAR(255),
|
||||
creativex_url TEXT,
|
||||
quality_score VARCHAR(50),
|
||||
|
||||
-- Full Extraction Data (JSONB - Complete LlamaExtract response for future use)
|
||||
full_extraction_data JSONB,
|
||||
|
||||
-- Timestamps
|
||||
extracted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
|
||||
status VARCHAR(50) DEFAULT 'active',
|
||||
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
|
||||
);
|
||||
|
||||
\echo 'Table creativex_scores created'
|
||||
|
||||
\echo 'Tables created successfully'
|
||||
|
||||
-- ============================================================================
|
||||
|
|
@ -211,6 +240,11 @@ CREATE INDEX IF NOT EXISTS idx_campaign_status_status ON campaign_status(status)
|
|||
CREATE INDEX IF NOT EXISTS idx_campaign_status_live ON campaign_status(live_campaign);
|
||||
CREATE INDEX IF NOT EXISTS idx_campaign_status_webhook_sent ON campaign_status(webhook_sent);
|
||||
|
||||
-- creativex_scores indexes
|
||||
CREATE INDEX IF NOT EXISTS idx_creativex_filename ON creativex_scores(filename);
|
||||
CREATE INDEX IF NOT EXISTS idx_creativex_box_file ON creativex_scores(box_file_id);
|
||||
CREATE INDEX IF NOT EXISTS idx_creativex_status ON creativex_scores(status);
|
||||
|
||||
\echo 'Indexes created successfully'
|
||||
|
||||
-- ============================================================================
|
||||
|
|
@ -323,8 +357,10 @@ GRANT USAGE ON SCHEMA public TO ferrero_user;
|
|||
\echo ' - derivative_assets'
|
||||
\echo ' - asset_events'
|
||||
\echo ' - workflow_state'
|
||||
\echo ' - campaign_status'
|
||||
\echo ' - creativex_scores'
|
||||
\echo ''
|
||||
\echo 'Indexes created: 12'
|
||||
\echo 'Indexes created: 15'
|
||||
\echo 'Triggers created: 4'
|
||||
\echo 'Functions created: 2'
|
||||
\echo ''
|
||||
|
|
|
|||
|
|
@ -24,6 +24,9 @@ cryptography>=3.4.0
|
|||
# Email templates
|
||||
Jinja2>=3.0.0
|
||||
|
||||
# LlamaExtract for CreativeX score extraction
|
||||
llama-cloud-services>=0.1.0
|
||||
|
||||
# Retry logic
|
||||
tenacity>=8.0.0
|
||||
|
||||
|
|
|
|||
396
Python-Version/scripts/creativex_scoring_storing.py
Executable file
|
|
@ -0,0 +1,396 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
CreativeX Score Extractor and Storage
|
||||
Processes PDFs from Box folder 350605024645, extracts CreativeX scores using LlamaExtract,
|
||||
stores results in database, and removes processed files from Box.
|
||||
Compatible with Python 3.6+
|
||||
"""
|
||||
|
||||
import sys
|
||||
import os
|
||||
import logging
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
|
||||
# Add shared library to path
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..'))
|
||||
|
||||
from shared.config_loader import load_config
|
||||
from shared.box_client import BoxClient
|
||||
from shared.database import Database
|
||||
from shared.notifier import Notifier
|
||||
|
||||
# Setup logging with rotation
|
||||
from logging.handlers import RotatingFileHandler
|
||||
|
||||
# Create logs directory if it doesn't exist
|
||||
os.makedirs('logs', exist_ok=True)
|
||||
|
||||
# Configure logging with rotation
|
||||
log_handler = RotatingFileHandler(
|
||||
'logs/creativex_scoring.log',
|
||||
maxBytes=10*1024*1024, # 10MB per file
|
||||
backupCount=28  # Keep up to 28 rotated files (rotation is size-based: roughly 290MB in total, not a fixed time span)
|
||||
)
|
||||
log_handler.setLevel(logging.INFO)
|
||||
log_handler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
|
||||
|
||||
console_handler = logging.StreamHandler()
|
||||
console_handler.setLevel(logging.INFO)
|
||||
console_handler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
handlers=[log_handler, console_handler]
|
||||
)
|
||||
|
||||
logger = logging.getLogger('CreativeXScoring')
|
||||
|
||||
|
||||
class CreativeXExtractor:
|
||||
"""Handles extraction of CreativeX data from PDF files using LlamaExtract."""
|
||||
|
||||
def __init__(self, api_key, agent_name):
|
||||
"""
|
||||
Initialize the Llama Extract client.
|
||||
|
||||
Args:
|
||||
api_key: LlamaCloud API key
|
||||
agent_name: Agent name in LlamaExtract
|
||||
"""
|
||||
try:
|
||||
from llama_cloud_services import LlamaExtract
|
||||
self.extractor = LlamaExtract(api_key=api_key)
|
||||
self.agent_name = agent_name
|
||||
logger.info("LlamaExtract client initialized with agent: {}".format(agent_name))
|
||||
except ImportError:
|
||||
logger.error("llama-cloud-services not installed. Run: pip install llama-cloud-services")
|
||||
raise
|
||||
except Exception as e:
|
||||
logger.error("Failed to initialize LlamaExtract: {}".format(str(e)))
|
||||
raise
|
||||
|
||||
def extract_from_file(self, file_path):
|
||||
"""
|
||||
Extract data from a PDF file using Llama Extract.
|
||||
|
||||
Args:
|
||||
file_path: Path to the PDF file
|
||||
|
||||
Returns:
|
||||
Dictionary containing the extraction result, or None if extraction fails
|
||||
"""
|
||||
try:
|
||||
logger.info(" Getting agent: {}".format(self.agent_name))
|
||||
agent = self.extractor.get_agent(name=self.agent_name)
|
||||
|
||||
if agent is None:
|
||||
raise Exception("Agent '{}' not found".format(self.agent_name))
|
||||
|
||||
logger.info(" Running extraction on: {}".format(os.path.basename(file_path)))
|
||||
result = agent.extract(str(file_path))
|
||||
|
||||
# Convert result to dictionary format
|
||||
extraction_data = {
|
||||
"run_id": getattr(result, "run_id", None),
|
||||
"extraction_agent_id": getattr(result, "extraction_agent_id", None),
|
||||
"data": result.data if hasattr(result, "data") else {},
|
||||
"extraction_metadata": getattr(result, "extraction_metadata", {})
|
||||
}
|
||||
|
||||
return extraction_data
|
||||
|
||||
except Exception as e:
|
||||
logger.error(" ERROR: Extraction failed - {}".format(str(e)))
|
||||
return None
|
||||
|
||||
def parse_csv_fields(self, extraction_data):
|
||||
"""
|
||||
Parse specific fields for database storage.
|
||||
|
||||
Expected fields:
|
||||
- filename
|
||||
- creativeXId.id
|
||||
- creativeXId.url
|
||||
- ferreroCreativeQuality.percentage
|
||||
|
||||
Args:
|
||||
extraction_data: Full extraction result dictionary
|
||||
|
||||
Returns:
|
||||
Dictionary with parsed fields (missing fields default to empty strings), or None if parsing raises an exception
|
||||
"""
|
||||
try:
|
||||
data = extraction_data.get("data", {})
|
||||
|
||||
# Extract filename
|
||||
filename = data.get("filename", "")
|
||||
|
||||
# Extract creativeXId fields
|
||||
creative_x_id_obj = data.get("creativeXId", {})
|
||||
creative_x_id = creative_x_id_obj.get("id", "") if isinstance(creative_x_id_obj, dict) else ""
|
||||
creative_x_url = creative_x_id_obj.get("url", "") if isinstance(creative_x_id_obj, dict) else ""
|
||||
|
||||
# Extract ferreroCreativeQuality percentage
|
||||
ferrero_quality_obj = data.get("ferreroCreativeQuality", {})
|
||||
quality_score = ferrero_quality_obj.get("percentage", "") if isinstance(ferrero_quality_obj, dict) else ""
|
||||
|
||||
# Validate that we have the critical fields
|
||||
if not filename:
|
||||
logger.warning(" WARNING: filename field is missing from extraction data")
|
||||
|
||||
return {
|
||||
"filename": filename,
|
||||
"id": creative_x_id,
|
||||
"url": creative_x_url,
|
||||
"score": quality_score
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
logger.error(" ERROR: Failed to parse CSV fields - {}".format(str(e)))
|
||||
return None
|
||||
|
||||
|
||||
def process_pdfs(box_client, db, extractor, notifier, config):
|
||||
"""
|
||||
Process all PDFs in the CreativeX Box folder.
|
||||
|
||||
Args:
|
||||
box_client: BoxClient instance
|
||||
db: Database instance
|
||||
extractor: CreativeXExtractor instance
|
||||
notifier: Notifier instance
|
||||
config: Configuration dict
|
||||
|
||||
Returns:
|
||||
dict with processing results
|
||||
"""
|
||||
creativex_folder_id = config['creativex']['box_folder_id']
|
||||
|
||||
logger.info("=" * 60)
|
||||
logger.info("CreativeX Score Extraction")
|
||||
logger.info("=" * 60)
|
||||
logger.info("Box Folder ID: {}".format(creativex_folder_id))
|
||||
logger.info("")
|
||||
|
||||
try:
|
||||
# List all PDF files in Box folder
|
||||
files = box_client.list_folder_files(creativex_folder_id)
|
||||
pdf_files = [f for f in files if f['name'].lower().endswith('.pdf')]
|
||||
|
||||
if not pdf_files:
|
||||
logger.info("No PDF files found in Box folder")
|
||||
|
||||
# Send email notification
|
||||
notifier.send_email(
|
||||
template_name='creativex_no_files',
|
||||
recipients=config['notifications']['recipients']['success'],
|
||||
data={
|
||||
'timestamp': datetime.now().strftime("%Y-%m-%d %H:%M:%S")
|
||||
}
|
||||
)
|
||||
|
||||
return {'success': True, 'file_count': 0, 'processed': 0, 'failed': 0}
|
||||
|
||||
logger.info("Found {} PDF file(s) to process".format(len(pdf_files)))
|
||||
logger.info("")
|
||||
|
||||
# Create temp directory
|
||||
temp_dir = Path('temp/creativex')
|
||||
temp_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Track results
|
||||
processed_files = []
|
||||
failed_files = []
|
||||
|
||||
# Process each PDF
|
||||
for idx, file_info in enumerate(pdf_files, 1):
|
||||
file_id = file_info['id']
|
||||
filename = file_info['name']
|
||||
|
||||
logger.info("[{}/{}] Processing: {}".format(idx, len(pdf_files), filename))
|
||||
|
||||
try:
|
||||
# 1. Download PDF from Box
|
||||
temp_file_path = temp_dir / filename
|
||||
box_client.download_file(file_id, str(temp_file_path))
|
||||
|
||||
# 2. Extract data using LlamaExtract
|
||||
extraction_data = extractor.extract_from_file(str(temp_file_path))
|
||||
|
||||
if extraction_data is None:
|
||||
raise Exception("Extraction returned None")
|
||||
|
||||
# 3. Parse fields
|
||||
parsed_fields = extractor.parse_csv_fields(extraction_data)
|
||||
|
||||
if not parsed_fields:
|
||||
raise Exception("Failed to parse extraction fields")
|
||||
|
||||
# 4. Store in database with full JSON
|
||||
db_result = db.store_creativex_score(
|
||||
filename=parsed_fields['filename'],
|
||||
creativex_id=parsed_fields['id'],
|
||||
creativex_url=parsed_fields['url'],
|
||||
quality_score=parsed_fields['score'],
|
||||
box_file_id=file_id,
|
||||
full_extraction_data=extraction_data
|
||||
)
|
||||
|
||||
if not db_result['success']:
|
||||
raise Exception("Database storage failed: {}".format(db_result.get('error', 'Unknown')))
|
||||
|
||||
# 5. Delete file from Box (only after successful storage)
|
||||
try:
|
||||
box_file = box_client.client.file(file_id)
|
||||
box_file.delete()
|
||||
logger.info(" Deleted from Box: {}".format(filename))
|
||||
except Exception as e:
|
||||
logger.warning(" Could not delete file from Box: {}".format(str(e)))
|
||||
# Don't fail the whole process if delete fails
|
||||
|
||||
# 6. Clean up local temp file
|
||||
try:
|
||||
os.remove(str(temp_file_path))
|
||||
except Exception as e:
|
||||
logger.warning(" Could not delete temp file: {}".format(str(e)))
|
||||
|
||||
# Track success
|
||||
processed_files.append({
|
||||
'filename': parsed_fields['filename'],
|
||||
'creativex_id': parsed_fields['id'],
|
||||
'creativex_url': parsed_fields['url'],
|
||||
'quality_score': parsed_fields['score'],
|
||||
'box_file_id': file_id
|
||||
})
|
||||
|
||||
logger.info(" ✓ Success: Score {} extracted and stored".format(parsed_fields['score']))
|
||||
logger.info("")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(" ✗ Failed: {}".format(str(e)))
|
||||
logger.info("")
|
||||
|
||||
failed_files.append({
|
||||
'filename': filename,
|
||||
'box_file_id': file_id,
|
||||
'error': str(e)
|
||||
})
|
||||
|
||||
# Clean up temp file if it exists
|
||||
try:
|
||||
temp_file_path = temp_dir / filename
|
||||
if temp_file_path.exists():
|
||||
os.remove(str(temp_file_path))
|
||||
except OSError:
|
||||
pass
|
||||
|
||||
# Summary
|
||||
total_files = len(pdf_files)
|
||||
success_count = len(processed_files)
|
||||
failed_count = len(failed_files)
|
||||
|
||||
logger.info("=" * 60)
|
||||
logger.info("Processing Complete")
|
||||
logger.info("=" * 60)
|
||||
logger.info("Total Files: {}".format(total_files))
|
||||
logger.info("Successful: {}".format(success_count))
|
||||
logger.info("Failed: {}".format(failed_count))
|
||||
logger.info("")
|
||||
|
||||
# Send email notification
|
||||
if failed_count == 0:
|
||||
# All successful
|
||||
notifier.send_email(
|
||||
template_name='creativex_complete',
|
||||
recipients=config['notifications']['recipients']['success'],
|
||||
data={
|
||||
'file_count': total_files,
|
||||
'success_count': success_count,
|
||||
'processed_files': processed_files
|
||||
}
|
||||
)
|
||||
else:
|
||||
# Partial success
|
||||
notifier.send_email(
|
||||
template_name='creativex_partial',
|
||||
recipients=config['notifications']['recipients']['errors'],
|
||||
data={
|
||||
'file_count': total_files,
|
||||
'success_count': success_count,
|
||||
'failed_count': failed_count,
|
||||
'processed_files': processed_files,
|
||||
'failed_files': failed_files
|
||||
}
|
||||
)
|
||||
|
||||
return {
|
||||
'success': success_count > 0,
|
||||
'file_count': total_files,
|
||||
'processed': success_count,
|
||||
'failed': failed_count
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
logger.error("FATAL ERROR: {}".format(str(e)))
|
||||
return {'success': False, 'error': str(e)}
|
||||
|
||||
|
||||
def main():
|
||||
"""Entry point."""
|
||||
try:
|
||||
logger.info("Starting CreativeX Score Extraction")
|
||||
logger.info("")
|
||||
|
||||
# Load configuration
|
||||
config = load_config('config/config.yaml')
|
||||
|
||||
# Initialize clients
|
||||
logger.info("Initializing clients...")
|
||||
|
||||
# Box client for CreativeX folder
|
||||
box = BoxClient(config, root_folder_id=config['creativex']['box_folder_id'])
|
||||
|
||||
# Database
|
||||
db = Database(config)
|
||||
|
||||
# Notifier
|
||||
notifier = Notifier(config)
|
||||
|
||||
# CreativeX Extractor
|
||||
extractor = CreativeXExtractor(
|
||||
api_key=config['creativex']['llama_api_key'],
|
||||
agent_name=config['creativex']['agent_name']
|
||||
)
|
||||
|
||||
logger.info("Clients initialized successfully")
|
||||
logger.info("")
|
||||
|
||||
# Process PDFs
|
||||
result = process_pdfs(box, db, extractor, notifier, config)
|
||||
|
||||
if result['success']:
|
||||
logger.info("✓ CreativeX extraction completed successfully")
|
||||
sys.exit(0)
|
||||
else:
|
||||
logger.error("✗ CreativeX extraction failed")
|
||||
sys.exit(1)
|
||||
|
||||
except KeyboardInterrupt:
|
||||
logger.info("\n\nProcess interrupted by user.")
|
||||
sys.exit(1)
|
||||
except Exception as e:
|
||||
logger.error("\nFATAL ERROR: {}".format(str(e)))
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
sys.exit(1)
|
||||
finally:
|
||||
# Close database connections
|
||||
try:
|
||||
db.close()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
|
@ -536,6 +536,103 @@ class Database:
|
|||
cursor.close()
|
||||
self.put_connection(conn)
|
||||
|
||||
    def store_creativex_score(self, filename, creativex_id, creativex_url, quality_score, box_file_id, full_extraction_data):
        """
        Store CreativeX score data extracted from PDF

        Args:
            filename: Original filename from extraction
            creativex_id: CreativeX ID from extraction
            creativex_url: CreativeX URL from extraction
            quality_score: Quality score percentage
            box_file_id: Box file ID for tracking
            full_extraction_data: Complete LlamaExtract JSON response

        Returns:
            dict with 'success' boolean, plus 'error' message on failure
        """
        conn = self.get_connection()
        cursor = None
        try:
            cursor = conn.cursor()

            # Serialize dicts and lists to a JSON string; pass strings through unchanged
            import json
            full_json = json.dumps(full_extraction_data) if isinstance(full_extraction_data, (dict, list)) else full_extraction_data

            cursor.execute("""
                INSERT INTO creativex_scores (
                    filename, creativex_id, creativex_url, quality_score,
                    box_file_id, full_extraction_data
                ) VALUES (%s, %s, %s, %s, %s, %s)
            """, (
                filename,
                creativex_id,
                creativex_url,
                quality_score,
                box_file_id,
                full_json
            ))

            conn.commit()
            logger.info("Stored CreativeX score: {} (Score: {})".format(filename, quality_score))

            return {'success': True}

        except Exception as e:
            conn.rollback()
            logger.error("Failed to store CreativeX score: {}".format(str(e)))
            return {'success': False, 'error': str(e)}

        finally:
            # cursor is still None if conn.cursor() raised
            if cursor is not None:
                cursor.close()
            self.put_connection(conn)

    def get_creativex_score_by_filename(self, filename):
        """
        Get the most recent active CreativeX score data by filename

        Args:
            filename: Filename to search for

        Returns:
            dict with creativex data or None if not found
        """
        conn = self.get_connection()
        cursor = None
        try:
            cursor = conn.cursor()

            cursor.execute("""
                SELECT filename, creativex_id, creativex_url, quality_score,
                       box_file_id, full_extraction_data, extracted_at
                FROM creativex_scores
                WHERE filename = %s AND status = 'active'
                ORDER BY extracted_at DESC
                LIMIT 1
            """, (filename,))

            row = cursor.fetchone()

            if not row:
                return None

            # JSONB may arrive already decoded (dict or list); only parse raw strings
            import json
            full_data = json.loads(row[5]) if isinstance(row[5], str) else row[5]

            return {
                'filename': row[0],
                'creativex_id': row[1],
                'creativex_url': row[2],
                'quality_score': row[3],
                'box_file_id': row[4],
                'full_extraction_data': full_data,
                'extracted_at': row[6]
            }

        finally:
            # cursor is still None if conn.cursor() raised
            if cursor is not None:
                cursor.close()
            self.put_connection(conn)

    def close(self):
        """Close all connections in pool"""
        if self.pool:
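The two methods above store one row per extraction and read back the newest active row for a filename. The sketch below reproduces that round trip with the standard-library `sqlite3` module standing in for PostgreSQL, so it runs without a server. The table layout, the `?` placeholders, the `id` tiebreaker, and the helper names `make_db`, `store_score`, and `latest_score` are assumptions for illustration; the real table uses a JSONB column and is defined elsewhere in the commit.

```python
import json
import sqlite3


def make_db():
    """Create an in-memory table loosely modeled on creativex_scores (assumed columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE creativex_scores (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            filename TEXT NOT NULL,
            creativex_id TEXT,
            creativex_url TEXT,
            quality_score TEXT,
            box_file_id TEXT,
            full_extraction_data TEXT,
            status TEXT DEFAULT 'active',
            extracted_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
    """)
    return conn


def store_score(conn, filename, creativex_id, creativex_url, quality_score, box_file_id, data):
    """Insert one extraction, serializing dicts and lists to JSON text."""
    payload = json.dumps(data) if isinstance(data, (dict, list)) else data
    with conn:  # commits on success, rolls back if the insert raises
        conn.execute(
            "INSERT INTO creativex_scores (filename, creativex_id, creativex_url, "
            "quality_score, box_file_id, full_extraction_data) VALUES (?, ?, ?, ?, ?, ?)",
            (filename, creativex_id, creativex_url, quality_score, box_file_id, payload),
        )


def latest_score(conn, filename):
    """Return the newest active row for filename as a dict, or None."""
    row = conn.execute(
        "SELECT filename, quality_score, box_file_id, full_extraction_data "
        "FROM creativex_scores WHERE filename = ? AND status = 'active' "
        "ORDER BY extracted_at DESC, id DESC LIMIT 1",
        (filename,),
    ).fetchone()
    if row is None:
        return None
    return {
        'filename': row[0],
        'quality_score': row[1],
        'box_file_id': row[2],
        'full_extraction_data': json.loads(row[3]),
    }
```

Because the same PDF can be extracted more than once, the lookup orders by `extracted_at` and takes one row. The extra `id DESC` term in this sketch only breaks ties between rows written within the same second.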
@@ -678,6 +678,110 @@ class Notifier:
                </div>
            </div>
            """
        },
||||
        'creativex_complete': {
            'subject': "✅ CreativeX Scores Extracted - {file_count} files processed",
            'html': """
            <div style="font-family: Arial, sans-serif; max-width: 900px; margin: 0 auto;">
                <div style="background-color: #9c27b0; color: white; padding: 20px; text-align: center; border-radius: 8px 8px 0 0;">
                    <h1 style="margin: 0;">✅ CreativeX Score Extraction Complete</h1>
                </div>

                <div style="background-color: #f3e5f5; border-left: 4px solid #9c27b0; padding: 15px; margin: 20px 0;">
                    <p style="margin: 0;"><strong>Files Processed:</strong> {{ file_count }}</p>
                    <p style="margin: 5px 0 0 0;"><strong>Scores Extracted:</strong> {{ success_count }}</p>
                    <p style="margin: 5px 0 0 0;"><strong>Source:</strong> Box Folder 350605024645</p>
                </div>

                <h3 style="margin-top: 30px; color: #333;">Extracted Scores:</h3>
                {% for score in processed_files %}
                <div style="border: 1px solid #ddd; margin: 15px 0; padding: 15px; background-color: #fafafa; border-radius: 4px;">
                    <div style="background-color: #9c27b0; color: white; padding: 10px 15px; margin: -15px -15px 15px -15px; border-radius: 4px 4px 0 0;">
                        <strong>{{ score.filename }}</strong>
                    </div>
                    <div style="padding: 10px; background-color: white; border-radius: 4px;">
                        <p style="margin: 5px 0;"><span style="font-weight: bold;">Quality Score:</span> <span style="font-size: 20px; color: #9c27b0;">{{ score.quality_score }}</span></p>
                        <p style="margin: 5px 0;"><span style="font-weight: bold;">CreativeX ID:</span> {{ score.creativex_id }}</p>
                        {% if score.creativex_url %}<p style="margin: 5px 0;"><span style="font-weight: bold;">CreativeX URL:</span> <a href="{{ score.creativex_url }}">{{ score.creativex_url }}</a></p>{% endif %}
                        <p style="margin: 5px 0;"><span style="font-weight: bold;">Box File ID:</span> {{ score.box_file_id }}</p>
                    </div>
                </div>
                {% endfor %}

                <div style="background-color: #f3e5f5; border-left: 4px solid #9c27b0; padding: 15px; margin: 20px 0;">
                    <p style="margin: 0;"><strong>✓ Complete:</strong> All scores extracted and stored in database.</p>
                    <p style="margin: 5px 0 0 0;"><strong>Files Removed:</strong> Processed PDFs deleted from Box folder.</p>
                </div>

                <p style="color: #666; font-size: 12px; margin-top: 20px;">CreativeX scores stored with full JSON for future lookups.</p>
            </div>
            """
        },
||||
        'creativex_partial': {
            'subject': "⚠️ CreativeX Extraction Partial - {success_count}/{file_count} successful",
            'html': """
            <div style="font-family: Arial, sans-serif; max-width: 900px; margin: 0 auto;">
                <div style="background-color: #ff9800; color: white; padding: 20px; text-align: center; border-radius: 8px 8px 0 0;">
                    <h1 style="margin: 0;">⚠️ CreativeX Extraction Partially Complete</h1>
                </div>

                <div style="background-color: #fff3e0; border-left: 4px solid #ff9800; padding: 15px; margin: 20px 0;">
                    <p style="margin: 0;"><strong>Total Files:</strong> {{ file_count }}</p>
                    <p style="margin: 5px 0 0 0;"><strong>✓ Successful:</strong> <span style="color: #28a745;">{{ success_count }}</span></p>
                    <p style="margin: 5px 0 0 0;"><strong>✗ Failed:</strong> <span style="color: #d32f2f;">{{ failed_count }}</span></p>
                    <p style="margin: 5px 0 0 0;"><strong>Source:</strong> Box Folder 350605024645</p>
                </div>

                {% if processed_files %}
                <h3 style="margin-top: 30px; color: #28a745;">✅ Successful Extractions ({{ success_count }}):</h3>
                {% for score in processed_files %}
                <div style="border: 1px solid #c8e6c9; margin: 10px 0; padding: 12px; background-color: #f1f8e9; border-radius: 4px;">
                    <strong>{{ score.filename }}</strong> - Score: {{ score.quality_score }}
                </div>
                {% endfor %}
                {% endif %}

                {% if failed_files %}
                <h3 style="margin-top: 30px; color: #d32f2f;">❌ Failed Extractions ({{ failed_count }}):</h3>
                {% for file in failed_files %}
                <div style="border: 1px solid #ffcdd2; margin: 10px 0; padding: 12px; background-color: #ffebee; border-radius: 4px;">
                    <strong>{{ file.filename }}</strong>
                    <br><small style="color: #666;">Error: {{ file.error }}</small>
                </div>
                {% endfor %}
                {% endif %}

                <div style="background-color: #fff3e0; border-left: 4px solid #ff9800; padding: 15px; margin: 20px 0;">
                    <p style="margin: 0;"><strong>⚠️ Action Required:</strong> Review failed extractions above.</p>
                    <p style="margin: 5px 0 0 0;">Failed files remain in Box folder for retry.</p>
                </div>

                <p style="color: #666; font-size: 12px; margin-top: 20px;">Successful scores stored in database. Failed files not deleted from Box.</p>
            </div>
            """
        },
||||
        'creativex_no_files': {
            'subject': "ℹ️ CreativeX Extraction - No files found",
            'html': """
            <div style="font-family: Arial, sans-serif; max-width: 900px; margin: 0 auto;">
                <div style="background-color: #607d8b; color: white; padding: 20px; text-align: center; border-radius: 8px 8px 0 0;">
                    <h1 style="margin: 0;">ℹ️ CreativeX Extraction - No Files</h1>
                </div>

                <div style="background-color: #eceff1; border-left: 4px solid #607d8b; padding: 15px; margin: 20px 0;">
                    <p style="margin: 0;"><strong>Status:</strong> No PDF files found</p>
                    <p style="margin: 5px 0 0 0;"><strong>Source:</strong> Box Folder 350605024645</p>
                    <p style="margin: 5px 0 0 0;"><strong>Run Time:</strong> {{ timestamp }}</p>
                </div>

                <div style="background-color: #e3f2fd; border-left: 4px solid #2196f3; padding: 15px; margin: 20px 0;">
                    <p style="margin: 0;"><strong>ℹ️ Note:</strong> This is expected behavior when no new PDFs are ready for processing.</p>
                    <p style="margin: 5px 0 0 0;">Upload PDFs to Box folder 350605024645 to process CreativeX scores.</p>
                </div>

                <p style="color: #666; font-size: 12px; margin-top: 20px;">Script completed successfully with no errors.</p>
            </div>
            """
        }
    }

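The three templates above cover the complete, partial, and no-files outcomes. The code that picks between them is not part of this hunk, so the following is only a sketch of one plausible selection rule. `choose_template` and `render_subject` are hypothetical helpers, and rendering subjects with `str.format` is an assumption based on their single-brace placeholders.

```python
# Subject lines copied from the templates above; the HTML bodies are omitted here.
SUBJECTS = {
    'creativex_complete': "✅ CreativeX Scores Extracted - {file_count} files processed",
    'creativex_partial': "⚠️ CreativeX Extraction Partial - {success_count}/{file_count} successful",
    'creativex_no_files': "ℹ️ CreativeX Extraction - No files found",
}


def choose_template(file_count, success_count):
    """Pick a template name from the run totals (assumed rule, not from the commit)."""
    if file_count == 0:
        return 'creativex_no_files'
    if success_count == file_count:
        return 'creativex_complete'
    # Any failure, including a run where every file failed, uses the partial template
    return 'creativex_partial'


def render_subject(template_name, **counts):
    """Fill the single-brace placeholders; unused keyword arguments are ignored."""
    return SUBJECTS[template_name].format(**counts)
```

For a run with three PDFs and two successful extractions, `choose_template(3, 2)` selects `creativex_partial`, and `render_subject` fills in the `2/3` counts in the subject line.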