Adidas Brief Extraction System - Technical Documentation
Overview
The Adidas Brief Extraction System is a CLI-based document analysis tool that uses OpenAI's GPT-5 model with configurable reasoning effort to extract structured marketing asset information from creative briefs, presentations, and technical documents. It runs a multi-pass analysis pipeline in which a validation pass reviews the initial extraction for missing or incorrectly extracted assets.
Key Features
- GPT-5 Integration: Uses OpenAI's latest GPT-5 model with configurable reasoning effort
- LlamaParser Integration: Advanced document preprocessing for optimal content extraction
- Multi-Pass Analysis: Multi-perspective analysis with cross-validation
- Cost Tracking: Comprehensive token usage and cost management
- Schema Validation: Pydantic-based data models with JSON schema validation
- Comprehensive Logging: Detailed processing logs with error handling
System Architecture
graph TB
subgraph "Input Layer"
A[Document Files] --> B{Document Classifier}
B --> C[PowerPoint]
B --> D[Word]
B --> E[PDF]
B --> F[Excel]
end
subgraph "Preprocessing Layer"
C --> G[LlamaParser Cloud]
D --> G
E --> G
F --> G
G --> H[Structured Markdown]
end
subgraph "AI Processing Layer"
H --> I[DocumentAnalyzer]
I --> J[Multi-Perspective Analysis]
J --> K[Cross-Validation & Enhancement]
K --> L[Asset Extraction Result]
end
subgraph "Output Layer"
L --> M[CSV Generator]
M --> N[Structured CSV Output]
end
subgraph "External Services"
O[OpenAI GPT-5 API] --> J
O --> K
P[LlamaCloud API] --> G
end
subgraph "Monitoring & Logging"
Q[Token Usage Tracker] --> I
R[Cost Calculator] --> Q
S[Processing Logger] --> I
end
Core Components
1. DocumentAnalyzer Class (process_brief_enhanced.py:216)
The main orchestrator that manages the entire document processing pipeline.
Key Responsibilities:
- Model configuration and API client setup
- Document type classification
- Multi-pass analysis coordination
- Token usage tracking
- Error handling and logging
Configuration:
analyzer = DocumentAnalyzer(
    model_name='gpt-5',         # OpenAI model
    reasoning_effort='medium'   # high, medium, low, minimal
)
2. LlamaParser Integration (process_brief_enhanced.py:278-325)
Advanced document preprocessing using LlamaCloud services for optimal content extraction.
Configuration Parameters:
parser = LlamaParse(
    api_key=LLAMACLOUD_API_KEY,
    parse_mode="parse_page_with_agent",   # Agent-based parsing
    model="openai-gpt-5",                 # Using GPT-5 for parsing
    high_res_ocr=True,                    # High-resolution OCR
    adaptive_long_table=True,             # Smart table handling
    outlined_table_extraction=True,       # Table structure detection
    output_tables_as_HTML=True,           # HTML table output
    page_separator="\n\n---\n\n"          # Page separation marker
)
3. Data Models
MarketingAsset (Pydantic Model - process_brief_enhanced.py:50)
class MarketingAsset(BaseModel):
    title: str
    status: Optional[str] = ""
    category: Optional[str] = ""
    media: Optional[str] = ""
    asset_type: Optional[str] = ""
    brand_identifier: Optional[str] = ""
    format: Optional[str] = ""
    review_date: Optional[str] = ""
    live_date: Optional[str] = ""
    end_date: Optional[str] = ""
    reference_material: Optional[str] = ""
    language: Optional[str] = ""
    country: Optional[str] = ""
    quantity: Optional[str] = "1"
    page_number: Optional[str] = ""
    section_context: Optional[str] = ""
    priority_level: Optional[str] = ""
    technical_requirements: Optional[str] = ""
    creative_direction: Optional[str] = ""
    approval_level: Optional[str] = ""
TokenUsage (Cost Tracking - process_brief_enhanced.py:164)
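from dataclasses import dataclass

@dataclass
class TokenUsage:
    input_tokens: int = 0
    cached_input_tokens: int = 0
    output_tokens: int = 0

    def calculate_cost(self, model_name: str) -> float:
        # GPT-5 pricing in USD per 1M tokens
        pricing = {
            'input': 2.50,
            'cached_input': 1.25,
            'output': 10.00
        }
        # Same per-million arithmetic as the listing under Token Usage Tracking
        return (
            self.input_tokens * pricing['input']
            + self.cached_input_tokens * pricing['cached_input']
            + self.output_tokens * pricing['output']
        ) / 1_000_000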
Process Flow
sequenceDiagram
participant CLI as CLI Interface
participant DA as DocumentAnalyzer
participant LP as LlamaParser
participant GPT5 as OpenAI GPT-5
participant CSV as CSV Generator
participant LOG as Logger
CLI->>DA: Initialize (model_name, reasoning_effort)
DA->>LOG: Setup logging
CLI->>DA: process_document_multi_pass(filepath)
Note over DA: Stage 1: Document Classification
DA->>DA: classify_document(filepath)
DA->>LOG: Log document type
Note over DA: Stage 2: Content Extraction
DA->>LP: _extract_document_content(filepath)
LP->>LP: Parse with agent-based parsing
LP->>DA: Return structured markdown
DA->>LOG: Log content extraction success
Note over DA: Stage 3: Multi-Perspective Analysis
DA->>DA: _perform_multi_perspective_analysis()
DA->>DA: _load_prompt('multi_perspective_analysis')
DA->>GPT5: responses.parse() with reasoning effort
GPT5->>DA: Return structured assets
DA->>DA: Track token usage
DA->>LOG: Log analysis completion
Note over DA: Stage 4: Cross-Validation
DA->>DA: _enhance_and_validate_results()
DA->>DA: _load_prompt('validation_analysis')
DA->>GPT5: responses.parse() for validation
GPT5->>DA: Return additional/corrected assets
DA->>DA: Merge results
DA->>LOG: Log validation results
Note over DA: Stage 5: Output Generation
DA->>CSV: Generate CSV output
CSV->>CLI: Return output filepath
DA->>LOG: Log cost summary
CLI->>CLI: Display processing summary
Detailed Processing Stages
Stage 1: Document Classification (process_brief_enhanced.py:254)
def classify_document(self, filepath: str) -> DocumentType:
    extension = os.path.splitext(filepath)[1].lower()
    if extension in ['.ppt', '.pptx']:
        return DocumentType.POWERPOINT
    elif extension in ['.doc', '.docx']:
        return DocumentType.WORD
    elif extension == '.pdf':
        return DocumentType.PDF
    elif extension in ['.xls', '.xlsx']:
        return DocumentType.EXCEL
    else:
        return DocumentType.UNKNOWN
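As an illustration only, the same extension check can be written as a lookup table, which makes adding a new format a one-line change. This is a minimal, self-contained sketch: `EXTENSION_TO_TYPE` and `classify_by_extension` are hypothetical names, not identifiers from `process_brief_enhanced.py`, and the `DocumentType` enum is redefined here so the snippet runs on its own.

```python
from enum import Enum
from pathlib import Path


class DocumentType(Enum):
    POWERPOINT = "powerpoint"
    WORD = "word"
    PDF = "pdf"
    EXCEL = "excel"
    UNKNOWN = "unknown"


# Hypothetical lookup table; the extensions mirror the if/elif chain above.
EXTENSION_TO_TYPE = {
    ".ppt": DocumentType.POWERPOINT,
    ".pptx": DocumentType.POWERPOINT,
    ".doc": DocumentType.WORD,
    ".docx": DocumentType.WORD,
    ".pdf": DocumentType.PDF,
    ".xls": DocumentType.EXCEL,
    ".xlsx": DocumentType.EXCEL,
}


def classify_by_extension(filepath: str) -> DocumentType:
    """Map a file path to a DocumentType using only its extension."""
    extension = Path(filepath).suffix.lower()
    return EXTENSION_TO_TYPE.get(extension, DocumentType.UNKNOWN)


print(classify_by_extension("briefs/Campaign.PPTX").value)  # powerpoint
print(classify_by_extension("notes.txt").value)             # unknown
```

Note that classification by extension does not inspect file contents, so a renamed file is classified by its name alone.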
Stage 2: Content Extraction with LlamaParser
The system uses LlamaCloud's parsing service to convert complex documents into clean, structured markdown:
Key Features:
- Agent-based parsing: Uses AI agents to understand document structure
- High-resolution OCR: Extracts text from images and scanned documents
- Adaptive table handling: Detects and preserves table structures
- Page separation: Maintains document flow with clear page boundaries
Stage 3: Multi-Perspective Analysis
Uses external prompt files for maintainable AI instructions:
Prompt Loading System (process_brief_enhanced.py:241):
def _load_prompt(self, prompt_name: str) -> str:
    prompt_path = os.path.join(os.path.dirname(__file__), 'prompts', f'{prompt_name}.txt')
    with open(prompt_path, 'r', encoding='utf-8') as f:
        return f.read().strip()
Analysis Prompts:
- multi_perspective_analysis.txt: Comprehensive asset extraction rules
- system_multi_perspective.txt: System message for analysis
- validation_analysis.txt: Quality assurance and gap analysis
- system_validation.txt: System message for validation
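Because the pipeline depends on all four prompt files being present, it can help to check for them before making any API call. The following is a sketch under assumptions: `load_prompt` and `missing_prompts` are hypothetical standalone helpers (the real `_load_prompt` is a method and resolves the directory from `__file__`), and the demo uses a temporary directory rather than the project's `prompts` folder.

```python
import tempfile
from pathlib import Path
from typing import List

# The four prompt names listed above, without the .txt extension.
REQUIRED_PROMPTS = [
    "multi_perspective_analysis",
    "system_multi_perspective",
    "validation_analysis",
    "system_validation",
]


def load_prompt(prompt_dir: Path, prompt_name: str) -> str:
    """Read one prompt file and strip surrounding whitespace."""
    prompt_path = prompt_dir / f"{prompt_name}.txt"
    if not prompt_path.is_file():
        raise FileNotFoundError(f"Prompt file not found: {prompt_path}")
    return prompt_path.read_text(encoding="utf-8").strip()


def missing_prompts(prompt_dir: Path) -> List[str]:
    """Return the required prompt names that have no file in prompt_dir."""
    return [
        name for name in REQUIRED_PROMPTS
        if not (prompt_dir / f"{name}.txt").is_file()
    ]


# Demo against a throwaway directory containing a single prompt file.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "system_validation.txt").write_text(
        "  Validate the assets.\n", encoding="utf-8"
    )
    print(load_prompt(demo_dir, "system_validation"))
    print(missing_prompts(demo_dir))
```

Failing early with the list of missing files gives a clearer error than a `FileNotFoundError` raised partway through a paid analysis run.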
Stage 4: Cross-Validation & Enhancement
Implements a two-pass validation system:
- Initial Analysis: Extracts assets from multiple professional perspectives
- Validation Pass: Reviews extraction completeness and accuracy
- Gap Analysis: Identifies missing or incorrectly extracted assets
- Result Merging: Combines and deduplicates findings
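The result-merging step can be sketched as follows. The source does not describe how duplicates are detected, so the matching key used here (normalised title, format, language, and country) is an assumption, and `merge_assets` is a hypothetical helper rather than the system's own function. Assets are represented as plain dictionaries to keep the example free of third-party imports.

```python
from typing import Dict, Iterable, List, Tuple

# Assumption: two rows describe the same asset when these fields match
# after lowercasing and collapsing whitespace.
KEY_FIELDS = ("title", "format", "language", "country")


def _dedup_key(asset: Dict[str, str]) -> Tuple[str, ...]:
    """Build a normalised comparison key for one asset."""
    return tuple(
        " ".join((asset.get(field) or "").lower().split())
        for field in KEY_FIELDS
    )


def merge_assets(
    initial: Iterable[Dict[str, str]],
    validated: Iterable[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Combine both passes, keeping first-seen rows and filling empty fields."""
    merged: Dict[Tuple[str, ...], Dict[str, str]] = {}
    for asset in list(initial) + list(validated):
        key = _dedup_key(asset)
        if key not in merged:
            merged[key] = dict(asset)
            continue
        # Duplicate: only fill fields the earlier row left empty.
        for field, value in asset.items():
            if value and not merged[key].get(field):
                merged[key][field] = value
    return list(merged.values())


first_pass = [{"title": "Hero Banner", "format": "1920x1080", "status": ""}]
second_pass = [
    {"title": " hero  banner ", "format": "1920x1080", "status": "Active"},
    {"title": "Social Post", "format": "1080x1080"},
]
for row in merge_assets(first_pass, second_pass):
    print(row)
```

Keeping the first-seen row and only filling gaps means the validation pass can add detail but cannot silently overwrite values from the initial analysis; a different policy may suit other workflows.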
Stage 5: CSV Output Generation
Generates structured CSV files with 20 standardized columns:
CSV_HEADERS = [
    'title', 'status', 'category', 'media', 'asset_type',
    'brand_identifier', 'format', 'review_date', 'live_date',
    'end_date', 'reference_material', 'language', 'country',
    'quantity', 'page_number', 'section_context', 'priority_level',
    'technical_requirements', 'creative_direction', 'approval_level'
]
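A minimal sketch of how rows could be written against these headers with the standard library's `csv.DictWriter`. The function name `write_assets_csv` is hypothetical, and the quoting style of the real generator is not stated in the source, so the default minimal quoting is used here.

```python
import csv
import io
from typing import Dict, Iterable, TextIO

CSV_HEADERS = [
    'title', 'status', 'category', 'media', 'asset_type',
    'brand_identifier', 'format', 'review_date', 'live_date',
    'end_date', 'reference_material', 'language', 'country',
    'quantity', 'page_number', 'section_context', 'priority_level',
    'technical_requirements', 'creative_direction', 'approval_level'
]


def write_assets_csv(assets: Iterable[Dict[str, str]], stream: TextIO) -> int:
    """Write assets as CSV rows and return the number of rows written."""
    writer = csv.DictWriter(
        stream,
        fieldnames=CSV_HEADERS,
        restval="",             # missing fields become empty cells
        extrasaction="ignore",  # keys outside CSV_HEADERS are dropped
    )
    writer.writeheader()
    count = 0
    for asset in assets:
        writer.writerow(asset)
        count += 1
    return count


# In-memory demo; for a real file, open it with newline="" as the csv docs advise.
buffer = io.StringIO()
written = write_assets_csv(
    [{"title": "Display Banner - Hero", "status": "Active", "quantity": "1"}],
    buffer,
)
print(written)
print(buffer.getvalue())
```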
API Integration Details
OpenAI GPT-5 Integration
Configuration (process_brief_enhanced.py:224):
def _setup_model(self):
    return OpenAI(
        api_key=OPENAI_API_KEY,
        timeout=3600,    # 60 minutes for reasoning tasks
        max_retries=2    # Reduced retries for efficiency
    )
API Call Pattern:
response = self.model.responses.parse(
    model=self.model_name,  # 'gpt-5'
    input=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": combined_prompt}
    ],
    reasoning={"effort": self.reasoning_effort},  # high/medium/low/minimal
    text_format=AssetExtractionResult  # Pydantic schema
)
LlamaCloud Integration
Document Processing:
def _extract_document_content(self, filepath: str) -> str:
    parser = LlamaParse(
        api_key=LLAMACLOUD_API_KEY,
        parse_mode="parse_page_with_agent",
        model="openai-gpt-5",
        # ... additional configuration
    )
    result = parser.parse(filepath)
    markdown_documents = result.get_markdown_documents(split_by_page=True)
    combined_content = "\n\n".join([doc.text for doc in markdown_documents])
    return combined_content
Configuration Management
Environment Setup
Required Dependencies:
# Core AI/ML libraries
openai>=1.0.0
llama-cloud-services>=0.6.62
google-generativeai>=0.3.0
json5>=0.9.0
# Document processing
python-pptx>=0.6.21
PyMuPDF>=1.23.0
python-docx>=0.8.11
openpyxl>=3.1.0
# Data processing
pandas>=2.0.0
numpy>=1.24.0
API Keys Configuration:
# Required API Keys in process_brief_enhanced.py
OPENAI_API_KEY = "your-openai-api-key" # Line 30
LLAMACLOUD_API_KEY = "your-llama-api-key" # Line 31
GEMINI_API_KEY = "legacy-key" # Line 29 (unused)
Virtual Environment Setup
# Create virtual environment
python -m venv venv
# Activate environment
source venv/bin/activate # Linux/Mac
venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements_enhanced.txt
Usage Examples
Basic Usage
# Default processing (medium reasoning)
python process_brief_enhanced.py document.pptx
# High reasoning effort for complex documents
python process_brief_enhanced.py complex_brief.pdf high
# Low reasoning effort for simple documents
python process_brief_enhanced.py simple_brief.docx low
Command Line Arguments
- filepath (required): Path to document file
- reasoning_effort (optional): high, medium (default), low, minimal
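The two positional arguments above could be parsed with `argparse` along these lines. This is a sketch, not the script's actual argument handling (the source does not show it); `build_parser` and `parse_args` are hypothetical names.

```python
import argparse
from typing import List, Optional

REASONING_EFFORTS = ("high", "medium", "low", "minimal")


def build_parser() -> argparse.ArgumentParser:
    """Create a parser with a required filepath and optional reasoning effort."""
    parser = argparse.ArgumentParser(
        prog="process_brief_enhanced.py",
        description="Extract marketing assets from a brief document.",
    )
    parser.add_argument("filepath", help="Path to document file")
    parser.add_argument(
        "reasoning_effort",
        nargs="?",                  # makes the positional optional
        default="medium",
        choices=REASONING_EFFORTS,  # rejects any other value
        help="Reasoning effort (default: medium)",
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv, or sys.argv[1:] when argv is None."""
    return build_parser().parse_args(argv)


args = parse_args(["complex_brief.pdf", "high"])
print(args.filepath, args.reasoning_effort)  # complex_brief.pdf high
print(parse_args(["document.pptx"]).reasoning_effort)  # medium
```

Using `choices` means a mistyped effort level fails with a usage message before any document is uploaded or any tokens are spent.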
Output Structure
Console Output:
=== ENHANCED BRIEF PROCESSING STARTED ===
Document Type: powerpoint
Assets Extracted: 245
Confidence Score: 0.95
Processing Notes: Multi-perspective analysis completed, Added 12 assets from validation
Output File: output/document-20250102140530.csv
=== COST ANALYSIS ===
Model Used: gpt-5
Input Tokens: 45,230
Output Tokens: 12,450
Total Cost: $0.2376
CSV Output Format:
title,status,category,media,asset_type,brand_identifier,format,review_date,...
"Display Banner - Hero","Active","Display","Image","PNG","Adidas","1920x1080","2025-01-15",...
"Social Media Post","Pending","Social","Image","JPG","Adidas","1080x1080","2025-01-10",...
Cost Management & Performance
Token Usage Tracking
Cost Calculation (process_brief_enhanced.py:176):
def calculate_cost(self, model_name: str) -> float:
    pricing = OPENAI_PRICING[model_name]
    input_cost = (self.input_tokens / 1_000_000) * pricing['input']
    cached_cost = (self.cached_input_tokens / 1_000_000) * pricing['cached_input']
    output_cost = (self.output_tokens / 1_000_000) * pricing['output']
    return input_cost + cached_cost + output_cost
GPT-5 Pricing (per 1M tokens):
- Input: $2.50
- Cached Input: $1.25
- Output: $10.00
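As a worked check of this arithmetic, the snippet below applies the rates listed above to the token counts from the sample console output (45,230 input and 12,450 output tokens). The `OPENAI_PRICING` dictionary is reconstructed from the listed rates, and its exact shape in `process_brief_enhanced.py` is an assumption.

```python
from dataclasses import dataclass

# Rates in USD per 1M tokens, as documented above. The dictionary layout
# is an assumption made for this standalone example.
OPENAI_PRICING = {
    "gpt-5": {"input": 2.50, "cached_input": 1.25, "output": 10.00},
}


@dataclass
class TokenUsage:
    input_tokens: int = 0
    cached_input_tokens: int = 0
    output_tokens: int = 0

    def calculate_cost(self, model_name: str) -> float:
        """Return the cost in USD for the recorded token counts."""
        pricing = OPENAI_PRICING[model_name]
        input_cost = (self.input_tokens / 1_000_000) * pricing["input"]
        cached_cost = (self.cached_input_tokens / 1_000_000) * pricing["cached_input"]
        output_cost = (self.output_tokens / 1_000_000) * pricing["output"]
        return input_cost + cached_cost + output_cost


usage = TokenUsage(input_tokens=45_230, output_tokens=12_450)
# 45,230 * 2.50 / 1e6 = 0.113075 and 12,450 * 10.00 / 1e6 = 0.124500
print(f"${usage.calculate_cost('gpt-5'):.4f}")  # $0.2376
```

The result matches the Total Cost line in the sample console output. Output tokens account for slightly more than half of the total here, despite being about a quarter of the input volume.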
Performance Characteristics
Processing Time:
- Simple documents (1-10 pages): 30 seconds - 2 minutes
- Complex documents (10+ pages): 2-5 minutes
- Reasoning effort impact: High effort adds 50-100% processing time
Memory Usage:
- Base memory: ~200MB
- Document processing: +50-200MB depending on document size
- AI processing: +100-300MB during API calls
Typical Token Usage:
- Small brief (5 pages): 10K-20K input, 2K-5K output
- Medium brief (15 pages): 30K-50K input, 5K-10K output
- Large brief (30+ pages): 60K-100K input, 10K-20K output
Error Handling & Logging
Logging Configuration (process_brief_enhanced.py:548)
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('processing.log', mode='w'),
        logging.StreamHandler(sys.stdout)
    ]
)
Error Handling Patterns
API Error Handling:
try:
    response = self.model.responses.parse(...)
    # Process successful response
except Exception as e:
    logging.error(f"Multi-perspective analysis failed: {e}")
    return ProcessingResult([], {}, 0.0, [f"Analysis failed: {e}"], TokenUsage())
File Processing Errors:
try:
    document_content = self._extract_document_content(filepath)
except Exception as e:
    logging.error(f"Content extraction failed: {e}")
    return ProcessingResult([], {}, 0.0, [f"Content extraction failed: {e}"], TokenUsage())
Common Error Scenarios
- API Key Issues: Missing or invalid OpenAI/LlamaCloud API keys
- File Access: Permissions or file corruption issues
- API Timeouts: Large documents exceeding timeout limits
- Rate Limits: API rate limiting from OpenAI or LlamaCloud
- Memory Issues: Very large documents causing memory exhaustion
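For transient failures such as timeouts and rate limits, one common mitigation is to retry with exponential backoff. The sketch below is not part of the documented system (the OpenAI client is already configured with `max_retries=2`); `call_with_backoff` is a hypothetical wrapper that could surround a stage such as content extraction, and which exception types count as retryable is left to the caller.

```python
import logging
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    retryable: Tuple[Type[BaseException], ...],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying retryable errors with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt; jitter spreads out concurrent retries.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay)
            logging.warning(
                "Attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay
            )
            sleep(delay)
    raise RuntimeError("unreachable")


calls = {"count": 0}


def flaky_extraction() -> str:
    """Simulated stage that times out twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "structured markdown"


# The demo passes a no-op sleep so it finishes immediately.
print(call_with_backoff(flaky_extraction, (TimeoutError,), sleep=lambda s: None))
```

Errors that are not listed in `retryable`, such as an invalid API key, propagate on the first attempt, since retrying them only adds delay.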
System Comparison & Benchmarking
Comparison Script (compare_systems.py)
Benchmarks the enhanced system against baseline implementations:
Key Metrics:
- Asset extraction count
- Processing time
- Cost per document
- Accuracy assessment
Usage:
python compare_systems.py
Expected Improvements:
- 50-300% increase in asset extraction completeness
- Superior technical specification accuracy
- Better handling of multi-language/multi-market requirements
Extension Points
1. Adding New Document Types
Extend DocumentType enum:
class DocumentType(Enum):
    POWERPOINT = "powerpoint"
    WORD = "word"
    PDF = "pdf"
    EXCEL = "excel"
    # Add new types here
    GOOGLE_SLIDES = "google_slides"
    UNKNOWN = "unknown"
Implement classification logic:
def classify_document(self, filepath: str) -> DocumentType:
    extension = os.path.splitext(filepath)[1].lower()
    # Add new extension handling
2. Custom Output Formats
Extend output generation:
def generate_json_output(self, results: ProcessingResult) -> str:
    """Generate JSON output format"""
    assets_dict = [asset for asset in results.raw_data]
    return json.dumps(assets_dict, indent=2)
3. Additional AI Models
Support multiple AI providers:
def _setup_model(self):
    if self.model_name.startswith('gpt-'):
        return self._setup_openai()
    elif self.model_name.startswith('claude-'):
        return self._setup_anthropic()
    # Add other providers
4. Custom Validation Rules
Extend validation logic:
def _custom_validation_rules(self, assets: List[Dict]) -> List[str]:
    """Apply business-specific validation rules"""
    issues = []
    for asset in assets:
        # Custom validation logic
        if not asset.get('format') and asset.get('media') == 'Image':
            issues.append(f"Image asset '{asset['title']}' missing format specification")
    return issues
Security Considerations
API Key Management
- Store API keys in environment variables
- Use secure key rotation practices
- Implement key validation before processing
- Consider using secret management services
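The first and third points can be sketched together: read the keys from environment variables and fail before processing if any is missing. `load_api_keys` is a hypothetical helper; the variable names reuse the constants from the configuration section, and the check only confirms that a value is present, not that the provider accepts it.

```python
import os
from typing import Dict, List, Mapping

REQUIRED_KEYS = ("OPENAI_API_KEY", "LLAMACLOUD_API_KEY")


def load_api_keys(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required API keys from env, or raise if any is unset."""
    missing: List[str] = [
        name for name in REQUIRED_KEYS if not (env.get(name) or "").strip()
    ]
    if missing:
        # Report variable names only; never echo key values.
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {name: env[name].strip() for name in REQUIRED_KEYS}


# Demo with an explicit mapping instead of the real environment.
demo_env = {"OPENAI_API_KEY": "demo-openai", "LLAMACLOUD_API_KEY": " demo-llama "}
keys = load_api_keys(demo_env)
print(sorted(keys))

try:
    load_api_keys({"OPENAI_API_KEY": "demo-openai"})
except RuntimeError as error:
    print(error)
```

Passing the mapping as a parameter keeps the function easy to test without modifying the process environment.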
Document Security
- Validate file types before processing
- Implement file size limits
- Scan for malicious content
- Ensure secure temporary file handling
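File type and size checks can run before anything is uploaded. In this sketch the allowed extensions follow the document types the system classifies, while the 50 MB limit and the `validate_input_file` name are illustrative assumptions. It does not scan for malicious content, which needs a dedicated tool.

```python
import tempfile
from pathlib import Path

ALLOWED_EXTENSIONS = {".ppt", ".pptx", ".doc", ".docx", ".pdf", ".xls", ".xlsx"}
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024  # hypothetical limit, adjust as needed


def validate_input_file(filepath: str, max_bytes: int = MAX_FILE_SIZE_BYTES) -> Path:
    """Return the path if it passes basic checks, otherwise raise."""
    path = Path(filepath)
    if not path.is_file():
        raise FileNotFoundError(f"Not a file: {path}")
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {path.suffix or '(none)'}")
    size = path.stat().st_size
    if size == 0:
        raise ValueError("File is empty")
    if size > max_bytes:
        raise ValueError(f"File is {size} bytes, limit is {max_bytes}")
    return path


# Demo with temporary files so the example does not depend on local documents.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "brief.pdf"
    good.write_bytes(b"%PDF-1.7 demo")
    print(validate_input_file(str(good)).name)

    bad = Path(tmp) / "script.exe"
    bad.write_bytes(b"demo")
    try:
        validate_input_file(str(bad))
    except ValueError as error:
        print(error)
```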
Data Privacy
- Log sanitization (avoid logging sensitive content)
- Secure deletion of temporary files
- Consider data residency requirements for API calls
- Implement audit trails for document processing
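Log sanitization can be approached with a `logging.Filter` that redacts secrets before a record reaches any handler output. This is a sketch under a stated assumption: the pattern treats tokens beginning with `sk-` or `llx-` as API keys, which should be adjusted to the key formats actually in use. `RedactSecretsFilter` is a hypothetical class, and it does not address brief content appearing in logs.

```python
import io
import logging
import re

# Assumption: API keys look like "sk-..." or "llx-..." followed by 8+ characters.
SECRET_PATTERN = re.compile(r"\b(?:sk|llx)-[A-Za-z0-9_-]{8,}")


class RedactSecretsFilter(logging.Filter):
    """Replace anything that looks like an API key with a placeholder."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # msg with args already applied
        redacted = SECRET_PATTERN.sub("[REDACTED]", message)
        if redacted != message:
            record.msg = redacted
            record.args = ()  # args are already merged into msg
        return True  # never drop the record, only rewrite it


# Demo: attach the filter to a handler that writes to an in-memory stream.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(RedactSecretsFilter())
demo_logger = logging.getLogger("brief_extraction_demo")
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False

demo_logger.error("Request failed for key %s", "sk-abcdef1234567890")
print(stream.getvalue().strip())
```

Attaching the filter to each handler, rather than to a single logger, covers records that arrive from child loggers as well.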
Troubleshooting Guide
Common Issues
1. "OPENAI_API_KEY not set" Error
# Solution: Set API key in process_brief_enhanced.py line 30
OPENAI_API_KEY = "your-actual-api-key"
2. LlamaParser Import Error
# Solution: Install llama-cloud-services
pip install "llama-cloud-services>=0.6.62"
3. Timeout Errors on Large Documents
# Solution: Use lower reasoning effort
python process_brief_enhanced.py large_doc.pdf low
4. High Processing Costs
# Solution: Use minimal reasoning for simple documents
python process_brief_enhanced.py simple_doc.docx minimal
Performance Optimization
1. Reduce Processing Time:
- Use low or minimal reasoning effort for simple documents
- Implement document pre-filtering
- Consider document splitting for very large files
2. Reduce Costs:
- Monitor token usage with cost summaries
- Use cached inputs when possible
- Optimize prompt length and complexity
3. Improve Accuracy:
- Use high reasoning effort for complex documents
- Customize prompts for specific document types
- Implement domain-specific validation rules
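The document-splitting suggestion could be sketched as grouping whole pages into chunks under a size budget. This assumes the extracted markdown still contains the `page_separator` string configured for LlamaParse; `split_into_chunks` and the character budget are hypothetical, and a production version would more likely budget by tokens than by characters.

```python
from typing import List

# Assumption: pages in the markdown are delimited by the configured separator.
PAGE_SEPARATOR = "\n\n---\n\n"


def split_into_chunks(markdown: str, max_chars: int) -> List[str]:
    """Group whole pages into chunks of at most max_chars where possible."""
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for page in markdown.split(PAGE_SEPARATOR):
        # Cost of appending this page, including a separator if needed.
        added = len(page) + (len(PAGE_SEPARATOR) if current else 0)
        if current and current_len + added > max_chars:
            chunks.append(PAGE_SEPARATOR.join(current))
            current, current_len = [], 0
            added = len(page)
        current.append(page)
        current_len += added
    if current:
        chunks.append(PAGE_SEPARATOR.join(current))
    return chunks


pages = ["a" * 10, "b" * 10, "c" * 10]
document = PAGE_SEPARATOR.join(pages)
for chunk in split_into_chunks(document, max_chars=30):
    print(len(chunk))
```

Pages are never split, so a single page larger than the budget becomes its own oversized chunk. Assets described across a page boundary may also be split between chunks, which the validation pass would need to account for.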
Future Enhancements
Planned Features
- Batch Processing: Process multiple documents simultaneously
- Document Templates: Predefined extraction rules for common brief types
- Quality Scoring: Automated confidence scoring for extractions
- Export Formats: Support for JSON, XML, and database outputs
- Integration APIs: REST API for system integration
- Real-time Monitoring: Processing metrics and alerting
Technical Roadmap
- Multi-threading: Parallel processing for large document batches
- Caching Layer: Redis-based result caching
- Database Integration: Direct database output options
- Containerization: Docker deployment support
- Cloud Deployment: AWS/GCP/Azure deployment options
Conclusion
The Adidas Brief Extraction System represents a sophisticated approach to automated document analysis, combining state-of-the-art AI models with robust engineering practices. The system's modular architecture, comprehensive error handling, and detailed monitoring make it suitable for production deployment in enterprise environments.
The multi-pass analysis pipeline ensures high accuracy in asset extraction, while the cost tracking and performance optimization features make it economically viable for large-scale document processing workflows.
For questions or support, refer to the project's logging output and error handling mechanisms, which provide detailed debugging information for troubleshooting and optimization.