Adidas Brief Extraction System - Technical Documentation

Overview

The Adidas Brief Extraction System is a CLI-based document analysis tool that uses OpenAI's GPT-5 model, with configurable reasoning effort, to extract structured marketing asset information from creative briefs, presentations, and technical documents. It runs a multi-pass pipeline (document parsing, multi-perspective extraction, then a validation pass) so that assets missed or misread in the first pass can be caught before the CSV is written.

Key Features

  • GPT-5 Integration: Uses OpenAI's GPT-5 model with configurable reasoning effort (high, medium, low, minimal)
  • LlamaParser Integration: Advanced document preprocessing for optimal content extraction
  • Multi-Pass Analysis: Multi-perspective analysis with cross-validation
  • Cost Tracking: Comprehensive token usage and cost management
  • Schema Validation: Pydantic-based data models with JSON schema validation
  • Comprehensive Logging: Detailed processing logs with error handling

System Architecture

graph TB
    subgraph "Input Layer"
        A[Document Files] --> B{Document Classifier}
        B --> C[PowerPoint]
        B --> D[Word]
        B --> E[PDF]
        B --> F[Excel]
    end
    
    subgraph "Preprocessing Layer"
        C --> G[LlamaParser Cloud]
        D --> G
        E --> G
        F --> G
        G --> H[Structured Markdown]
    end
    
    subgraph "AI Processing Layer"
        H --> I[DocumentAnalyzer]
        I --> J[Multi-Perspective Analysis]
        J --> K[Cross-Validation & Enhancement]
        K --> L[Asset Extraction Result]
    end
    
    subgraph "Output Layer"
        L --> M[CSV Generator]
        M --> N[Structured CSV Output]
    end
    
    subgraph "External Services"
        O[OpenAI GPT-5 API] --> J
        O --> K
        P[LlamaCloud API] --> G
    end
    
    subgraph "Monitoring & Logging"
        Q[Token Usage Tracker] --> I
        R[Cost Calculator] --> Q
        S[Processing Logger] --> I
    end

Core Components

1. DocumentAnalyzer Class (process_brief_enhanced.py:216)

The main orchestrator that manages the entire document processing pipeline.

Key Responsibilities:

  • Model configuration and API client setup
  • Document type classification
  • Multi-pass analysis coordination
  • Token usage tracking
  • Error handling and logging

Configuration:

analyzer = DocumentAnalyzer(
    model_name='gpt-5',           # OpenAI model
    reasoning_effort='medium'      # high, medium, low, minimal
)

2. LlamaParser Integration (process_brief_enhanced.py:278-325)

Documents are preprocessed with LlamaCloud's parsing service, which converts them to structured markdown before any AI analysis runs.

Configuration Parameters:

parser = LlamaParse(
    api_key=LLAMACLOUD_API_KEY,
    parse_mode="parse_page_with_agent",    # Agent-based parsing
    model="openai-gpt-5",                  # Using GPT-5 for parsing
    high_res_ocr=True,                     # High-resolution OCR
    adaptive_long_table=True,              # Smart table handling
    outlined_table_extraction=True,        # Table structure detection
    output_tables_as_HTML=True,            # HTML table output
    page_separator="\n\n---\n\n"           # Page separation marker
)

3. Data Models

MarketingAsset (Pydantic Model - process_brief_enhanced.py:50)

class MarketingAsset(BaseModel):
    title: str
    status: Optional[str] = ""
    category: Optional[str] = ""
    media: Optional[str] = ""
    asset_type: Optional[str] = ""
    brand_identifier: Optional[str] = ""
    format: Optional[str] = ""
    review_date: Optional[str] = ""
    live_date: Optional[str] = ""
    end_date: Optional[str] = ""
    reference_material: Optional[str] = ""
    language: Optional[str] = ""
    country: Optional[str] = ""
    quantity: Optional[str] = "1"
    page_number: Optional[str] = ""
    section_context: Optional[str] = ""
    priority_level: Optional[str] = ""
    technical_requirements: Optional[str] = ""
    creative_direction: Optional[str] = ""
    approval_level: Optional[str] = ""

TokenUsage (Cost Tracking - process_brief_enhanced.py:164)

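from dataclasses import dataclass

@dataclass
class TokenUsage:
    input_tokens: int = 0
    cached_input_tokens: int = 0
    output_tokens: int = 0

    def calculate_cost(self, model_name: str) -> float:
        # GPT-5 pricing in USD per 1M tokens
        pricing = {
            'input': 2.50,
            'cached_input': 1.25,
            'output': 10.00
        }
        # Each token class is billed at its own rate
        return (
            self.input_tokens * pricing['input']
            + self.cached_input_tokens * pricing['cached_input']
            + self.output_tokens * pricing['output']
        ) / 1_000_000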

Process Flow

sequenceDiagram
    participant CLI as CLI Interface
    participant DA as DocumentAnalyzer
    participant LP as LlamaParser
    participant GPT5 as OpenAI GPT-5
    participant CSV as CSV Generator
    participant LOG as Logger

    CLI->>DA: Initialize (model_name, reasoning_effort)
    DA->>LOG: Setup logging
    CLI->>DA: process_document_multi_pass(filepath)
    
    Note over DA: Stage 1: Document Classification
    DA->>DA: classify_document(filepath)
    DA->>LOG: Log document type
    
    Note over DA: Stage 2: Content Extraction
    DA->>LP: _extract_document_content(filepath)
    LP->>LP: Parse with agent-based parsing
    LP->>DA: Return structured markdown
    DA->>LOG: Log content extraction success
    
    Note over DA: Stage 3: Multi-Perspective Analysis
    DA->>DA: _perform_multi_perspective_analysis()
    DA->>DA: _load_prompt('multi_perspective_analysis')
    DA->>GPT5: responses.parse() with reasoning effort
    GPT5->>DA: Return structured assets
    DA->>DA: Track token usage
    DA->>LOG: Log analysis completion
    
    Note over DA: Stage 4: Cross-Validation
    DA->>DA: _enhance_and_validate_results()
    DA->>DA: _load_prompt('validation_analysis')
    DA->>GPT5: responses.parse() for validation
    GPT5->>DA: Return additional/corrected assets
    DA->>DA: Merge results
    DA->>LOG: Log validation results
    
    Note over DA: Stage 5: Output Generation
    DA->>CSV: Generate CSV output
    CSV->>CLI: Return output filepath
    DA->>LOG: Log cost summary
    CLI->>CLI: Display processing summary

Detailed Processing Stages

Stage 1: Document Classification (process_brief_enhanced.py:254)

def classify_document(self, filepath: str) -> DocumentType:
    extension = os.path.splitext(filepath)[1].lower()
    
    if extension in ['.ppt', '.pptx']:
        return DocumentType.POWERPOINT
    elif extension in ['.doc', '.docx']:
        return DocumentType.WORD
    elif extension == '.pdf':
        return DocumentType.PDF
    elif extension in ['.xls', '.xlsx']:
        return DocumentType.EXCEL
    else:
        return DocumentType.UNKNOWN

Stage 2: Content Extraction with LlamaParser

The system uses LlamaCloud's parsing service to convert complex documents into clean, structured markdown:

Key Features:

  • Agent-based parsing: Uses AI agents to understand document structure
  • High-resolution OCR: Extracts text from images and scanned documents
  • Adaptive table handling: Detects and preserves table structures
  • Page separation: Maintains document flow with clear page boundaries

Stage 3: Multi-Perspective Analysis

Prompts are loaded from external text files, so the AI instructions can be edited without changing code:

Prompt Loading System (process_brief_enhanced.py:241):

def _load_prompt(self, prompt_name: str) -> str:
    prompt_path = os.path.join(os.path.dirname(__file__), 'prompts', f'{prompt_name}.txt')
    with open(prompt_path, 'r', encoding='utf-8') as f:
        return f.read().strip()

Analysis Prompts:

  • multi_perspective_analysis.txt: Comprehensive asset extraction rules
  • system_multi_perspective.txt: System message for analysis
  • validation_analysis.txt: Quality assurance and gap analysis
  • system_validation.txt: System message for validation

Stage 4: Cross-Validation & Enhancement

Implements validation as a second model pass over the initial extraction, in four steps:

  1. Initial Analysis: Extracts assets from multiple professional perspectives
  2. Validation Pass: Reviews extraction completeness and accuracy
  3. Gap Analysis: Identifies missing or incorrectly extracted assets
  4. Result Merging: Combines and deduplicates findings
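
The result-merging step could be sketched as follows. This is a minimal illustration, not the code in process_brief_enhanced.py: the function name merge_assets and the choice of title, format, language, and country as the duplicate key are assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical duplicate key: two assets count as the same if these fields match.
KEY_FIELDS = ("title", "format", "language", "country")


def _asset_key(asset: Dict[str, str]) -> Tuple[str, ...]:
    """Build a case- and whitespace-insensitive key from the chosen fields."""
    return tuple((asset.get(field) or "").strip().lower() for field in KEY_FIELDS)


def merge_assets(initial: List[Dict[str, str]],
                 validated: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Merge validation-pass assets into the initial list, dropping duplicates.

    Assets from the initial pass keep their position. When the validation pass
    returns the same asset again, its non-empty fields fill gaps in the original.
    """
    merged: Dict[Tuple[str, ...], Dict[str, str]] = {}
    for asset in initial + validated:
        key = _asset_key(asset)
        if key not in merged:
            merged[key] = dict(asset)
            continue
        for field, value in asset.items():
            # Only fill fields that are still empty; never overwrite.
            if value and not merged[key].get(field):
                merged[key][field] = value
    return list(merged.values())
```

Keying on several fields rather than the title alone keeps localized variants of the same asset (same title, different language or country) as separate rows.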

Stage 5: CSV Output Generation

Generates structured CSV files with 20 standardized columns:

CSV_HEADERS = [
    'title', 'status', 'category', 'media', 'asset_type', 
    'brand_identifier', 'format', 'review_date', 'live_date', 
    'end_date', 'reference_material', 'language', 'country', 
    'quantity', 'page_number', 'section_context', 'priority_level',
    'technical_requirements', 'creative_direction', 'approval_level'
]
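
Writing rows against these headers might look like the sketch below. It assumes assets are plain dicts keyed by header name; the function name write_assets_csv is hypothetical and the project's actual CSV generator may differ.

```python
import csv
from typing import Dict, Iterable

CSV_HEADERS = [
    'title', 'status', 'category', 'media', 'asset_type',
    'brand_identifier', 'format', 'review_date', 'live_date',
    'end_date', 'reference_material', 'language', 'country',
    'quantity', 'page_number', 'section_context', 'priority_level',
    'technical_requirements', 'creative_direction', 'approval_level'
]


def write_assets_csv(assets: Iterable[Dict[str, str]], output_path: str) -> int:
    """Write assets to output_path and return the number of data rows written."""
    rows = 0
    # newline="" lets the csv module control line endings on every platform.
    with open(output_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=CSV_HEADERS,
            restval="",             # missing fields become empty cells
            extrasaction="ignore",  # fields outside the schema are dropped
        )
        writer.writeheader()
        for asset in assets:
            writer.writerow(asset)
            rows += 1
    return rows
```

Using DictWriter with a fixed fieldnames list keeps the column order stable even when individual assets omit fields.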

API Integration Details

OpenAI GPT-5 Integration

Configuration (process_brief_enhanced.py:224):

def _setup_model(self):
    return OpenAI(
        api_key=OPENAI_API_KEY,
        timeout=3600,    # 60 minutes for reasoning tasks
        max_retries=2    # Reduced retries for efficiency
    )

API Call Pattern:

response = self.model.responses.parse(
    model=self.model_name,                    # 'gpt-5'
    input=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": combined_prompt}
    ],
    reasoning={"effort": self.reasoning_effort}, # high/medium/low/minimal
    text_format=AssetExtractionResult           # Pydantic schema
)
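
Token counts for cost tracking can be read from the response after each call. The sketch below is an assumption about how that could be done, not the project's code: it relies on the usage field names input_tokens, output_tokens, and input_tokens_details.cached_tokens, and on input_tokens including cached tokens. The helper name usage_from_response is hypothetical, and whether the project's TokenUsage stores input tokens net of cached ones is not stated in this document.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    input_tokens: int = 0
    cached_input_tokens: int = 0
    output_tokens: int = 0


def usage_from_response(response) -> TokenUsage:
    """Read token counts from a response object, tolerating missing fields."""
    usage = getattr(response, "usage", None)
    if usage is None:
        return TokenUsage()
    total_input = getattr(usage, "input_tokens", 0) or 0
    details = getattr(usage, "input_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    # Assumption: input_tokens includes cached tokens, so split them out
    # to bill each class at its own rate.
    return TokenUsage(
        input_tokens=max(total_input - cached, 0),
        cached_input_tokens=cached,
        output_tokens=getattr(usage, "output_tokens", 0) or 0,
    )
```

Summing the returned values across the analysis and validation calls gives the per-document totals used in the cost summary.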

LlamaCloud Integration

Document Processing:

def _extract_document_content(self, filepath: str) -> str:
    parser = LlamaParse(
        api_key=LLAMACLOUD_API_KEY,
        parse_mode="parse_page_with_agent",
        model="openai-gpt-5",
        # ... additional configuration
    )
    
    result = parser.parse(filepath)
    markdown_documents = result.get_markdown_documents(split_by_page=True)
    combined_content = "\n\n".join([doc.text for doc in markdown_documents])
    
    return combined_content

Configuration Management

Environment Setup

Required Dependencies:

# Core AI/ML libraries
openai>=1.0.0
llama-cloud-services>=0.6.62
google-generativeai>=0.3.0
json5>=0.9.0

# Document processing
python-pptx>=0.6.21
PyMuPDF>=1.23.0
python-docx>=0.8.11
openpyxl>=3.1.0

# Data processing
pandas>=2.0.0
numpy>=1.24.0

API Keys Configuration:

# Required API Keys in process_brief_enhanced.py
OPENAI_API_KEY = "your-openai-api-key"      # Line 30
LLAMACLOUD_API_KEY = "your-llama-api-key"   # Line 31
GEMINI_API_KEY = "legacy-key"               # Line 29 (unused)

Virtual Environment Setup

# Create virtual environment
python -m venv venv

# Activate environment
source venv/bin/activate  # Linux/Mac
venv\Scripts\activate     # Windows

# Install dependencies
pip install -r requirements_enhanced.txt

Usage Examples

Basic Usage

# Default processing (medium reasoning)
python process_brief_enhanced.py document.pptx

# High reasoning effort for complex documents
python process_brief_enhanced.py complex_brief.pdf high

# Low reasoning effort for simple documents
python process_brief_enhanced.py simple_brief.docx low

Command Line Arguments

  1. filepath (required): Path to document file
  2. reasoning_effort (optional): high, medium (default), low, minimal
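
The two positional arguments could be parsed as in this sketch. Whether the script uses argparse or reads sys.argv directly is not stated here, so treat the function name parse_args and the structure as assumptions.

```python
import argparse
from typing import List, Optional

REASONING_EFFORTS = ("high", "medium", "low", "minimal")


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse `filepath [reasoning_effort]`; argv=None reads sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        prog="process_brief_enhanced.py",
        description="Extract marketing assets from a brief.",
    )
    parser.add_argument("filepath", help="Path to the document file")
    parser.add_argument(
        "reasoning_effort",
        nargs="?",                  # optional positional argument
        default="medium",
        choices=REASONING_EFFORTS,  # rejects typos before any API call is made
        help="Reasoning effort (default: medium)",
    )
    return parser.parse_args(argv)
```

Rejecting an invalid effort value at parse time avoids paying for document parsing before the OpenAI call fails.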

Output Structure

Console Output:

=== ENHANCED BRIEF PROCESSING STARTED ===
Document Type: powerpoint
Assets Extracted: 245
Confidence Score: 0.95
Processing Notes: Multi-perspective analysis completed, Added 12 assets from validation
Output File: output/document-20250102140530.csv

=== COST ANALYSIS ===
Model Used: gpt-5
Input Tokens: 45,230
Output Tokens: 12,450
Total Cost: $0.2376

CSV Output Format:

title,status,category,media,asset_type,brand_identifier,format,review_date,...
"Display Banner - Hero","Active","Display","Image","PNG","Adidas","1920x1080","2025-01-15",...
"Social Media Post","Pending","Social","Image","JPG","Adidas","1080x1080","2025-01-10",...

Cost Management & Performance

Token Usage Tracking

Cost Calculation (process_brief_enhanced.py:176):

def calculate_cost(self, model_name: str) -> float:
    pricing = OPENAI_PRICING[model_name]
    
    input_cost = (self.input_tokens / 1_000_000) * pricing['input']
    cached_cost = (self.cached_input_tokens / 1_000_000) * pricing['cached_input']
    output_cost = (self.output_tokens / 1_000_000) * pricing['output']
    
    return input_cost + cached_cost + output_cost

GPT-5 pricing as configured in OPENAI_PRICING (USD per 1M tokens):

  • Input: $2.50
  • Cached Input: $1.25
  • Output: $10.00
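
As a check on these rates, the cost of the sample run shown under Output Structure (45,230 input tokens and 12,450 output tokens, assuming no cached input) can be recomputed. The helper name cost_usd is illustrative and not part of the project.

```python
# Rates copied from the list above, in USD per 1M tokens.
PRICING_PER_MILLION = {"input": 2.50, "cached_input": 1.25, "output": 10.00}


def cost_usd(input_tokens: int, output_tokens: int, cached_input_tokens: int = 0) -> float:
    """Apply the per-million rates to each token class and sum the result."""
    return (
        input_tokens * PRICING_PER_MILLION["input"]
        + cached_input_tokens * PRICING_PER_MILLION["cached_input"]
        + output_tokens * PRICING_PER_MILLION["output"]
    ) / 1_000_000


sample = cost_usd(input_tokens=45_230, output_tokens=12_450)
print(f"${sample:.4f}")  # prints $0.2376
```

Output tokens account for just over half of the total here despite being about a quarter of the input volume, which is why reasoning effort has a large effect on cost.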

Performance Characteristics

Processing Time:

  • Simple documents (1-10 pages): 30 seconds - 2 minutes
  • Complex documents (10+ pages): 2-5 minutes
  • Reasoning effort impact: High effort adds 50-100% processing time

Memory Usage:

  • Base memory: ~200MB
  • Document processing: +50-200MB depending on document size
  • AI processing: +100-300MB during API calls

Typical Token Usage:

  • Small brief (5 pages): 10K-20K input, 2K-5K output
  • Medium brief (15 pages): 30K-50K input, 5K-10K output
  • Large brief (30+ pages): 60K-100K input, 10K-20K output

Error Handling & Logging

Logging Configuration (process_brief_enhanced.py:548)

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('processing.log', mode='w'),
        logging.StreamHandler(sys.stdout)
    ]
)

Error Handling Patterns

API Error Handling:

try:
    response = self.model.responses.parse(...)
    # Process successful response
except Exception as e:
    logging.error(f"Multi-perspective analysis failed: {e}")
    return ProcessingResult([], {}, 0.0, [f"Analysis failed: {e}"], TokenUsage())

File Processing Errors:

try:
    document_content = self._extract_document_content(filepath)
except Exception as e:
    logging.error(f"Content extraction failed: {e}")
    return ProcessingResult([], {}, 0.0, [f"Content extraction failed: {e}"], TokenUsage())

Common Error Scenarios

  1. API Key Issues: Missing or invalid OpenAI/LlamaCloud API keys
  2. File Access: Permissions or file corruption issues
  3. API Timeouts: Large documents exceeding timeout limits
  4. Rate Limits: API rate limiting from OpenAI or LlamaCloud
  5. Memory Issues: Very large documents causing memory exhaustion
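
For transient failures such as timeouts and rate limits, one option is a retry wrapper with exponential backoff around calls that the OpenAI client's own max_retries setting does not cover. The sketch below is an assumption about how that could be added; call_with_backoff is a hypothetical helper, and the caller decides which exception types are retryable.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    retryable: Tuple[Type[BaseException], ...],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on the given exceptions with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retryable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt: base, 2*base, 4*base, ...
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter spreads out retries from concurrent runs.
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

Errors outside the retryable tuple, such as an invalid API key or a corrupt file, propagate immediately, since retrying them only adds delay.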

System Comparison & Benchmarking

Comparison Script (compare_systems.py)

Benchmarks the enhanced system against baseline implementations:

Key Metrics:

  • Asset extraction count
  • Processing time
  • Cost per document
  • Accuracy assessment

Usage:

python compare_systems.py

Expected Improvements:

  • 50-300% more assets extracted than the baseline implementations
  • More accurate technical specifications than the baseline implementations
  • Better handling of multi-language/multi-market requirements

Extension Points

1. Adding New Document Types

Extend DocumentType enum:

class DocumentType(Enum):
    POWERPOINT = "powerpoint"
    WORD = "word"
    PDF = "pdf"
    EXCEL = "excel"
    # Add new types here
    GOOGLE_SLIDES = "google_slides"
    UNKNOWN = "unknown"

Implement classification logic:

def classify_document(self, filepath: str) -> DocumentType:
    extension = os.path.splitext(filepath)[1].lower()
    # Add new extension handling
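
One way to keep new extensions to a one-line change is a lookup table in place of the if/elif chain. This is a sketch, not the existing implementation: it is written as a module-level function, and the ".gslides" mapping is a hypothetical example of a new entry.

```python
import os
from enum import Enum


class DocumentType(Enum):
    POWERPOINT = "powerpoint"
    WORD = "word"
    PDF = "pdf"
    EXCEL = "excel"
    GOOGLE_SLIDES = "google_slides"
    UNKNOWN = "unknown"


# Adding a document type means adding an enum member and one entry here.
EXTENSION_TYPES = {
    ".ppt": DocumentType.POWERPOINT,
    ".pptx": DocumentType.POWERPOINT,
    ".doc": DocumentType.WORD,
    ".docx": DocumentType.WORD,
    ".pdf": DocumentType.PDF,
    ".xls": DocumentType.EXCEL,
    ".xlsx": DocumentType.EXCEL,
    ".gslides": DocumentType.GOOGLE_SLIDES,  # hypothetical mapping
}


def classify_document(filepath: str) -> DocumentType:
    """Map a file extension to a DocumentType, case-insensitively."""
    extension = os.path.splitext(filepath)[1].lower()
    return EXTENSION_TYPES.get(extension, DocumentType.UNKNOWN)
```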

2. Custom Output Formats

Extend output generation:

import json

def generate_json_output(self, results: ProcessingResult) -> str:
    """Serialize the extracted assets as a JSON array."""
    # Assumes results.raw_data is an iterable of JSON-serializable asset dicts
    assets = list(results.raw_data)
    return json.dumps(assets, indent=2)

3. Additional AI Models

Support multiple AI providers:

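def _setup_model(self):
    if self.model_name.startswith('gpt-'):
        return self._setup_openai()
    elif self.model_name.startswith('claude-'):
        return self._setup_anthropic()
    # Add other providers above this line
    # Fail loudly instead of silently returning None for an unknown model
    raise ValueError(f"Unsupported model: {self.model_name}")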

4. Custom Validation Rules

Extend validation logic:

def _custom_validation_rules(self, assets: List[Dict]) -> List[str]:
    """Apply business-specific validation rules"""
    issues = []
    for asset in assets:
        # Custom validation logic
        if not asset.get('format') and asset.get('media') == 'Image':
            issues.append(f"Image asset '{asset['title']}' missing format specification")
    return issues

Security Considerations

API Key Management

  • Store API keys in environment variables
  • Use secure key rotation practices
  • Implement key validation before processing
  • Consider using secret management services
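
Loading keys from the environment with validation before processing could look like the sketch below. The variable names mirror the constants in process_brief_enhanced.py, but reading them from the environment is an assumption about how the script could be changed, and load_api_keys is a hypothetical helper.

```python
import os
from typing import Dict, Mapping

REQUIRED_KEYS = ("OPENAI_API_KEY", "LLAMACLOUD_API_KEY")


def load_api_keys(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required API keys, failing before any document is processed."""
    keys = {name: env.get(name, "").strip() for name in REQUIRED_KEYS}
    missing = [name for name, value in keys.items() if not value]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return keys
```

Checking every key up front reports all missing variables in one error rather than failing partway through the pipeline.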

Document Security

  • Validate file types before processing
  • Implement file size limits
  • Scan for malicious content
  • Ensure secure temporary file handling
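
The first two checks can be done before a file is uploaded to any external service. The following sketch assumes a 50 MiB limit, which is an illustrative value, and validate_input_file is a hypothetical helper. It checks only the extension and size, not the file's actual content.

```python
import os

ALLOWED_EXTENSIONS = {".ppt", ".pptx", ".doc", ".docx", ".pdf", ".xls", ".xlsx"}
MAX_FILE_BYTES = 50 * 1024 * 1024  # hypothetical limit: 50 MiB


def validate_input_file(filepath: str, max_bytes: int = MAX_FILE_BYTES) -> None:
    """Raise if the file has an unsupported extension, is missing, empty, or too large."""
    extension = os.path.splitext(filepath)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {extension or '(none)'}")
    if not os.path.isfile(filepath):
        raise FileNotFoundError(f"Not a regular file: {filepath}")
    size = os.path.getsize(filepath)
    if size == 0:
        raise ValueError("File is empty")
    if size > max_bytes:
        raise ValueError(f"File is {size} bytes; limit is {max_bytes}")
```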

Data Privacy

  • Log sanitization (avoid logging sensitive content)
  • Secure deletion of temporary files
  • Consider data residency requirements for API calls
  • Implement audit trails for document processing
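
Log sanitization can be approached with a logging filter that redacts key-like strings before they reach processing.log. This is a sketch under stated assumptions: the "sk-" and "llx-" prefixes are assumed key formats, and RedactSecretsFilter is a hypothetical class, not part of the project.

```python
import logging
import re

# Assumed key shapes: a known prefix followed by at least 8 key characters.
SECRET_PATTERN = re.compile(r"\b(?:sk|llx)-[A-Za-z0-9_-]{8,}")


class RedactSecretsFilter(logging.Filter):
    """Replace anything that looks like an API key in the formatted message."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # message with %-style args applied
        redacted = SECRET_PATTERN.sub("[REDACTED]", message)
        if redacted != message:
            record.msg = redacted
            record.args = None  # args are already merged into msg
        return True  # never drop the record, only rewrite it
```

Attaching the filter to each handler with handler.addFilter(RedactSecretsFilter()), rather than to a single logger, also covers records that propagate from other loggers. It does not remove sensitive document content, which still needs to be kept out of log messages at the call site.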

Troubleshooting Guide

Common Issues

1. "OPENAI_API_KEY not set" Error

# Solution: Set API key in process_brief_enhanced.py line 30
OPENAI_API_KEY = "your-actual-api-key"

2. LlamaParser Import Error

# Solution: Install llama-cloud-services
# Quote the requirement so the shell does not treat ">" as output redirection
pip install "llama-cloud-services>=0.6.62"

3. Timeout Errors on Large Documents

# Solution: Use lower reasoning effort
python process_brief_enhanced.py large_doc.pdf low

4. High Processing Costs

# Solution: Use minimal reasoning for simple documents
python process_brief_enhanced.py simple_doc.docx minimal

Performance Optimization

1. Reduce Processing Time:

  • Use low or minimal reasoning effort for simple documents
  • Implement document pre-filtering
  • Consider document splitting for very large files

2. Reduce Costs:

  • Monitor token usage with cost summaries
  • Use cached inputs when possible
  • Optimize prompt length and complexity

3. Improve Accuracy:

  • Use high reasoning effort for complex documents
  • Customize prompts for specific document types
  • Implement domain-specific validation rules

Future Enhancements

Planned Features

  1. Batch Processing: Process multiple documents simultaneously
  2. Document Templates: Predefined extraction rules for common brief types
  3. Quality Scoring: Automated confidence scoring for extractions
  4. Export Formats: Support for JSON, XML, and database outputs
  5. Integration APIs: REST API for system integration
  6. Real-time Monitoring: Processing metrics and alerting

Technical Roadmap

  1. Multi-threading: Parallel processing for large document batches
  2. Caching Layer: Redis-based result caching
  3. Database Integration: Direct database output options
  4. Containerization: Docker deployment support
  5. Cloud Deployment: AWS/GCP/Azure deployment options

Conclusion

The Adidas Brief Extraction System automates asset extraction from creative briefs by combining LlamaParser preprocessing with GPT-5 structured output. Its modular architecture, error handling, and logging are designed with production deployment in mind.

The multi-pass analysis pipeline targets high extraction accuracy, while cost tracking and configurable reasoning effort keep per-document cost visible and adjustable for large-scale document processing workflows.

For troubleshooting and optimization, start with processing.log and the console summary, which record pipeline progress, errors, and the token and cost totals for each run.