michael cdc4653bf6 Added Anthropic and Google as LLMs; multiple LLMs now perform the primary analysis, followed by a second consolidating and deduplicating step, with the LLM for every step configurable. Modified schema to limit expansion fields. Modified prompts accordingly.

2025-09-13 00:28:07 -05:00

Adidas Brief Extraction System - Technical Documentation

Overview

The Adidas Brief Extraction System is a CLI-based document analysis tool that uses OpenAI's GPT-5 model with configurable reasoning effort to extract structured marketing asset information from creative briefs, presentations, and technical documents. Documents are first converted to structured markdown, then analyzed in a multi-pass pipeline (multi-perspective extraction followed by a validation pass) to improve the completeness and accuracy of asset extraction.

Key Features

GPT-5 Integration: Uses OpenAI's latest GPT-5 model with configurable reasoning effort
LlamaParser Integration: Advanced document preprocessing for optimal content extraction
Multi-Pass Analysis: Multi-perspective analysis with cross-validation
Cost Tracking: Comprehensive token usage and cost management
Schema Validation: Pydantic-based data models with JSON schema validation
Comprehensive Logging: Detailed processing logs with error handling

System Architecture

graph TB
    subgraph "Input Layer"
        A[Document Files] --> B{Document Classifier}
        B --> C[PowerPoint]
        B --> D[Word]
        B --> E[PDF]
        B --> F[Excel]
    end
    
    subgraph "Preprocessing Layer"
        C --> G[LlamaParser Cloud]
        D --> G
        E --> G
        F --> G
        G --> H[Structured Markdown]
    end
    
    subgraph "AI Processing Layer"
        H --> I[DocumentAnalyzer]
        I --> J[Multi-Perspective Analysis]
        J --> K[Cross-Validation & Enhancement]
        K --> L[Asset Extraction Result]
    end
    
    subgraph "Output Layer"
        L --> M[CSV Generator]
        M --> N[Structured CSV Output]
    end
    
    subgraph "External Services"
        O[OpenAI GPT-5 API] --> J
        O --> K
        P[LlamaCloud API] --> G
    end
    
    subgraph "Monitoring & Logging"
        Q[Token Usage Tracker] --> I
        R[Cost Calculator] --> Q
        S[Processing Logger] --> I
    end

Core Components

1. DocumentAnalyzer Class (`process_brief_enhanced.py:216`)

The main orchestrator that manages the entire document processing pipeline.

Key Responsibilities:

Model configuration and API client setup
Document type classification
Multi-pass analysis coordination
Token usage tracking
Error handling and logging

Configuration:

analyzer = DocumentAnalyzer(
    model_name='gpt-5',           # OpenAI model
    reasoning_effort='medium'      # high, medium, low, minimal
)

2. LlamaParser Integration (`process_brief_enhanced.py:278-325`)

Advanced document preprocessing using LlamaCloud services for optimal content extraction.

Configuration Parameters:

parser = LlamaParse(
    api_key=LLAMACLOUD_API_KEY,
    parse_mode="parse_page_with_agent",    # Agent-based parsing
    model="openai-gpt-5",                  # Using GPT-5 for parsing
    high_res_ocr=True,                     # High-resolution OCR
    adaptive_long_table=True,              # Smart table handling
    outlined_table_extraction=True,        # Table structure detection
    output_tables_as_HTML=True,            # HTML table output
    page_separator="\n\n---\n\n"           # Page separation marker
)

3. Data Models

MarketingAsset (Pydantic Model - `process_brief_enhanced.py:50`)

class MarketingAsset(BaseModel):
    title: str
    status: Optional[str] = ""
    category: Optional[str] = ""
    media: Optional[str] = ""
    asset_type: Optional[str] = ""
    brand_identifier: Optional[str] = ""
    format: Optional[str] = ""
    review_date: Optional[str] = ""
    live_date: Optional[str] = ""
    end_date: Optional[str] = ""
    reference_material: Optional[str] = ""
    language: Optional[str] = ""
    country: Optional[str] = ""
    quantity: Optional[str] = "1"
    page_number: Optional[str] = ""
    section_context: Optional[str] = ""
    priority_level: Optional[str] = ""
    technical_requirements: Optional[str] = ""
    creative_direction: Optional[str] = ""
    approval_level: Optional[str] = ""

TokenUsage (Cost Tracking - `process_brief_enhanced.py:164`)

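from dataclasses import dataclass

@dataclass
class TokenUsage:
    input_tokens: int = 0
    cached_input_tokens: int = 0
    output_tokens: int = 0

    def calculate_cost(self, model_name: str) -> float:
        # GPT-5 pricing in USD per 1M tokens
        pricing = {
            'input': 2.50,
            'cached_input': 1.25,
            'output': 10.00
        }
        # Apply each rate to its token count, then scale from tokens to millions
        return (
            self.input_tokens * pricing['input']
            + self.cached_input_tokens * pricing['cached_input']
            + self.output_tokens * pricing['output']
        ) / 1_000_000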

Process Flow

sequenceDiagram
    participant CLI as CLI Interface
    participant DA as DocumentAnalyzer
    participant LP as LlamaParser
    participant GPT5 as OpenAI GPT-5
    participant CSV as CSV Generator
    participant LOG as Logger

    CLI->>DA: Initialize (model_name, reasoning_effort)
    DA->>LOG: Setup logging
    CLI->>DA: process_document_multi_pass(filepath)
    
    Note over DA: Stage 1: Document Classification
    DA->>DA: classify_document(filepath)
    DA->>LOG: Log document type
    
    Note over DA: Stage 2: Content Extraction
    DA->>LP: _extract_document_content(filepath)
    LP->>LP: Parse with agent-based parsing
    LP->>DA: Return structured markdown
    DA->>LOG: Log content extraction success
    
    Note over DA: Stage 3: Multi-Perspective Analysis
    DA->>DA: _perform_multi_perspective_analysis()
    DA->>DA: _load_prompt('multi_perspective_analysis')
    DA->>GPT5: responses.parse() with reasoning effort
    GPT5->>DA: Return structured assets
    DA->>DA: Track token usage
    DA->>LOG: Log analysis completion
    
    Note over DA: Stage 4: Cross-Validation
    DA->>DA: _enhance_and_validate_results()
    DA->>DA: _load_prompt('validation_analysis')
    DA->>GPT5: responses.parse() for validation
    GPT5->>DA: Return additional/corrected assets
    DA->>DA: Merge results
    DA->>LOG: Log validation results
    
    Note over DA: Stage 5: Output Generation
    DA->>CSV: Generate CSV output
    CSV->>CLI: Return output filepath
    DA->>LOG: Log cost summary
    CLI->>CLI: Display processing summary

Detailed Processing Stages

Stage 1: Document Classification (`process_brief_enhanced.py:254`)

def classify_document(self, filepath: str) -> DocumentType:
    extension = os.path.splitext(filepath)[1].lower()
    
    if extension in ['.ppt', '.pptx']:
        return DocumentType.POWERPOINT
    elif extension in ['.doc', '.docx']:
        return DocumentType.WORD
    elif extension == '.pdf':
        return DocumentType.PDF
    elif extension in ['.xls', '.xlsx']:
        return DocumentType.EXCEL
    else:
        return DocumentType.UNKNOWN
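
The same classification can also be expressed as a lookup table, which may make new extensions easier to add. The following is a minimal sketch, not the project's implementation: `EXTENSION_TO_TYPE` and `classify_by_extension` are hypothetical names, and the enum is repeated only so the example runs on its own.

```python
import os
from enum import Enum


class DocumentType(Enum):
    POWERPOINT = "powerpoint"
    WORD = "word"
    PDF = "pdf"
    EXCEL = "excel"
    UNKNOWN = "unknown"


# Hypothetical lookup table; the extensions mirror the if/elif chain above.
EXTENSION_TO_TYPE = {
    ".ppt": DocumentType.POWERPOINT,
    ".pptx": DocumentType.POWERPOINT,
    ".doc": DocumentType.WORD,
    ".docx": DocumentType.WORD,
    ".pdf": DocumentType.PDF,
    ".xls": DocumentType.EXCEL,
    ".xlsx": DocumentType.EXCEL,
}


def classify_by_extension(filepath: str) -> DocumentType:
    """Return the document type for a path, or UNKNOWN if the extension is unmapped."""
    extension = os.path.splitext(filepath)[1].lower()
    return EXTENSION_TO_TYPE.get(extension, DocumentType.UNKNOWN)


print(classify_by_extension("briefs/Campaign.PPTX"))  # DocumentType.POWERPOINT
print(classify_by_extension("notes.txt"))             # DocumentType.UNKNOWN
```

Because the extension is lowercased before the lookup, `Campaign.PPTX` and `campaign.pptx` classify identically.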

Stage 2: Content Extraction with LlamaParser

The system uses LlamaCloud's parsing service to convert complex documents into clean, structured markdown:

Key Features:

Agent-based parsing: Uses AI agents to understand document structure
High-resolution OCR: Extracts text from images and scanned documents
Adaptive table handling: Detects and preserves table structures
Page separation: Maintains document flow with clear page boundaries

Stage 3: Multi-Perspective Analysis

Uses external prompt files for maintainable AI instructions:

Prompt Loading System (`process_brief_enhanced.py:241`):

def _load_prompt(self, prompt_name: str) -> str:
    prompt_path = os.path.join(os.path.dirname(__file__), 'prompts', f'{prompt_name}.txt')
    with open(prompt_path, 'r', encoding='utf-8') as f:
        return f.read().strip()
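
A misspelled prompt name would surface here as a bare `FileNotFoundError`. The sketch below shows one way to make that failure more informative; `load_prompt` and its `prompt_dir` parameter are hypothetical and stand in for the method above, and the temporary directory exists only for the demonstration.

```python
import tempfile
from pathlib import Path


def load_prompt(prompt_dir: Path, prompt_name: str) -> str:
    """Read '<prompt_name>.txt' from prompt_dir and strip surrounding whitespace."""
    prompt_path = prompt_dir / f"{prompt_name}.txt"
    if not prompt_path.is_file():
        # List what is available so a typo in the prompt name is easy to spot.
        available = sorted(p.stem for p in prompt_dir.glob("*.txt"))
        raise FileNotFoundError(
            f"Prompt '{prompt_name}' not found in {prompt_dir}; available: {available}"
        )
    return prompt_path.read_text(encoding="utf-8").strip()


# Demonstration with a throwaway prompts directory.
with tempfile.TemporaryDirectory() as tmp:
    prompts = Path(tmp)
    (prompts / "system_validation.txt").write_text("  You are a reviewer.\n", encoding="utf-8")
    print(load_prompt(prompts, "system_validation"))  # You are a reviewer.
```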

Analysis Prompts:

multi_perspective_analysis.txt: Comprehensive asset extraction rules
system_multi_perspective.txt: System message for analysis
validation_analysis.txt: Quality assurance and gap analysis
system_validation.txt: System message for validation

Stage 4: Cross-Validation & Enhancement

Implements a two-pass system (initial analysis, then validation) that proceeds in four steps:

Initial Analysis: Extracts assets from multiple professional perspectives
Validation Pass: Reviews extraction completeness and accuracy
Gap Analysis: Identifies missing or incorrectly extracted assets
Result Merging: Combines and deduplicates findings
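
The merging step could look roughly like the sketch below. The actual merge criteria are not documented in this section, so the identity key (`title`, `format`, `country`) and the names `asset_key` and `merge_assets` are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical identity fields; the real deduplication criteria may differ.
KEY_FIELDS = ("title", "format", "country")


def asset_key(asset: Dict[str, str]) -> Tuple[str, ...]:
    """Build a case- and whitespace-insensitive identity key for an asset."""
    return tuple(" ".join(asset.get(f, "").lower().split()) for f in KEY_FIELDS)


def merge_assets(initial: List[Dict[str, str]],
                 validated: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Merge two passes, keeping first-seen order and filling empty fields."""
    merged: Dict[Tuple[str, ...], Dict[str, str]] = {}
    for asset in initial + validated:
        key = asset_key(asset)
        if key not in merged:
            merged[key] = dict(asset)
            continue
        # Duplicate: only fill fields that the earlier record left empty.
        for field, value in asset.items():
            if value and not merged[key].get(field):
                merged[key][field] = value
    return list(merged.values())


initial = [{"title": "Hero Banner", "format": "1920x1080", "country": "DE", "language": ""}]
validated = [
    {"title": "hero  banner", "format": "1920x1080", "country": "de", "language": "German"},
    {"title": "Social Post", "format": "1080x1080", "country": "DE", "language": "German"},
]
result = merge_assets(initial, validated)
print(len(result))            # 2
print(result[0]["language"])  # German
```

Values from the initial pass win on conflict here; the validation pass only fills gaps. Reversing that priority is a design choice that depends on which pass is trusted more.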

Stage 5: CSV Output Generation

Generates structured CSV files with 20 standardized columns:

CSV_HEADERS = [
    'title', 'status', 'category', 'media', 'asset_type', 
    'brand_identifier', 'format', 'review_date', 'live_date', 
    'end_date', 'reference_material', 'language', 'country', 
    'quantity', 'page_number', 'section_context', 'priority_level',
    'technical_requirements', 'creative_direction', 'approval_level'
]
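
As an illustration of how these headers might drive CSV generation, here is a minimal sketch using the standard library's `csv.DictWriter`. The function name `write_assets_csv` is hypothetical, and the project's actual generator may quote or order fields differently.

```python
import csv
import io
from typing import Dict, Iterable, TextIO

CSV_HEADERS = [
    'title', 'status', 'category', 'media', 'asset_type',
    'brand_identifier', 'format', 'review_date', 'live_date',
    'end_date', 'reference_material', 'language', 'country',
    'quantity', 'page_number', 'section_context', 'priority_level',
    'technical_requirements', 'creative_direction', 'approval_level'
]


def write_assets_csv(assets: Iterable[Dict[str, str]], stream: TextIO) -> int:
    """Write assets as CSV rows and return the number of rows written."""
    writer = csv.DictWriter(
        stream,
        fieldnames=CSV_HEADERS,
        restval="",             # missing fields become empty cells
        extrasaction="ignore",  # unknown keys are dropped instead of raising
    )
    writer.writeheader()
    count = 0
    for asset in assets:
        writer.writerow(asset)
        count += 1
    return count


buffer = io.StringIO()
written = write_assets_csv(
    [{"title": "Display Banner - Hero", "media": "Image", "unexpected": "x"}],
    buffer,
)
print(written)  # 1
print(buffer.getvalue().splitlines()[0][:28])  # title,status,category,media
```

When writing to a real file rather than a `StringIO`, open it with `newline=""` so the `csv` module controls line endings.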

API Integration Details

OpenAI GPT-5 Integration

Configuration (`process_brief_enhanced.py:224`):

def _setup_model(self):
    return OpenAI(
        api_key=OPENAI_API_KEY,
        timeout=3600,    # 60 minutes for reasoning tasks
        max_retries=2    # Reduced retries for efficiency
    )

API Call Pattern:

response = self.model.responses.parse(
    model=self.model_name,                    # 'gpt-5'
    input=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": combined_prompt}
    ],
    reasoning={"effort": self.reasoning_effort}, # high/medium/low/minimal
    text_format=AssetExtractionResult           # Pydantic schema
)

LlamaCloud Integration

Document Processing:

def _extract_document_content(self, filepath: str) -> str:
    parser = LlamaParse(
        api_key=LLAMACLOUD_API_KEY,
        parse_mode="parse_page_with_agent",
        model="openai-gpt-5",
        # ... additional configuration
    )
    
    result = parser.parse(filepath)
    markdown_documents = result.get_markdown_documents(split_by_page=True)
    combined_content = "\n\n".join([doc.text for doc in markdown_documents])
    
    return combined_content

Configuration Management

Environment Setup

Required Dependencies:

# Core AI/ML libraries
openai>=1.0.0
llama-cloud-services>=0.6.62
google-generativeai>=0.3.0
json5>=0.9.0

# Document processing
python-pptx>=0.6.21
PyMuPDF>=1.23.0
python-docx>=0.8.11
openpyxl>=3.1.0

# Data processing
pandas>=2.0.0
numpy>=1.24.0

API Keys Configuration:

# Required API Keys in process_brief_enhanced.py
OPENAI_API_KEY = "your-openai-api-key"      # Line 30
LLAMACLOUD_API_KEY = "your-llama-api-key"   # Line 31
GEMINI_API_KEY = "legacy-key"               # Line 29 (unused)

Virtual Environment Setup

# Create virtual environment
python -m venv venv

# Activate environment
source venv/bin/activate  # Linux/Mac
venv\Scripts\activate     # Windows

# Install dependencies
pip install -r requirements_enhanced.txt

Usage Examples

Basic Usage

# Default processing (medium reasoning)
python process_brief_enhanced.py document.pptx

# High reasoning effort for complex documents
python process_brief_enhanced.py complex_brief.pdf high

# Low reasoning effort for simple documents
python process_brief_enhanced.py simple_brief.docx low

Command Line Arguments

filepath (required): Path to document file
reasoning_effort (optional): high, medium (default), low, minimal
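
The two positional arguments can be modeled with `argparse` as sketched below. This section does not say how the script actually parses its arguments, so treat `parse_cli_args` and `REASONING_LEVELS` as hypothetical names illustrating the documented interface.

```python
import argparse
from typing import List, Optional

REASONING_LEVELS = ("high", "medium", "low", "minimal")


def parse_cli_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse 'filepath [reasoning_effort]' as described above."""
    parser = argparse.ArgumentParser(
        prog="process_brief_enhanced.py",
        description="Extract marketing assets from a creative brief.",
    )
    parser.add_argument("filepath", help="Path to document file")
    parser.add_argument(
        "reasoning_effort",
        nargs="?",                 # optional positional argument
        default="medium",
        choices=REASONING_LEVELS,  # rejects any other value with a usage error
        help="Reasoning effort (default: medium)",
    )
    return parser.parse_args(argv)


args = parse_cli_args(["complex_brief.pdf", "high"])
print(args.filepath, args.reasoning_effort)  # complex_brief.pdf high
```

Calling `parse_cli_args()` with no argument reads `sys.argv`, and an invalid effort level exits with a usage message rather than reaching the API.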

Output Structure

Console Output:

=== ENHANCED BRIEF PROCESSING STARTED ===
Document Type: powerpoint
Assets Extracted: 245
Confidence Score: 0.95
Processing Notes: Multi-perspective analysis completed, Added 12 assets from validation
Output File: output/document-20250102140530.csv

=== COST ANALYSIS ===
Model Used: gpt-5
Input Tokens: 45,230
Output Tokens: 12,450
Total Cost: $0.2376

CSV Output Format:

title,status,category,media,asset_type,brand_identifier,format,review_date,...
"Display Banner - Hero","Active","Display","Image","PNG","Adidas","1920x1080","2025-01-15",...
"Social Media Post","Pending","Social","Image","JPG","Adidas","1080x1080","2025-01-10",...

Cost Management & Performance

Token Usage Tracking

Cost Calculation (`process_brief_enhanced.py:176`):

def calculate_cost(self, model_name: str) -> float:
    pricing = OPENAI_PRICING[model_name]
    
    input_cost = (self.input_tokens / 1_000_000) * pricing['input']
    cached_cost = (self.cached_input_tokens / 1_000_000) * pricing['cached_input']
    output_cost = (self.output_tokens / 1_000_000) * pricing['output']
    
    return input_cost + cached_cost + output_cost

GPT-5 Pricing (per 1M tokens):

Input: $2.50
Cached Input: $1.25
Output: $10.00
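
As a worked example, the sketch below applies these rates to the token counts from the sample console output (45,230 input and 12,450 output tokens). `estimate_cost` is a hypothetical standalone helper, and it assumes the input count excludes cached tokens, which are billed at their own rate.

```python
# USD per 1M tokens, as listed above.
PRICING = {"input": 2.50, "cached_input": 1.25, "output": 10.00}


def estimate_cost(input_tokens: int, output_tokens: int,
                  cached_input_tokens: int = 0) -> float:
    """Return the estimated cost in USD for one processing run."""
    input_cost = (input_tokens / 1_000_000) * PRICING["input"]
    cached_cost = (cached_input_tokens / 1_000_000) * PRICING["cached_input"]
    output_cost = (output_tokens / 1_000_000) * PRICING["output"]
    return input_cost + cached_cost + output_cost


# 45,230 input tokens  -> 0.113075 USD
# 12,450 output tokens -> 0.124500 USD
cost = estimate_cost(45_230, 12_450)
print(f"${cost:.4f}")  # $0.2376
```

Output tokens account for slightly more than half of the total in this run, even though there are far fewer of them than input tokens.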

Performance Characteristics

Processing Time:

Simple documents (1-10 pages): 30 seconds - 2 minutes
Complex documents (10+ pages): 2-5 minutes
Reasoning effort impact: High effort adds 50-100% processing time

Memory Usage:

Base memory: ~200MB
Document processing: +50-200MB depending on document size
AI processing: +100-300MB during API calls

Typical Token Usage:

Small brief (5 pages): 10K-20K input, 2K-5K output
Medium brief (15 pages): 30K-50K input, 5K-10K output
Large brief (30+ pages): 60K-100K input, 10K-20K output

Error Handling & Logging

Logging Configuration (`process_brief_enhanced.py:548`)

import logging
import sys

# Log to both a file (overwritten on each run) and the console
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('processing.log', mode='w'),
        logging.StreamHandler(sys.stdout)
    ]
)

Error Handling Patterns

API Error Handling:

try:
    response = self.model.responses.parse(...)
    # Process successful response
except Exception as e:
    logging.error(f"Multi-perspective analysis failed: {e}")
    return ProcessingResult([], {}, 0.0, [f"Analysis failed: {e}"], TokenUsage())

File Processing Errors:

try:
    document_content = self._extract_document_content(filepath)
except Exception as e:
    logging.error(f"Content extraction failed: {e}")
    return ProcessingResult([], {}, 0.0, [f"Content extraction failed: {e}"], TokenUsage())

Common Error Scenarios

API Key Issues: Missing or invalid OpenAI/LlamaCloud API keys
File Access: Permissions or file corruption issues
API Timeouts: Large documents exceeding timeout limits
Rate Limits: API rate limiting from OpenAI or LlamaCloud
Memory Issues: Very large documents causing memory exhaustion
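
Timeouts and rate limits are usually transient, so retrying with exponential backoff is a common mitigation. The OpenAI client shown earlier is already configured with `max_retries=2`; the sketch below is a generic wrapper that could cover calls without built-in retries. `call_with_backoff` and its parameters are hypothetical and not part of the project.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying the listed exception types with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out retries
    raise AssertionError("unreachable")


attempts = {"count": 0}


def flaky_parse() -> str:
    """Simulated call that times out twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "parsed"


result = call_with_backoff(flaky_parse, retry_on=(TimeoutError,), sleep=lambda _s: None)
print(result, attempts["count"])  # parsed 3
```

Only exception types listed in `retry_on` are retried; anything else, such as an authentication failure, propagates immediately.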

System Comparison & Benchmarking

Comparison Script (`compare_systems.py`)

Benchmarks the enhanced system against baseline implementations:

Key Metrics:

Asset extraction count
Processing time
Cost per document
Accuracy assessment

Usage:

python compare_systems.py

Expected Improvements:

50-300% more assets extracted than the baseline
More accurate technical specifications
Better handling of multi-language and multi-market requirements

Extension Points

1. Adding New Document Types

Extend DocumentType enum:

class DocumentType(Enum):
    POWERPOINT = "powerpoint"
    WORD = "word"
    PDF = "pdf"
    EXCEL = "excel"
    # Add new types here
    GOOGLE_SLIDES = "google_slides"
    UNKNOWN = "unknown"

Implement classification logic:

def classify_document(self, filepath: str) -> DocumentType:
    extension = os.path.splitext(filepath)[1].lower()
    # Add new extension handling

2. Custom Output Formats

Extend output generation:

import json

def generate_json_output(self, results: ProcessingResult) -> str:
    """Generate JSON output format"""
    # Copy into a list so the assets serialize as a JSON array
    assets = list(results.raw_data)
    return json.dumps(assets, indent=2)

3. Additional AI Models

Support multiple AI providers:

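def _setup_model(self):
    if self.model_name.startswith('gpt-'):
        return self._setup_openai()
    elif self.model_name.startswith('claude-'):
        return self._setup_anthropic()
    # Add other providers above; fail loudly instead of silently returning None
    raise ValueError(f"Unsupported model: {self.model_name}")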

4. Custom Validation Rules

Extend validation logic:

def _custom_validation_rules(self, assets: List[Dict]) -> List[str]:
    """Apply business-specific validation rules"""
    issues = []
    for asset in assets:
        # Custom validation logic
        if not asset.get('format') and asset.get('media') == 'Image':
            issues.append(f"Image asset '{asset['title']}' missing format specification")
    return issues
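
A second rule in the same style might check that an asset's dates are in chronological order. This is an illustrative sketch only: `check_date_order` is a hypothetical rule, and it assumes dates are ISO `YYYY-MM-DD` strings as in the sample CSV output; values in any other format are skipped rather than flagged.

```python
from datetime import date
from typing import Dict, List, Optional

DATE_FIELDS = ("review_date", "live_date", "end_date")


def _parse_iso(value: Optional[str]) -> Optional[date]:
    """Return a date for 'YYYY-MM-DD' strings, or None if empty or unparseable."""
    if not value:
        return None
    try:
        return date.fromisoformat(value.strip())
    except ValueError:
        return None


def check_date_order(assets: List[Dict]) -> List[str]:
    """Flag assets whose review, live, and end dates are out of order."""
    issues = []
    for asset in assets:
        title = asset.get("title", "<untitled>")
        known = [
            (name, parsed)
            for name in DATE_FIELDS
            if (parsed := _parse_iso(asset.get(name))) is not None
        ]
        # Compare each known date with the next known date in field order.
        for (first_name, first), (next_name, nxt) in zip(known, known[1:]):
            if first > nxt:
                issues.append(
                    f"Asset '{title}': {first_name} ({first}) is after {next_name} ({nxt})"
                )
    return issues


sample = [{"title": "Display Banner - Hero",
           "review_date": "2025-01-15", "live_date": "2025-01-10"}]
print(check_date_order(sample))
```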

Security Considerations

API Key Management

Store API keys in environment variables
Use secure key rotation practices
Implement key validation before processing
Consider using secret management services
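
The first and third points above can be combined in a small loader that reads a key from the environment and rejects obviously invalid values before any processing starts. This is a sketch under stated assumptions: `load_api_key` and `MissingApiKeyError` are hypothetical names, and the `your-` check only targets the placeholder values shown in the configuration section.

```python
import os


class MissingApiKeyError(RuntimeError):
    """Raised when a required API key is absent or still a placeholder."""


def load_api_key(name: str) -> str:
    """Return the named key from the environment, or raise if it is unusable."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise MissingApiKeyError(f"{name} is not set; export it before running")
    if value.startswith("your-"):
        raise MissingApiKeyError(f"{name} still contains a placeholder value")
    return value


try:
    OPENAI_API_KEY = load_api_key("OPENAI_API_KEY")
    print("OpenAI key loaded")
except MissingApiKeyError as exc:
    print(f"Configuration error: {exc}")
```

Failing at startup keeps a missing key from surfacing later as an opaque authentication error in the middle of a long run.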

Document Security

Validate file types before processing
Implement file size limits
Scan for malicious content
Ensure secure temporary file handling
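
File type and size checks can run before any content is uploaded for parsing. In the sketch below, `preflight_check` is a hypothetical helper, the extension list mirrors the document classifier, and the 50 MiB limit is an arbitrary placeholder rather than a documented constraint.

```python
import os
import tempfile
from typing import List

ALLOWED_EXTENSIONS = {".ppt", ".pptx", ".doc", ".docx", ".pdf", ".xls", ".xlsx"}
MAX_FILE_BYTES = 50 * 1024 * 1024  # hypothetical 50 MiB limit


def preflight_check(filepath: str, max_bytes: int = MAX_FILE_BYTES) -> List[str]:
    """Return a list of problems; an empty list means the file may be processed."""
    problems = []
    extension = os.path.splitext(filepath)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"Unsupported file type: '{extension or 'none'}'")
    if not os.path.isfile(filepath):
        problems.append(f"File not found: {filepath}")
        return problems  # size checks are meaningless without a file
    size = os.path.getsize(filepath)
    if size == 0:
        problems.append("File is empty")
    elif size > max_bytes:
        problems.append(f"File is {size} bytes; limit is {max_bytes}")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "brief.pdf")
    with open(path, "wb") as handle:
        handle.write(b"%PDF-1.7 sample")
    print(preflight_check(path))               # []
    print(preflight_check(path, max_bytes=4))  # one size problem
```

An extension check is only a first filter. It does not verify file contents, so it complements rather than replaces malicious-content scanning.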

Data Privacy

Log sanitization (avoid logging sensitive content)
Secure deletion of temporary files
Consider data residency requirements for API calls
Implement audit trails for document processing

Troubleshooting Guide

Common Issues

1. "OPENAI_API_KEY not set" Error

# Solution: Set API key in process_brief_enhanced.py line 30
OPENAI_API_KEY = "your-actual-api-key"

2. LlamaParser Import Error

# Solution: Install llama-cloud-services
pip install "llama-cloud-services>=0.6.62"

3. Timeout Errors on Large Documents

# Solution: Use lower reasoning effort
python process_brief_enhanced.py large_doc.pdf low

4. High Processing Costs

# Solution: Use minimal reasoning for simple documents
python process_brief_enhanced.py simple_doc.docx minimal

Performance Optimization

1. Reduce Processing Time:

Use low or minimal reasoning effort for simple documents
Implement document pre-filtering
Consider document splitting for very large files
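
Document splitting could be done on page boundaries, as in the sketch below. It assumes the extracted markdown still contains the `page_separator` marker configured for LlamaParse; if pages are joined with a different delimiter, split on that instead. `split_into_chunks` and the character budget are hypothetical.

```python
from typing import List

PAGE_SEPARATOR = "\n\n---\n\n"  # same marker as the LlamaParse page_separator setting


def split_into_chunks(markdown: str, max_chars: int,
                      separator: str = PAGE_SEPARATOR) -> List[str]:
    """Group whole pages into chunks of at most max_chars where possible."""
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for page in markdown.split(separator):
        added = len(page) + (len(separator) if current else 0)
        if current and current_len + added > max_chars:
            # Adding this page would exceed the budget: close the current chunk.
            chunks.append(separator.join(current))
            current, current_len = [], 0
            added = len(page)
        current.append(page)
        current_len += added
    if current:
        chunks.append(separator.join(current))
    return chunks


document = PAGE_SEPARATOR.join(["page one", "page two", "page three"])
for chunk in split_into_chunks(document, max_chars=25):
    print(repr(chunk))
```

Pages are never cut in half, so a single page longer than `max_chars` becomes its own oversized chunk. Note that assets described across a chunk boundary may need the merge step to reconcile them.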

2. Reduce Costs:

Monitor token usage with cost summaries
Use cached inputs when possible
Optimize prompt length and complexity

3. Improve Accuracy:

Use high reasoning effort for complex documents
Customize prompts for specific document types
Implement domain-specific validation rules

Future Enhancements

Planned Features

Batch Processing: Process multiple documents simultaneously
Document Templates: Predefined extraction rules for common brief types
Quality Scoring: Automated confidence scoring for extractions
Export Formats: Support for JSON, XML, and database outputs
Integration APIs: REST API for system integration
Real-time Monitoring: Processing metrics and alerting

Technical Roadmap

Multi-threading: Parallel processing for large document batches
Caching Layer: Redis-based result caching
Database Integration: Direct database output options
Containerization: Docker deployment support
Cloud Deployment: AWS/GCP/Azure deployment options

Conclusion

The Adidas Brief Extraction System automates asset extraction from creative briefs by combining LlamaParse preprocessing, GPT-5 analysis, and a validation pass. Its modular architecture, error handling, and logging are intended to support production deployment.

The multi-pass analysis pipeline targets high extraction accuracy, while cost tracking and configurable reasoning effort help keep large-scale document processing affordable.

For troubleshooting and optimization, start with the processing log (processing.log) and the error messages described in the Troubleshooting Guide.
