Update docs with PDF processing, media plans, and production notes

Document pdf_processor.py and media_plan_processor.py in main components. Add detailed sections for PDF reference asset processing and media plan system. Add production permissions fix to common issues table.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 23:11:34 +02:00
commit 6333cdeb3e
parent 5429e4c684
2 changed files with 54 additions and 0 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -19,6 +19,8 @@ Visual AI QC is a Python Flask-based AI-powered quality control platform for ana
 - **`usage_tracker.py`** - Usage tracking and cost estimation system
 - **`generate_usage_report.py`** - Command-line tool for generating usage reports
 - **`client_config.py`** - Client-profile relationship management with visibility control
+- **`pdf_processor.py`** - PDF text extraction, LLM summarization for brand guidelines
+- **`media_plan_processor.py`** - Excel media plan parsing, filename matching, spec validation
 - **`web_ui.html`** - Single-page web interface for uploads and analysis

 ### Key Design Patterns
@@ -637,6 +639,30 @@ All authenticated user visits are now logged:
 - **Storage**: Same JSONL usage logs in `backend/usage_logs/`
 - **Purpose**: Enables admin panel to show all users who have visited, not just those who ran analyses

+### PDF Reference Asset Processing
+Multi-page PDF brand guidelines are now fully processed on upload:
+
+- **Text extraction**: All pages extracted using PyMuPDF (`pdf_processor.py`)
+- **LLM summarization**: Extracted text sent to Gemini 2.5 Pro for structured brand guidelines summary (2000-4000 words covering colors, typography, layout, do's/don'ts, QC specs)
+- **Cover image**: Page 1 extracted as PNG for visual reference in QC checks
+- **Storage**: `{file_id}_summary.txt` and `{file_id}_cover.png` in `brand_guidelines/files/`
+- **QC integration**: Summary text included in check prompts, cover image sent as visual reference
+- **Fallback chain**: LLM summary → raw text (8000 chars) → inline extraction → metadata only
+- **Auto-backfill**: Existing unprocessed PDFs processed on server startup
+- **API endpoints**: `GET /api/brand_guidelines/<id>/status`, `POST /api/brand_guidelines/<id>/reprocess`
+
+### Media Plan System
+Excel media plans can be uploaded per client for automatic asset validation:
+
+- **Upload**: Settings → Media Plan tab, accepts .xlsx/.xls files
+- **Parsing**: Extracts asset specs from all channel sheets (Display, OLV, OOH, TV, Print, Audio) using openpyxl
+- **Filename matching**: Automatic fuzzy matching (exact → case-insensitive → starts-with → contains → fuzzy >70%)
+- **Validation**: Checks uploaded asset dimensions and file type against media plan spec
+- **QC context**: Matched asset metadata (country, language, placement, vendor, dimensions) injected into all check prompts
+- **Storage**: `backend/media_plans/` directory with parsed JSON cache
+- **API endpoints**: `POST /api/media_plan`, `GET /api/media_plan?client={id}`, `DELETE /api/media_plan/<client_id>`
+- **Module**: `media_plan_processor.py` - `parse_media_plan()`, `find_matching_asset()`, `validate_asset_specs()`, `build_media_plan_context()`
+
 ## Production Deployment

 ### Critical Production Issues and Solutions
@@ -775,6 +801,7 @@ ProxyPassReverse /ai_qc http://localhost:7183
 | MSAL errors | Browser console | Clear browser cache, check concurrent sign-in protection |
 | Backend not starting | `systemctl status ai_qc` | Check Python environment, dependencies, port conflicts |
 | Permission errors | File ownership | Ensure www-data owns necessary directories |
+| Permission denied on new dirs | `git pull` resets ownership | `sudo chown -R www-data:www-data uploads output media_plans brand_guidelines usage_logs` |

 ## Pre-Session Completion Checklist
 Before ending any session, ALWAYS run these Python syntax and import checks:
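Separately from the checklist context line above: the fallback chain added in the PDF Reference Asset Processing hunk (LLM summary → raw text (8000 chars) → inline extraction → metadata only) could be sketched as below. This is a minimal illustration, not the code in `pdf_processor.py`. The function name `choose_guideline_context`, its parameters, the returned labels, and the choice to swallow extraction errors are all assumptions made for the example; only the order of the stages and the 8000-character cap come from the documentation.

```python
from typing import Callable, Optional, Tuple

RAW_TEXT_LIMIT = 8000  # character cap quoted in the docs for the raw-text stage


def choose_guideline_context(
    summary: Optional[str],
    raw_text: Optional[str],
    extract_inline: Callable[[], Optional[str]],
    metadata: str,
) -> Tuple[str, str]:
    """Hypothetical helper: return (stage_label, text) for the first usable stage."""
    # Stage 1: prefer the stored LLM summary when it exists and is non-empty.
    if summary and summary.strip():
        return "llm_summary", summary

    # Stage 2: fall back to the stored raw text, truncated to the documented cap.
    if raw_text and raw_text.strip():
        return "raw_text", raw_text[:RAW_TEXT_LIMIT]

    # Stage 3: try extracting text on the spot. Treating any exception as
    # "nothing extracted" is an assumption of this sketch.
    try:
        inline = extract_inline()
    except Exception:
        inline = None
    if inline and inline.strip():
        return "inline_extraction", inline

    # Stage 4: nothing else is available, so only metadata is passed on.
    return "metadata_only", metadata
```

Returning the stage label alongside the text makes it easy to log which stage supplied the context for a given check.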
--- a/backend/CLAUDE.md
+++ b/backend/CLAUDE.md
@@ -19,6 +19,8 @@ Visual AI QC is a Python Flask-based AI-powered quality control platform for ana
 - **`usage_tracker.py`** - Usage tracking and cost estimation system
 - **`generate_usage_report.py`** - Command-line tool for generating usage reports
 - **`client_config.py`** - Client-profile relationship management with visibility control
+- **`pdf_processor.py`** - PDF text extraction, LLM summarization for brand guidelines
+- **`media_plan_processor.py`** - Excel media plan parsing, filename matching, spec validation
 - **`web_ui.html`** - Single-page web interface for uploads and analysis

 ### Key Design Patterns
@@ -637,6 +639,30 @@ All authenticated user visits are now logged:
 - **Storage**: Same JSONL usage logs in `backend/usage_logs/`
 - **Purpose**: Enables admin panel to show all users who have visited, not just those who ran analyses

+### PDF Reference Asset Processing
+Multi-page PDF brand guidelines are now fully processed on upload:
+
+- **Text extraction**: All pages extracted using PyMuPDF (`pdf_processor.py`)
+- **LLM summarization**: Extracted text sent to Gemini 2.5 Pro for structured brand guidelines summary (2000-4000 words covering colors, typography, layout, do's/don'ts, QC specs)
+- **Cover image**: Page 1 extracted as PNG for visual reference in QC checks
+- **Storage**: `{file_id}_summary.txt` and `{file_id}_cover.png` in `brand_guidelines/files/`
+- **QC integration**: Summary text included in check prompts, cover image sent as visual reference
+- **Fallback chain**: LLM summary → raw text (8000 chars) → inline extraction → metadata only
+- **Auto-backfill**: Existing unprocessed PDFs processed on server startup
+- **API endpoints**: `GET /api/brand_guidelines/<id>/status`, `POST /api/brand_guidelines/<id>/reprocess`
+
+### Media Plan System
+Excel media plans can be uploaded per client for automatic asset validation:
+
+- **Upload**: Settings → Media Plan tab, accepts .xlsx/.xls files
+- **Parsing**: Extracts asset specs from all channel sheets (Display, OLV, OOH, TV, Print, Audio) using openpyxl
+- **Filename matching**: Automatic fuzzy matching (exact → case-insensitive → starts-with → contains → fuzzy >70%)
+- **Validation**: Checks uploaded asset dimensions and file type against media plan spec
+- **QC context**: Matched asset metadata (country, language, placement, vendor, dimensions) injected into all check prompts
+- **Storage**: `backend/media_plans/` directory with parsed JSON cache
+- **API endpoints**: `POST /api/media_plan`, `GET /api/media_plan?client={id}`, `DELETE /api/media_plan/<client_id>`
+- **Module**: `media_plan_processor.py` - `parse_media_plan()`, `find_matching_asset()`, `validate_asset_specs()`, `build_media_plan_context()`
+
 ## Production Deployment

 ### Critical Production Issues and Solutions
@@ -775,6 +801,7 @@ ProxyPassReverse /ai_qc http://localhost:7183
 | MSAL errors | Browser console | Clear browser cache, check concurrent sign-in protection |
 | Backend not starting | `systemctl status ai_qc` | Check Python environment, dependencies, port conflicts |
 | Permission errors | File ownership | Ensure www-data owns necessary directories |
+| Permission denied on new dirs | `git pull` resets ownership | `sudo chown -R www-data:www-data uploads output media_plans brand_guidelines usage_logs` |

 ## Pre-Session Completion Checklist
 Before ending any session, ALWAYS run these Python syntax and import checks:
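Separately from the checklist context line above: the filename matching order added in the Media Plan System hunk (exact → case-insensitive → starts-with → contains → fuzzy >70%) could look roughly like the following. This is a hedged sketch, not the actual `find_matching_asset()` from `media_plan_processor.py`. The function name `find_match_sketch`, comparing names with the file extension stripped, checking prefix and substring in both directions, and using `difflib.SequenceMatcher` for the fuzzy score are all assumptions; the source only states the order of the stages and the 70% threshold.

```python
from difflib import SequenceMatcher
from pathlib import Path
from typing import List, Optional, Tuple


def find_match_sketch(
    uploaded_name: str,
    plan_names: List[str],
    fuzzy_threshold: float = 0.70,
) -> Optional[Tuple[str, str]]:
    """Hypothetical cascade: return (plan_name, stage) or None if nothing matches."""
    target = Path(uploaded_name).stem  # assumption: compare without the extension
    if not target:
        return None
    candidates = [(name, Path(name).stem) for name in plan_names]
    t = target.lower()

    # 1. Exact match.
    for name, stem in candidates:
        if stem == target:
            return name, "exact"

    # 2. Case-insensitive match.
    for name, stem in candidates:
        if stem.lower() == t:
            return name, "case-insensitive"

    # 3. Starts-with, checked in both directions (an assumption).
    for name, stem in candidates:
        s = stem.lower()
        if s and (t.startswith(s) or s.startswith(t)):
            return name, "starts-with"

    # 4. Contains, again in both directions.
    for name, stem in candidates:
        s = stem.lower()
        if s and (s in t or t in s):
            return name, "contains"

    # 5. Fuzzy: keep the best similarity ratio strictly above the threshold.
    best_name, best_score = None, fuzzy_threshold
    for name, stem in candidates:
        score = SequenceMatcher(None, t, stem.lower()).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return (best_name, "fuzzy") if best_name is not None else None
```

Running the strict stages to completion before the looser ones means an exact hit on a later entry still wins over a loose hit on an earlier one.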