hm_ai_qc_report_tool

Author	SHA1	Message	Date
nickviljoen	6b8b8ea5a6	Video Master: revert campaigns folder + lenient name matching The earlier swap to BOX_CAMPAIGNS_FOLDER_ID=133295752718 was wrong — Video Master operates on the automation campaigns folder (156182880490), where subfolders are named by campaign TITLE rather than the numeric job ID used in Reporting's root. Reverted the default in config.py and all three .env example files. Folder naming on Box is inconsistent — '1_CFUL263C01C_Kids drop1' vs '1_CFUL263C01F-Kids drop 2' vs 'Summer Activation 2026' all coexist. search_subfolder now strips every non-alphanumeric character from both the search input and the folder names before substring match, so: "kids drop 1" → matches "1_CFUL263C01C_Kids drop1" "Spring 2026" → matches "4023 Spring 2026" "winterfilm" → matches "1_WA20263C01 Winter Film" Form label/placeholder updated to "Campaign Title" with a hint that spaces/underscores/hyphens/case are all ignored.	2026-05-09 20:19:35 +02:00
nickviljoen	087224976a	Box: search-API-first lookup + 60s enumeration cap The previous search_subfolder implementation paginated the entire parent folder before falling back to Box's indexed search API. With the campaigns folder containing thousands of children, this exceeded even the new 5-minute background-thread cap and surfaced as 'Search timed out after 5 minutes' to the user. Now: 1. Hit the indexed search API first (~1-2s typical, even on huge parents) and return immediately on a match. 2. Fall back to a streaming enumeration only for fresh folders Box hasn't indexed yet (~10 min latency window). Capped at 60s wall clock so we don't loop forever on a missing campaign. Also improves the not-found error message to mention the indexing latency caveat, which handles the otherwise-confusing case where a freshly-created campaign folder isn't searchable for a few minutes.	2026-05-09 20:03:53 +02:00
nickviljoen	a500d7b088	Six tooling fixes from Dev test pass Video QC: * _extract_locale_from_filename now also handles the suffix form ..._XX-yy.ext (case-insensitive both sides), so DOOH/OOH-style adapt filenames like ..._ES-es.mp4 unblock the price_currency check instead of skipping with "could not extract locale". * Batch results page expires the SQLAlchemy session at the top of the route so the post-completion reload sees committed reports even when it lands on a different gunicorn worker than the one that wrote them. Reload delay bumped 1s → 2s for margin. * visual_quality prompt now passes the filename's market+language to the LLM and tells it the on-screen copy should be in the localized language, not the source-language guideline copy. Stops Spanish-market videos being flagged as "language mismatch with English campaign guidelines". Printer Check: * regions.json rewritten to cover all 10 H&M regions (AME, CEU, NEU, GCN, IND, SHE, SEU, EEU, EAS, Franchise) with default-all groups. Two judgement calls vs the screenshot: kept TR for Turkey (TK is Tokelau in ISO and would break filename matching) and BR for Brazil (every other code is 2-letter ISO). Campaign codes: * New core/utils/campaign_code.py is the single source of truth. Matches both the legacy 4-digits-plus-optional-letter (1013A, 4116) and the new 11-char alphanumeric with year at positions 5-6 (CFUL263C01D). All four prior parser sites now import from this helper. Video Master: * BOX_CAMPAIGNS_FOLDER_ID switched 156182880490 → 133295752718 (same root the Reporting tool uses). Updated config.py default and all three .env example files. * Match page now shows which Box folder the search runs against (with a clickable link), and on a not-found error explains what was searched for so missing-campaign cases are self-diagnosable.	2026-05-09 18:32:23 +02:00
nickviljoen	84326352b2	Phase 1: replace local username/password auth with Azure AD SSO Lifted JWT-cookie auth pattern from the AI QC sibling project: core/auth/middleware.py validates Azure AD JWTs and stores them in an httpOnly cookie (hm_aiqc_auth_token). Tenant membership is enforced by JWTValidator's tid check, which is sufficient for the tenant-wide access policy chosen for this project. templates/login.html now drives an MSAL.js popup that POSTs the ID token to /auth/login. base.html exposes Azure config to all pages so the logout button can also clear the MSAL session. app.py's @before_request now checks the JWT cookie and exposes g.user; modules read user identity via core.auth.current_user_email so usage logs and created_by columns now record the signed-in user's email rather than a session value. Legacy username/password code removed: top-level auth_middleware.py, jwt_validator.py, deploy/generate_password.py.	2026-05-09 13:59:29 +02:00
nickviljoen	2258fa532b	Phase 0: bootstrap Alembic, add /health, prep for Dev/Prod cutover - core/health blueprint exposes GET /health for deploy smoke tests - Replace db.create_all() + ensure_schema() ALTER patch with Alembic - Initial migration captures current schema (5 tables, all indexes) - docker-entrypoint runs wait_for_db.py + flask db upgrade before gunicorn	2026-05-09 13:47:54 +02:00
nickviljoen	39383db95f	Pricing refs: Excel support, structured lookup, deterministic price match, video price check A. Excel upload: /campaigns/pricing/upload now accepts .xlsx/.xls alongside .pdf. File picker in the campaigns UI matches. B. Deterministic Excel parser (openpyxl, no LLM): looks for H&M-style mastersheets: - 'MPC Prices' sheet -> flat list of {product_id, language, country, price, currency, product_name} entries (this is the gold mine). - Regional sheets (AME/CEU/EEU/...) -> formatted prices per locale used to derive currency symbol, position, decimal/thousands separators. Skips OLD/COPY sheets. Verified against the attached 1013A mastersheet: 448 price entries across 7 products x 74 locales, 139 locale format entries. Parser lives in modules/campaigns/pricing_parser.py alongside the existing PDF path (which now also returns the structured form with empty _prices). New lookup shape stored in PricingReference.parsed_data_json: {"_format": {"en-US": {currency_code, symbol, position, ...}, ...}, "_prices": [{product_id, language, country, price, currency, product_name}, ...]} Legacy flat {"<code>": {...}} is still recognised (treated as _format only) for backwards compatibility with the legacy global JSON import. Model helpers added: - PricingReference.get_format_map() - PricingReference.get_prices() to_dict() now reports price_count alongside entry_count. C. Upgraded price_currency_check.py: when a pricing reference with _prices is attached, the check runs a deterministic comparison: detected price(s) -> normalize (_normalize_price handles '$49.99', '39,99 €', 'CHF 49.95', '1.234,56', 'Rs. 2,799', '13 995 Ft', '349,-', '0.999.000'...) -> compare with tol=0.005 against the expected per-locale rows. LLM-based campaign-sheet fallback only runs if no _prices are present (legacy PDF reference or has_pricing campaign presentation). D. Video QC price check: new _run_price_check step in the executor. Parses filename (Market_lang_CampaignNum_... -> 'lang-Market' locale), detects prices across frames via the same Gemini/GPT-4o path the other checks use, then deterministic-validates against the attached pricing reference. Skipped if no pricing ref, unknown locale, GEN/CEN markets, or no price visible in video. Overall video score now uses weighted mean of active (non-skipped) checks (visual_quality w=50, censorship w=50, price_currency w=30) instead of the hardcoded 50/50 split, so skipping any one check falls through cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 10:52:39 +02:00
nickviljoen	e5d0d468db	Pricing references: standalone library (was single global file) The "Global Pricing Reference" is no longer a single file at storage/reference/global_pricing.json. Pricing references are now first-class DB rows (PricingReference model), uploadable as a library in the Campaigns tab and selectable per-run alongside the campaign presentation dropdown on the HM QC and Video QC configure pages. New: - core/models/pricing_reference.py: PricingReference model: id, name, pdf_filename, pdf_path, parsed_content, parsed_data_json, status, created_at/by. get_lookup() deserializes parsed_data_json; to_dict() powers the dropdown API. - /campaigns/pricing/upload: creates a PricingReference row, saves PDF under storage/pricing_references/<id>/, kicks off background parse. - /campaigns/pricing/<id> DELETE, /campaigns/api/pricing/list, /campaigns/api/pricing/status/<id>. - Campaigns index: "Pricing References" table card (mirrors the presentations card) + upload form with optional name field. Changed: - pricing_parser: parse_pricing_pdf_to_dict returns (dict, raw_text); new parse_pricing_reference(id) runs the parse against a DB row and sets status to ready/error. Legacy file-based path removed. - QCExecutor and VideoQCExecutor accept pricing_reference_id; load the row into context['pricing_reference']={id, name, lookup}. - BatchQCExecutor and BatchVideoQCExecutor thread pricing_reference_id through to per-file executors. - price_currency_check._validate_currency reads context instead of the disk file; returns 'skipped_no_reference' if no ref attached. - HM QC + Video QC /execute and /execute/batch routes pass pricing_reference_id from the JSON payload. - Configure templates for HM QC and Video QC add a second dropdown "Pricing Reference (Optional)" loaded from /campaigns/api/pricing/list. Backwards compatibility: - app.py: on startup, if storage/reference/global_pricing.json exists and the pricing_references table is empty, import it as a "Default (legacy global)" PricingReference row so existing installs keep a valid reference attached (user can pick it at configure time). - config.py: retains GLOBAL_PRICING_{PDF,JSON}_PATH for the legacy importer; adds PRICING_REF_STORAGE_PATH for the new per-row storage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 10:27:09 +02:00
nickviljoen	6341714899	Split input/output token tracking; refresh provider pricing table UsageLog now records input_tokens and output_tokens separately and costs each side at its real rate. The old single 'blended' rate underpriced input-heavy workloads (vision/QC) and overpriced output-heavy ones. COST_PER_MILLION_TOKENS rebuilt against the live OpenAI, Gemini and Anthropic pricing pages (GPT-5.4 family, GPT-4.x, o4-mini; Gemini 2.5 Pro/Flash/Flash-Lite + 1.5 legacy; Claude 4.7/4.6/4.5 + 3.x legacy). Unknown models now warn instead of silently defaulting to $5/1M. Adds idempotent ALTER TABLE migration on startup so existing SQLite DBs pick up the new columns. Dashboard + API surface the input/output split. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 14:40:13 +02:00
nickviljoen	ffb4745d83	Batch naming, delete batch, consistent results view - Show job number in batch header instead of just "Batch <date>" - Add delete batch button (trash icon) that removes all reports + files - New DELETE /hm-qc/report/batch/<batch_id> route - Unified batch results view: always renders from DB reports (not ephemeral progress tracker data), so the view is identical whether you just completed a batch or navigated back from another tab - Include thumbnails in batch results per-file rows Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 16:38:25 +02:00
nickviljoen	63b8a04c46	Fix persistent OOM: reduce image size, force GC, recycle workers Still OOM after 7 files despite sequential processing. Root cause: Python's allocator doesn't return freed memory to the OS, so image buffers accumulate across files until the OOM killer strikes. Fixes: - Reduce LLM image max size from 2000px to 1200px (64% less RAM per image, still sufficient for vision analysis) - Always close PIL images immediately (not just when opened locally) - Replace ThreadPoolExecutor with simple sequential loop + gc.collect() after each file to force memory reclamation - Switch gunicorn to gthread (2 workers x 2 threads) for better request concurrency without extra memory overhead - Add max_requests=200 to auto-recycle workers and release accumulated memory Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 16:17:59 +02:00
nickviljoen	5e3f071344	Fix OOM crash on large batches: reduce concurrency and free image memory Worker was SIGKILL'd by OOM killer during batch QC (18 files). Fixes: - Reduce MAX_CONCURRENT_FILES from 2 to 1 (sequential processing) - Reduce gunicorn workers from 4 to 2 (less memory contention) - Explicitly close PIL images after thumbnail generation - Close BytesIO buffers and PIL images after base64 encoding Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 16:02:30 +02:00
nickviljoen	8a7d477c86	Fix batch QC: add Flask app context to ThreadPoolExecutor child threads ThreadPoolExecutor workers don't inherit the parent thread's Flask app context, causing "Working outside of application context" errors during batch QC execution. Pass the app instance into BatchQCExecutor and wrap each child thread's work with app.app_context(). Also ensure the progress_sessions table is created on fresh databases. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 15:20:56 +02:00
nickviljoen	d036752d17	v2.2.0: Gemini video, batch grouping, thumbnails, speed, price fix, printer check - Video QC: Switch to Google Gemini direct video analysis as default (OpenAI frame grid fallback) - HM QC: Group reports by batch with collapsible sections, ZIP download per batch - HM QC: Generate asset thumbnails (150px) displayed in report listings - Speed: Remove artificial delays, add ThreadPoolExecutor(2) for parallel batch processing - Price detection: Improved prompt with country context, detect all prices, increased text limit - New Printer Check module: CSV-to-PDF cross-referencing ported from CrossMatch Rust app Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 13:56:07 +02:00
nickviljoen	9c33858726	Add campaign presentation management and global pricing reference Introduces a new Campaigns module for uploading campaign presentation PDFs that QC checks reference to validate assets against campaign-specific guidelines (typography, layout, copy, pricing format). Also adds a global pricing reference system that maps country codes to currency symbols and formats for deterministic price/currency validation. - New CampaignPresentation model + campaigns blueprint with CRUD routes - PDF parsing via LlamaParse (text + multimodal page images) - Global pricing PDF parsed into structured JSON lookup - Campaign context injected into both image and video QC executors - Quality checks enhanced with campaign guidelines in LLM prompts - Price/currency check uses global pricing lookup (saves an LLM call) - Campaign dropdown added to HM QC and Video QC configure pages Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 16:12:22 +02:00
nickviljoen	6205b1cb18	Rewrite Box folder methods to avoid .get() entirely Box SDK .get() on folder objects fails with "Item.get() takes 1 positional argument" in the deployed environment. Replaced all folder.get() calls with a new _get_folder_items() helper that uses get_items() with pagination, falling back to folder.get() only as last resort. This fixes list_subfolders, list_video_files, and search_subfolder. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 22:00:16 +02:00
nickviljoen	272b8ea055	Fix list_video_files to search subfolders recursively Global Masters folder contains subfolders (DOOH, DS, OLV, etc.) with videos inside them, not videos directly. Added recursive=True option to search one level of subfolders for video files. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 21:50:03 +02:00
nickviljoen	ccfa49cdad	Fix Box SDK folder.get() call — remove fields parameter Box SDK v3 Item.get() doesn't accept fields as positional argument. Remove fields parameter and let it return full folder info including item_collection with inline entries. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 21:44:26 +02:00
nickviljoen	834b9ee3e2	Fix Box API for collaborated folders: use folder.get() with inline items The CAMPAIGNS folder is owned by a different user and shared via collaboration. get_items() and search API fail with "not found" for these folders, but folder.get() works and returns inline items. - Rewrite search_subfolder() to use folder.get() first, with pagination fallback for folders with >100 items - Rewrite list_subfolders() and list_video_files() with same approach - Add BOX_CAMPAIGNS_FOLDER_ID config (156182880490) separate from the QC reports folder Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 21:36:25 +02:00
nickviljoen	80d305d123	Fix Video Master: use correct Box campaigns folder ID, improve search - Add BOX_CAMPAIGNS_FOLDER_ID config (156182880490) separate from BOX_REPORT_FOLDER_ID which is for QC reports - Update search_subfolder() to use Box search API first (fast for large folders with 1000+ campaigns), fall back to folder listing - Increase folder listing limit from 200 to 500 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 21:15:59 +02:00
nickviljoen	7feead49d1	Implement Video Master: campaign-based master-to-adaptation matching Full workflow: - Enter campaign name → search Box for campaign folder - Auto-discover Global Masters and Regional Masters subfolders - Preview: shows master count, countries, adaptation count - Phase 1: Download each master to temp, fingerprint, delete video - Phase 2: Download each adaptation to temp, match against masters, delete - Results: per-master adaptation mapping, unmatched items, match rate - HTML report with detailed breakdown - Previous Matching Jobs table with View/Delete Box client additions: - search_subfolder() - case-insensitive subfolder search - list_subfolders() - enumerate child folders - list_video_files() - list video files in folder - download_file_to_disk() - streaming download for large files (ProRes) Storage: only fingerprints (~50KB) + key frames stored permanently. Videos deleted immediately after processing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 21:06:37 +02:00
nickviljoen	b4e94ad4eb	Update default Google model to gemini-2.5-flash Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 18:59:00 +02:00
nickviljoen	e910e00edf	Add Usage Dashboard with token tracking, cost estimates, and filters - New UsageLog model tracking every LLM API call (provider, model, tokens, estimated cost, user, module, check name) - Instrument LLMConfig.call_vision_api() to auto-log each call - New /usage tab in nav bar with dashboard showing: - Summary cards (total calls, tokens, estimated cost) - Breakdowns by provider, model, tool, and user - Recent API calls table - Time filters (All Time, 30 Days, 7 Days, Today) - Cost estimates based on per-model token pricing - Pass logged-in user through executor context for tracking Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 18:17:21 +02:00
nickviljoen	501db24e05	Fix Box search infinite pagination loop Box search generator auto-paginates through ALL results (35k+ for broad queries). Added iteration caps to prevent runaway API calls: - _search_folder_by_name: cap at 50 results - _search_files_by_job_number: cap at 100 results - _scan_folder_by_name: cap at 1000 items Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 17:00:00 +02:00
nickviljoen	91dec41e0b	Batch 3: Add title legibility check, Google Gemini support, LLM provider selector - Update image quality prompt to evaluate text/title legibility - Add Google Gemini (generativeai) as LLM provider in LLMConfig - Add AI Provider dropdown on configure page (OpenAI GPT-4o / Google Gemini) - Pass selected provider through execute routes to override profile defaults - Add google-generativeai to requirements.txt Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 16:53:07 +02:00
nickviljoen	f21e41afc3	v1.2.0: Add Docker deployment, simplify auth to local login, production config - Add Dockerfile, docker-compose.yml, .dockerignore for containerised deployment - Add deploy/ scripts (deploy.sh, nginx/apache configs, password generator) - Replace MSAL/Azure AD auth with local username/password authentication - Add login.html template - Simplify app.py, middleware, and auth routes for production use - Update gunicorn_config.py and wsgi.py for Docker/production - Update templates to work with new auth and URL prefix handling Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 14:37:53 +02:00
nickviljoen	ffd8b7303c	v1.1.0: Add progress tracking, CSV export, multi-job support, batch processing, and security fixes - Reporting: async search with SSE progress bar, CSV export with Box file links, multi-job support, designer-friendly error display with action guidance - HM QC: batch file upload (up to 100 files), batch execution with rate limiting, batch results summary - Fix: SQLAlchemy stale cache in SSE progress streaming (expire_all + commit) - Fix: Box folder pagination loop (search API instead of iterating 10,300 folders) - Fix: HM QC blank screen (progress.js not loaded, hardcoded wrong URLs) - Security: remove hardcoded API keys from legacy files, read from .env instead Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 09:43:20 +02:00
nickviljoen	e6f3e9387e	Add modular architecture, core framework, and web UI New blueprint-based module system (hm_qc, video_qc, video_master, reporting), core framework (database, config, templates), and unified web interface with progress tracking and tab navigation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-25 11:39:04 +02:00
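
Note on commit 6b8b8ea5a6 (lenient name matching): the message describes stripping every non-alphanumeric character from both the search input and the folder names before a substring match. A minimal sketch of that idea follows. The names `normalize_name` and `lenient_match` are hypothetical, the real `search_subfolder` is not shown in this log and talks to Box rather than to a list of strings, and lowercasing via `casefold()` is an assumption based on the "case ... ignored" hint in the message.

```python
from typing import List


def normalize_name(text: str) -> str:
    """Casefold the text and keep only alphanumeric characters."""
    return "".join(ch for ch in text.casefold() if ch.isalnum())


def lenient_match(query: str, folder_names: List[str]) -> List[str]:
    """Return folder names whose normalized form contains the normalized query."""
    needle = normalize_name(query)
    if not needle:
        # An empty needle would match every folder, so treat it as no match.
        return []
    return [name for name in folder_names if needle in normalize_name(name)]


# Folder names quoted in the commit message.
folders = [
    "1_CFUL263C01C_Kids drop1",
    "1_CFUL263C01F-Kids drop 2",
    "Summer Activation 2026",
    "4023 Spring 2026",
    "1_WA20263C01 Winter Film",
]

print(lenient_match("kids drop 1", folders))
print(lenient_match("Spring 2026", folders))
print(lenient_match("winterfilm", folders))
```

Each of the three queries from the commit message selects exactly the folder the message pairs it with. Because the match is a plain substring test, a short query such as "2026" would return several folders, so a caller would still need a rule for choosing among multiple hits.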

27 commits
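
Note on commit 39383db95f (overall video score): the message says the score is a weighted mean over active (non-skipped) checks with weights visual_quality=50, censorship=50, price_currency=30. The sketch below shows one way such a mean could be computed. The function name `overall_score`, the `WEIGHTS` dictionary, and the use of `None` to mark a skipped check are assumptions for illustration, not the executor's actual code.

```python
from typing import Dict, Optional

# Weights as quoted in the commit message.
WEIGHTS = {"visual_quality": 50, "censorship": 50, "price_currency": 30}


def overall_score(scores: Dict[str, Optional[float]]) -> Optional[float]:
    """Weighted mean over checks that ran; None marks a skipped check (assumed)."""
    active = {
        name: score
        for name, score in scores.items()
        if score is not None and name in WEIGHTS
    }
    total_weight = sum(WEIGHTS[name] for name in active)
    if total_weight == 0:
        # Every check was skipped, so there is nothing to average.
        return None
    weighted_sum = sum(WEIGHTS[name] * score for name, score in active.items())
    return weighted_sum / total_weight


# With the price check skipped, the remaining two checks share the weight equally.
print(overall_score({"visual_quality": 80, "censorship": 100, "price_currency": None}))  # 90.0
```

Dividing by the sum of the active weights rather than a fixed total is what lets a skipped check drop out without dragging the score down.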