# H&M Quality Control (HMQC) System

## Overview

The H&M Quality Control (HMQC) system is a modular Python application that runs automated quality control checks on PDF documents and static images (JPG, PNG, PSD) for H&M marketing assets. It validates assets against multiple criteria, including filename formatting, image dimensions, imprint verification, language validation, pricing accuracy, and censorship requirements.

## Quick Links

- **[Development Setup Guide](DEV_SETUP.md)** - Setting up local dev environment
- **[Project Instructions](CLAUDE.md)** - Development guidelines and architecture
- **[Check Modules](#check-modules)** - Detailed check documentation

## Table of Contents

- [Architecture](#architecture)
- [Project Structure](#project-structure)
- [Key Components](#key-components)
- [Installation](#installation)
- [Environment Configuration](#environment-configuration)
- [Configuration](#configuration)
- [Usage](#usage)
- [Check Modules](#check-modules)
  - [PDF Check Modules](#pdf-check-modules)
  - [Image Check Modules](#image-check-modules)
- [API Integrations](#api-integrations)
- [Security Considerations](#security-considerations)
- [Development](#development)
- [Troubleshooting](#troubleshooting)

---

## Architecture

The HMQC system follows a modular, profile-based architecture:

```
┌─────────────┐
│  Launcher   │  (CLI or Box Hotfolder)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  QC Module  │  (Core Engine)
└──────┬──────┘
       │
       ├──► Profile JSON (defines checks)
       │
       ├──► Check Modules (modular validation)
       │    ├─ HM_parse
       │    ├─ HM_filename_parse
       │    ├─ HM_imprint_check
       │    ├─ HM_language_validate
       │    ├─ HM_price_currency_check
       │    └─ HM_censorship
       │
       └──► HTML Reporter (generates reports)
```

### Core Principles

1. **Modularity**: Each check is an independent module implementing a standard interface
2. **Context Sharing**: Checks share data via a context dictionary for inter-check dependencies
3. **Profile-Based Configuration**: JSON profiles define which checks to run and their parameters
4. **Standardized Results**: All checks return consistent status objects (`passed`, `error`, `skipped`)

---

## Project Structure

```
hm_qc/
├── qc_module.py                          # Core QC engine
├── config.py                             # Environment configuration (NEW)
├── requirements.txt                      # Python dependencies
├── README.md                             # This file
├── CLAUDE.md                             # Project instructions for Claude
├── DEV_SETUP.md                          # Development environment guide (NEW)
│
├── launchers/                            # Execution entry points
│   ├── HM_launcher_CLI.py                # Command-line interface (environment-aware)
│   └── ford_qc_box_hotfolder_process.py  # Box integration
│
├── checks/                               # QC check modules
│   ├── HM_parse.py                       # PDF parsing with LlamaParse
│   ├── HM_filename_parse.py              # PDF filename validation
│   ├── HM_imprint_check.py               # Imprint verification (includes country code)
│   ├── HM_image_parse.py                 # Image loading and processing (NEW)
│   ├── HM_image_filename_parse.py        # Image filename parsing (NEW)
│   ├── HM_image_dimension_check.py       # Image dimension validation (NEW)
│   ├── HM_language_validate.py           # Language detection
│   ├── HM_price_currency_check.py        # Currency validation
│   ├── HM_censorship.py                  # Censorship compliance - CEN only (UPDATED)
│   ├── analyze_with_gpt.py               # OpenAI GPT integration
│   ├── html_reporter.py                  # HTML report generation (environment-aware)
│   ├── business_data_check.py
│   ├── colour_existence_check.py
│   ├── file_size_check.py
│   ├── image_linking_check.py
│   ├── image_resolution_check.py
│   ├── missing_images_check.py
│   ├── special_requirements_mec_bau.py
│   └── unzip_and_verify_check.py
│
├── profiles/                             # Check configurations
│   ├── HM.json                           # H&M PDF profile
│   ├── HM_image.json                     # H&M image profile (NEW)
│   └── ford_bnp.json                     # Ford BNP profile
│
├── supporting/                           # Supporting assets
│   ├── censorship_trainset/              # Training images for censorship detection
│   └── HM_Pricing SLUSSEN_30-07-2024.pdf
│
├── tmp/                                  # Development temporary files (gitignored)
│   ├── HM_working/                       # Working directory (dev mode)
│   └── reports/                          # Generated reports (dev mode)
│
└── input_bucket/                         # Input file staging area
```

---

## Key Components

### 1. QC Module (`qc_module.py`)

The core engine that orchestrates check execution.

**Key Functions:**
- `run_qc_profile(profile_path, input_file)`: Executes all checks defined in a profile
- `run_single_check(script, config, context, check_id)`: Runs an individual check module
- `run_qc_checks(profile_path, input_file, report_path)`: Full workflow including report generation

**Context Management:**
Results from each check are stored in a shared `context` dictionary, so downstream checks can read data produced by upstream checks.

### 2. Launchers

#### CLI Launcher (`HM_launcher_CLI.py`)

Command-line interface for manual QC execution.

**Usage:**
```bash
python launchers/HM_launcher_CLI.py input.pdf /path/to/reports/
```

#### Box Hotfolder Launcher (`ford_qc_box_hotfolder_process.py`)

Automated processing of files uploaded to Box folders.

**Features:**
- Monitors Box folder for new `.zip` files
- Downloads, processes, and deletes files
- Uploads QC reports back to Box
- File locking to prevent concurrent runs

### 3. Check Module Pattern

All check modules implement a standard interface:

```python
def run_check(config: dict, context: dict, check_id: str) -> dict:
    """
    Args:
        config: Configuration parameters from profile JSON
        context: Shared context dictionary for inter-check data
        check_id: Unique identifier for this check

    Returns:
        {
            "status": "passed" | "error" | "skipped",
            "details": {...},
            "error_message": "..." (if status == "error")
        }
    """
```

### 4. HTML Reporter (`html_reporter.py`)

Generates Bootstrap-based HTML reports with:
- Collapsible accordion for each check
- Color-coded status badges
- Nested detail formatting
- Error highlighting

---

## Installation

### Prerequisites

- Python 3.8+
- pip package manager
- OpenAI API key
- LlamaParse API key

### Setup

1. **Clone the repository, then change into the deployment directory:**
   ```bash
   cd /path/to/deployment
   ```

2. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

3. **Configure API keys** (see [Security Considerations](#security-considerations)):
   - OpenAI API key for GPT analysis
   - LlamaParse API key for PDF parsing

4. **Set up paths:**
   Update hardcoded paths in launcher scripts:
   ```python
   # HM_launcher_CLI.py
   PROFILE_PATH = "/opt/QC/profiles/HM.json"

   # qc_module.py
   REPORTS_DIR = os.path.join(os.path.dirname(__file__), "reports")
   ```

5. **Create working directories:**
   ```bash
   mkdir -p /tmp/HM_working
   mkdir -p /opt/QC/reports
   ```

---

## Environment Configuration

The HMQC system supports two environments so that local development cannot affect production:

### Development Environment

For local testing and development:

```bash
# Set environment variable
export HM_QC_ENV=dev

# Run checks (uses local paths in ./tmp/)
python launchers/HM_launcher_CLI.py test.pdf ./tmp/reports/report.html
```

**Development paths:**
- Working directory: `./tmp/HM_working/`
- Reports: `./tmp/reports/`
- Supporting files: `./supporting/`

### Production Environment

For deployed production systems (default):

```bash
# Unset or use production mode
unset HM_QC_ENV
# or
export HM_QC_ENV=production

# Run checks (uses /opt/QC paths)
python launchers/HM_launcher_CLI.py input.pdf /opt/QC/reports/
```

**Production paths:**
- Working directory: `/tmp/HM_working/`
- Reports: `/opt/QC/reports/`
- Supporting files: `/opt/QC/supporting/`

### Configuration Module

The `config.py` module automatically handles environment detection:

```python
import config

# Check current environment
print(config.ENVIRONMENT)  # 'dev' or 'production'

# Get paths
print(config.WORKING_DIR)
print(config.REPORTS_DIR)
```

**See [DEV_SETUP.md](DEV_SETUP.md) for the complete development environment guide.**

---

## Configuration

### Profile JSON Structure

Profiles define the sequence of checks and their parameters:

```json
[
  {
    "id": "HM_parse",
    "script": "checks.HM_parse",
    "config": {
      "description": "Parses PDF with LlamaParse",
      "input_file": "supplied by launcher script",
      "working_dir": "/tmp/HM_working"
    }
  },
  {
    "id": "HM_filename_parse",
    "script": "checks.HM_filename_parse",
    "config": {
      "description": "Parses filename into components",
      "working_dir": "/tmp/HM_working"
    }
  }
]
```

**Key Fields:**
- `id`: Unique check identifier (used for context storage)
- `script`: Python module path (e.g., `checks.HM_parse`)
- `config`: Check-specific parameters

### H&M Filename Format

The system expects H&M filenames to follow one of these patterns:

**Pattern 1:** `dimensions_format_year_reference-number_language-country.pdf`
- Example: `21.6x27.9cm_letter_2028_10062-01_en-us.pdf`

**Pattern 2:** `dimensions_format_reference-number_(GEN|CEN).pdf`
- Example: `10.8x14cm_quarter_letter_1001D_10004-02_GEN.pdf`

---

## Usage

### Basic CLI Usage

```bash
python launchers/HM_launcher_CLI.py input.pdf /path/to/reports/
```

This will:
1. Execute all checks defined in `/opt/QC/profiles/HM.json`
2. Generate an HTML report in the specified directory
3. Print JSON results to stdout

### Box Hotfolder Integration

```bash
python launchers/ford_qc_box_hotfolder_process.py
```

**Configuration:**
- Modify `BOX_CLI_CONFIG_PATH` to point to your Box JWT config
- Set `SOURCE_FOLDER_ID` and `REPORT_FOLDER_ID`
- Update `PROFILE_PATH` to your desired profile

**Workflow:**
1. Script polls Box source folder for `.zip` files
2. Downloads and processes each file
3. Generates QC report
4. Uploads report to Box reports folder
5. Deletes processed files

---

## Check Modules

The HMQC system includes separate check profiles for PDF documents and static images.
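
Because PDFs and images use different profiles, something has to pick a profile for each input file. The sketch below shows one way that selection might look, assuming it is made by file extension. `select_profile` and `PROFILES_BY_EXTENSION` are hypothetical names and not part of the documented launchers; only the profile filenames (`HM.json`, `HM_image.json`) and the supported formats come from this README.

```python
from pathlib import Path

# Hypothetical mapping from file extension to profile file.
# The profile filenames are the ones listed under profiles/ in this README.
PROFILES_BY_EXTENSION = {
    ".pdf": "HM.json",
    ".jpg": "HM_image.json",
    ".png": "HM_image.json",
    ".psd": "HM_image.json",
}


def select_profile(input_file: str, profiles_dir: str = "profiles") -> Path:
    """Return the profile path for an input file, based on its extension."""
    extension = Path(input_file).suffix.lower()
    if extension not in PROFILES_BY_EXTENSION:
        raise ValueError(f"Unsupported file type: {extension or '(none)'}")
    return Path(profiles_dir) / PROFILES_BY_EXTENSION[extension]


if __name__ == "__main__":
    print(select_profile("21.6x27.9cm_letter_2028_10062-01_en-us.pdf"))
    print(select_profile("campaign_1200x400_en-US.jpg"))
```

Raising on unknown extensions, rather than falling back to a default profile, keeps an unsupported file from being silently checked against the wrong rules.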

### PDF Check Modules

#### HM_parse

**Purpose:** Parses PDF using LlamaParse API

**Context Output:**
- `extracted_text`: Full text content
- `parsed_image`: First page as PIL Image
- `all_images`: List of all page images

**Dependencies:** LlamaParse API

---

#### HM_filename_parse

**Purpose:** Extracts components from H&M PDF filename using GPT

**Context Input:** `HM_parse.filename`

**Context Output:**
```python
{
    "parsed": {
        "dimensions": "21.6x27.9cm",
        "format": "letter",
        "year": "2028",
        "reference": "9000_10107-06",  # Full reference with prefix
        "language": "en-us"
    }
}
```

**Updated:** Now correctly extracts full reference codes including numeric prefixes (e.g., `9000_10107-06`)

---

#### HM_imprint_check

**Purpose:** Verifies imprint/reference code matches filename including country code

**Context Input:**
- `HM_parse.extracted_text`
- `HM_filename_parse.parsed.reference`
- `HM_filename_parse.parsed.language`

**Logic:**
1. Uses GPT to detect imprint in document
2. Combines reference code with language/country code (e.g., `9000_10107-06_el-CY`)
3. Compares against detected imprint
4. Skips for "OOH" (out-of-home) files

**Updated:** Now includes country code in validation to detect mismatches like CY vs GR

---

### Image Check Modules

#### HM_image_parse

**Purpose:** Loads and processes static image files (JPG, PNG, PSD)

**Context Output:**
- `parsed_image`: PIL Image object
- `filename`: Original filename
- `image_format`: Format (JPEG, PNG, PSD)
- `image_size`: (width, height) in pixels
- `image_mode`: Color mode (RGB, RGBA, etc.)

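
The context output listed above could be assembled roughly as follows. This is a sketch, not the module's actual code: `build_image_context` is a hypothetical helper, and it assumes the image has already been loaded into an object exposing Pillow-style `format`, `size`, and `mode` attributes. A stand-in object is used so the example runs without Pillow installed.

```python
from types import SimpleNamespace
from typing import Any, Dict


def build_image_context(image: Any, filename: str) -> Dict[str, Any]:
    """Collect the HM_image_parse context fields from a loaded image.

    Hypothetical helper: `image` only needs Pillow-style `format`,
    `size`, and `mode` attributes.
    """
    return {
        "parsed_image": image,
        "filename": filename,
        "image_format": image.format,  # e.g. "JPEG", "PNG", "PSD"
        "image_size": image.size,      # (width, height) in pixels
        "image_mode": image.mode,      # e.g. "RGB", "RGBA"
    }


if __name__ == "__main__":
    # Stand-in for a Pillow image; the filename is illustrative only.
    fake_image = SimpleNamespace(format="PNG", size=(1080, 1920), mode="RGBA")

    context: Dict[str, Any] = {}
    context["HM_image_parse"] = build_image_context(fake_image, "example.png")

    # A downstream check would read the fields back by check id.
    width, height = context["HM_image_parse"]["image_size"]
    print(f"{width}x{height}")
```

Storing the entry under the check's `id` matches the `HM_image_parse.filename` notation used for context inputs elsewhere in this document.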
---

#### HM_image_filename_parse

**Purpose:** Parses image filename using pattern matching

**Context Input:** `HM_image_parse.filename`

**Context Output:**
```python
{
    "parsed": {
        "language": "en-US",
        "format_type": "DOOH",
        "campaign_number": "4045",
        "dimensions": "1080x1920"
    }
}
```

**Supported Formats:**
- SOME STATIC: `Market_Language_campaignnumber_...`
- DOOH/OOH/Display: `CampaignNumber_Type_Static_..._Size_Language-Market`
- POS GEN: `Size_Format_Campaign_POPNumber_GEN`
- POS Country: `Size_Format_Campaign_POPNumber_Language-Market`
- DS: `Campaign_Name_Index_BU_Resolution_language-COUNTRY`

---

#### HM_image_dimension_check *(NEW)*

**Purpose:** Validates that the actual image dimensions match the filename specification

**Context Input:**
- `HM_image_parse.parsed_image`
- `HM_image_filename_parse.parsed.dimensions`

**Logic:**
1. Extracts expected dimensions from filename (e.g., `1200x400`)
2. Gets actual image dimensions from PIL Image object
3. Compares width × height in pixels

**Example:**
- Filename: `campaign_1200x400_en-US.jpg`
- Expected: 1200px × 400px
- Actual: 1200px × 400px → ✅ PASS
- Actual: 1500px × 600px → ❌ ERROR

---

### Shared Check Modules

These checks work for both PDFs and images:

#### HM_language_validate

**Purpose:** Validates document/image language matches filename code

**Context Input:**
- `HM_parse.extracted_text` OR `HM_image_parse.parsed_image`
- `HM_filename_parse.parsed.language` OR `HM_image_filename_parse.parsed.language`

**Context Output:**
- `detected_language`: ISO language code (e.g., "en-US")
- `matches`: Boolean
- `isCensorshipRequired`: True for CEN markets

**Special Cases:**
- `CEN`: Censored market (requires body coverage checks)
- `GEN`: General market (no censorship required)

---

#### HM_price_currency_check

**Purpose:** Validates currency matches regional expectations

**Context Input:**
- Content (text or image)
- Language/market code

**Logic:**
1. Detects currency and price using multimodal GPT analysis
2. Validates currency against region (e.g., USD for en-us)
3. Skips for CEN/GEN markets

---

#### HM_censorship

**Purpose:** Verifies proper clothing coverage for CEN (censored) markets ONLY

**Context Input:**
- `parsed_image` (from HM_parse or HM_image_parse)
- `parsed.language` (from filename parse)

**Process:**
1. **SKIPS for GEN files** - GEN assets do not require censorship checks
2. **SKIPS for standard market files** - Only CEN files are checked
3. **RUNS for CEN files only** - Validates body coverage using AI

**Algorithm:**
1. Trains DSPy-based classifier using training images
2. Analyzes document images for body coverage
3. Validates against censorship requirements

**Training Set:** `supporting/censorship_trainset/` (relative to the project root; `/opt/QC/supporting/censorship_trainset/` in production)
- Censored images: `*-C.png`
- Uncensored images: `*-U.png`

**Technology:** Uses DSPy with MIPROv2 optimization

**Updated:** Now correctly only runs checks on CEN files, not GEN files

---

## API Integrations

### 1. LlamaParse

**Purpose:** PDF parsing and image extraction

**Usage:**
- Text extraction: `result_type="text"`
- Image generation: `result_type="markdown"` with multimodal model

**Key in Code:** `checks/HM_parse.py:14`

---

### 2. OpenAI GPT-4o

**Purpose:** Content analysis, language detection, validation

**Usage:**
- Synchronous client in `analyze_with_gpt.py`
- Multimodal support (text + images)
- JSON and text response modes

**Models Used:**
- `gpt-4o`: Primary analysis model
- `gpt-4o-mini`: Used in DSPy for censorship detection

**Key in Code:** `checks/analyze_with_gpt.py:13`

---

### 3. DSPy

**Purpose:** AI-powered censorship detection

**Components:**
- `ImageDescription`: Generates clothing coverage descriptions
- `ImageCensorshipDetection`: Classifies censorship status
- `MIPROv2`: Optimizes prompts based on training set

**Key in Code:** `checks/HM_censorship.py:11`

---

### 4. Box Python SDK

**Purpose:** Cloud storage integration for automated workflows

**Authentication:** JWT-based service account

**Configuration File:** `ford_box_config.json`

---

## Security Considerations

### CRITICAL ISSUES IDENTIFIED

⚠️ **HARDCODED API KEYS FOUND** ⚠️

The following files contain hardcoded API keys that must be addressed immediately:

1. **`checks/HM_parse.py:14`**
   ```python
   os.environ['LLAMA_CLOUD_API_KEY'] = '<REDACTED>'
   ```

2. **`checks/analyze_with_gpt.py:13`**
   ```python
   client = OpenAI(api_key="<REDACTED>")
   ```

3. **`checks/HM_censorship.py:11`**
   ```python
   os.environ["OPENAI_API_KEY"] = "<REDACTED>"
   ```

### Recommended Security Improvements

1. **Use Environment Variables:**
   ```python
   # Replace hardcoded keys with:
   import os

   OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
   LLAMA_CLOUD_API_KEY = os.getenv('LLAMA_CLOUD_API_KEY')
   ```

2. **Use .env Files (with python-dotenv):**
   ```python
   import os

   from dotenv import load_dotenv

   # Load variables from a local .env file into the process environment
   load_dotenv()
   OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
   ```

3. **Secret Management Systems:**
   - AWS Secrets Manager
   - HashiCorp Vault
   - Azure Key Vault

4. **Rotate Exposed Keys:**
   - All hardcoded keys should be rotated immediately
   - Monitor API usage for unauthorized access

### Additional Security Considerations

- **File System Access**: The system writes to `/tmp/HM_working` and `/opt/QC/reports/` - ensure proper permissions
- **Box Authentication**: JWT config contains sensitive credentials - protect `ford_box_config.json`
- **Working Directory Cleanup**: Temporary files contain potentially sensitive document content
- **Context Snapshots**: Be cautious about logging context data (disabled by default in `qc_module.py:117`)

---

## Development

### Adding a New Check Module

1. **Create check file in `checks/` directory:**
   ```python
   # checks/my_new_check.py

   def run_check(config: dict, context: dict, check_id: str) -> dict:
       # 1. Get required data from context
       input_data = context.get("previous_check", {})

       # 2. Perform validation logic
       #    (validate_something is a placeholder for your own function)
       result = validate_something(input_data)

       # 3. Store results in context
       context[check_id] = {"validation_result": result}

       # 4. Return status
       if result:
           return {"status": "passed", "details": {...}}
       else:
           return {"status": "error", "error_message": "..."}
   ```

2. **Add to profile JSON:**
   ```json
   {
     "id": "my_new_check",
     "script": "checks.my_new_check",
     "config": {
       "description": "Validates something important",
       "param1": "value1"
     }
   }
   ```

### Testing

**Manual Testing:**
```bash
python launchers/HM_launcher_CLI.py test_file.pdf ./test_reports/
```

**Context Debugging:**
Uncomment line 117 in `qc_module.py` to enable context snapshots:
```python
aggregated_results["context_snapshot"] = context
```

---

## Troubleshooting

### Common Issues

**1. "Module not found" errors**
- Ensure `/opt/QC` is in your Python path
- Check `sys.path.append()` statements in launchers

**2. "Working directory not found"**
- Create `/tmp/HM_working` manually: `mkdir -p /tmp/HM_working`
- Verify permissions on working directories

**3. API authentication failures**
- Verify API keys are valid and not expired
- Check API usage limits/quotas
- Ensure proper environment variable configuration

**4. Box integration issues**
- Validate JWT config file path
- Ensure service account has folder access
- Check Box folder IDs are correct

**5. PDF parsing failures**
- Verify LlamaParse API key and quota
- Check PDF is not corrupted or password-protected
- Ensure sufficient disk space in working directory

**6. Censorship check failures**
- Verify training set images exist in `/opt/QC/supporting/censorship_trainset/`
- Ensure images follow naming convention (`*-C.png` for censored, `*-U.png` for uncensored)
- Check DSPy model configuration

### Logging

**CLI Launcher:** Uses Python `logging` module at INFO level

**Box Hotfolder:** Logs to `log/ford_qc_script.log`

**Increasing Verbosity:**
```python
logging.basicConfig(level=logging.DEBUG)
```

---

## Dependencies

### Core Dependencies

- **Python 3.8+**
- **llama-parse**: PDF parsing and image extraction
- **openai**: GPT-4o API integration
- **dspy**: AI-powered decision systems
- **PIL/Pillow**: Image processing
- **boxsdk**: Box cloud storage integration
- **nest_asyncio**: Async event loop management

### Full Dependency List

See `requirements.txt` for the complete list, including:
- `fastapi`, `uvicorn`, `httpx` (web framework components)
- `pandas`, `numpy` (data processing)
- `nltk`, `tiktoken` (text processing)
- `APScheduler` (task scheduling)
- `datasets`, `optuna` (ML/AI components)

---

## System Requirements

### Hardware

- **CPU**: Multi-core recommended for DSPy optimization
- **RAM**: 4GB minimum, 8GB+ recommended
- **Storage**: 500MB + space for temporary files

### Operating System

- Linux (Ubuntu 20.04+)
- macOS (tested on Darwin 24.5.0)
- Windows (with path adjustments)

### Python Version

- Python 3.8 - 3.11 (tested)
- Not tested with Python 3.12+

---

## Roadmap / Known Limitations

### Current Limitations

1. ✅ ~~Hardcoded file paths~~ - **FIXED** with environment configuration (`config.py`)
2. Hardcoded API keys in source code
3. Limited error recovery in check chains
4. No parallel check execution
5. Box hotfolder requires manual startup

### Recent Improvements

- ✅ **Environment Configuration** - Dev/production mode support
- ✅ **Image QC Support** - Full validation for JPG, PNG, PSD files
- ✅ **Image Dimension Check** - Validates pixel dimensions against filename
- ✅ **Enhanced Imprint Check** - Now includes country code validation
- ✅ **Fixed Censorship Rules** - CEN-only checking (GEN files skip)

### Future Improvements

- Parallel check execution where possible
- Retry logic for API failures
- Real-time progress reporting
- Dashboard for QC history
- Support for additional file formats
- Externalize API keys to environment variables

---

## License

[Specify license here]

---

## Contact / Support

For issues or questions:
- Check existing documentation in `CLAUDE.md`
- Review code comments in individual modules
- Contact development team

---

## Appendix: File Reference

### Critical Files

| File | Purpose | Contains Sensitive Data |
|------|---------|------------------------|
| `qc_module.py` | Core orchestration engine | No |
| `checks/analyze_with_gpt.py` | OpenAI integration | ⚠️ API Key |
| `checks/HM_parse.py` | PDF parsing | ⚠️ API Key |
| `checks/HM_censorship.py` | Censorship detection | ⚠️ API Key |
| `launchers/ford_qc_box_hotfolder_process.py` | Box integration | ⚠️ Box Config Path |
| `profiles/HM.json` | H&M check configuration | No |
| `ford_box_config.json` | Box JWT credentials | ⚠️ Yes |

---

**Last Updated:** 2025-01-12
**Version:** 1.2
**Maintainer:** H&M QC Team

## Recent Updates (v1.2)

- Added development environment configuration
- Implemented image QC checks (JPG, PNG, PSD support)
- Added image dimension validation
- Enhanced imprint check with country code validation
- Fixed censorship rules (CEN-only, GEN skip)
- Environment-aware path configuration