H&M Quality Control (HMQC) System
Overview
The H&M Quality Control (HMQC) system is a modular Python application designed to perform automated quality control checks on both PDF documents and static images (JPG, PNG, PSD) for H&M marketing assets. The system validates assets against multiple criteria including filename formatting, image dimensions, imprint verification, language validation, pricing accuracy, and censorship requirements.
Quick Links
- Development Setup Guide - Setting up local dev environment
- Project Instructions - Development guidelines and architecture
- Check Modules - Detailed check documentation
Table of Contents
- Architecture
- Project Structure
- Key Components
- Installation
- Environment Configuration
- Configuration
- Usage
- Check Modules
- API Integrations
- Security Considerations
- Development
- Troubleshooting
Architecture
The HMQC system follows a modular, profile-based architecture:
┌─────────────┐
│  Launcher   │ (CLI or Box Hotfolder)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  QC Module  │ (Core Engine)
└──────┬──────┘
       │
       ├──► Profile JSON (defines checks)
       │
       ├──► Check Modules (modular validation)
       │    ├─ HM_parse
       │    ├─ HM_filename_parse
       │    ├─ HM_imprint_check
       │    ├─ HM_language_validate
       │    ├─ HM_price_currency_check
       │    └─ HM_censorship
       │
       └──► HTML Reporter (generates reports)
Core Principles
- Modularity: Each check is an independent module implementing a standard interface
- Context Sharing: Checks share data via a context dictionary for inter-check dependencies
- Profile-Based Configuration: JSON profiles define which checks to run and their parameters
- Standardized Results: All checks return consistent status objects (passed, error, skipped)
Project Structure
hm_qc/
├── qc_module.py # Core QC engine
├── config.py # Environment configuration (NEW)
├── requirements.txt # Python dependencies
├── README.md # This file
├── CLAUDE.md # Project instructions for Claude
├── DEV_SETUP.md # Development environment guide (NEW)
│
├── launchers/ # Execution entry points
│ ├── HM_launcher_CLI.py # Command-line interface (environment-aware)
│ └── ford_qc_box_hotfolder_process.py # Box integration
│
├── checks/ # QC check modules
│ ├── HM_parse.py # PDF parsing with LlamaParse
│ ├── HM_filename_parse.py # PDF filename validation
│ ├── HM_imprint_check.py # Imprint verification (includes country code)
│ ├── HM_image_parse.py # Image loading and processing (NEW)
│ ├── HM_image_filename_parse.py # Image filename parsing (NEW)
│ ├── HM_image_dimension_check.py # Image dimension validation (NEW)
│ ├── HM_language_validate.py # Language detection
│ ├── HM_price_currency_check.py # Currency validation
│ ├── HM_censorship.py # Censorship compliance - CEN only (UPDATED)
│ ├── analyze_with_gpt.py # OpenAI GPT integration
│ ├── html_reporter.py # HTML report generation (environment-aware)
│ ├── business_data_check.py
│ ├── colour_existence_check.py
│ ├── file_size_check.py
│ ├── image_linking_check.py
│ ├── image_resolution_check.py
│ ├── missing_images_check.py
│ ├── special_requirements_mec_bau.py
│ └── unzip_and_verify_check.py
│
├── profiles/ # Check configurations
│ ├── HM.json # H&M PDF profile
│ ├── HM_image.json # H&M image profile (NEW)
│ └── ford_bnp.json # Ford BNP profile
│
├── supporting/ # Supporting assets
│ ├── censorship_trainset/ # Training images for censorship detection
│ └── HM_Pricing SLUSSEN_30-07-2024.pdf
│
├── tmp/ # Development temporary files (gitignored)
│ ├── HM_working/ # Working directory (dev mode)
│ └── reports/ # Generated reports (dev mode)
│
└── input_bucket/ # Input file staging area
Key Components
1. QC Module (qc_module.py)
The core engine that orchestrates check execution.
Key Functions:
- run_qc_profile(profile_path, input_file): Executes all checks defined in a profile
- run_single_check(script, config, context, check_id): Runs an individual check module
- run_qc_checks(profile_path, input_file, report_path): Full workflow including report generation
Context Management:
Results from each check are stored in a shared context dictionary, enabling downstream checks to access data from upstream checks.
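The context-sharing flow described above can be sketched as follows. This is a minimal illustration, not the code in qc_module.py: the function name run_profile, the "input_file" context key, and the two in-memory demo modules are assumptions made so the example runs on its own.

```python
import importlib
import sys
import types


def run_profile(profile: list, input_file: str) -> dict:
    """Run each check in order, sharing one context dict between them."""
    context = {"input_file": input_file}  # assumed key name
    results = {}
    for entry in profile:
        check_id = entry["id"]
        try:
            module = importlib.import_module(entry["script"])
            results[check_id] = module.run_check(
                entry.get("config", {}), context, check_id
            )
        except Exception as exc:  # keep the chain alive if one check crashes
            results[check_id] = {"status": "error", "error_message": str(exc)}
    return results


# Two stand-in check modules, registered in memory so the sketch runs anywhere.
def _upstream(config, context, check_id):
    context[check_id] = {"filename": context["input_file"].rsplit("/", 1)[-1]}
    return {"status": "passed", "details": context[check_id]}


def _downstream(config, context, check_id):
    filename = context.get("demo_parse", {}).get("filename")
    if filename is None:
        return {"status": "skipped", "details": {"reason": "no upstream data"}}
    return {"status": "passed", "details": {"seen": filename}}


for name, func in (("demo_parse", _upstream), ("demo_use", _downstream)):
    demo_module = types.ModuleType(name)
    demo_module.run_check = func
    sys.modules[name] = demo_module

demo_profile = [
    {"id": "demo_parse", "script": "demo_parse", "config": {}},
    {"id": "demo_use", "script": "demo_use", "config": {}},
]
demo_results = run_profile(demo_profile, "/data/sample_en-us.pdf")
```

The downstream check reads what the upstream check wrote under its own id, which is the dependency mechanism the profile order relies on.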
2. Launchers
CLI Launcher (HM_launcher_CLI.py)
Command-line interface for manual QC execution.
Usage:
python launchers/HM_launcher_CLI.py <path_to_input_file> <path_to_save_report_html>
Box Hotfolder Launcher (ford_qc_box_hotfolder_process.py)
Automated processing of files uploaded to Box folders.
Features:
- Monitors Box folder for new .zip files
- Downloads, processes, and deletes files
- Uploads QC reports back to Box
- File locking to prevent concurrent runs
3. Check Module Pattern
All check modules implement a standard interface:
def run_check(config: dict, context: dict, check_id: str) -> dict:
"""
Args:
config: Configuration parameters from profile JSON
context: Shared context dictionary for inter-check data
check_id: Unique identifier for this check
Returns:
{
"status": "passed" | "error" | "skipped",
"details": {...},
"error_message": "..." (if status == "error")
}
"""
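As an illustration of the interface, here is a minimal sketch of a check that returns each of the three statuses. The max_bytes parameter and the "input_file" context key are hypothetical and are not taken from the existing check modules.

```python
import os
import tempfile


def run_check(config: dict, context: dict, check_id: str) -> dict:
    """Hypothetical size check: fails when the input file exceeds max_bytes."""
    input_file = context.get("input_file")  # assumed context key
    if not input_file or not os.path.isfile(input_file):
        return {"status": "skipped", "details": {"reason": "no input file in context"}}

    size = os.path.getsize(input_file)
    limit = config.get("max_bytes", 10_000_000)  # assumed parameter name
    context[check_id] = {"size_bytes": size}  # expose the result to later checks

    if size <= limit:
        return {"status": "passed", "details": {"size_bytes": size, "limit": limit}}
    return {
        "status": "error",
        "details": {"size_bytes": size, "limit": limit},
        "error_message": f"File is {size} bytes, limit is {limit}",
    }


# Demo on a throwaway 5-byte file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as handle:
    handle.write(b"12345")
demo_context = {"input_file": handle.name}
ok = run_check({"max_bytes": 10}, demo_context, "size_check")
too_big = run_check({"max_bytes": 3}, demo_context, "size_check")
missing = run_check({}, {}, "size_check")
os.remove(handle.name)
```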
4. HTML Reporter (html_reporter.py)
Generates Bootstrap-based HTML reports with:
- Collapsible accordion for each check
- Color-coded status badges
- Nested detail formatting
- Error highlighting
Installation
Prerequisites
- Python 3.8+
- pip package manager
- OpenAI API key
- LlamaParse API key
Setup
1. Clone the repository:
   cd /path/to/deployment
2. Install dependencies:
   pip install -r requirements.txt
3. Configure API keys (see Security Considerations):
   - OpenAI API key for GPT analysis
   - LlamaParse API key for PDF parsing
4. Set up paths:
   Update hardcoded paths in launcher scripts:
   # HM_launcher_CLI.py
   PROFILE_PATH = "/opt/QC/profiles/HM.json"
   # qc_module.py
   REPORTS_DIR = os.path.join(os.path.dirname(__file__), "reports")
5. Create working directories:
   mkdir -p /tmp/HM_working
   mkdir -p /opt/QC/reports
Environment Configuration
The HMQC system supports two environments to enable safe local development without affecting production:
Development Environment
For local testing and development:
# Set environment variable
export HM_QC_ENV=dev
# Run checks (uses local paths in ./tmp/)
python launchers/HM_launcher_CLI.py test.pdf ./tmp/reports/report.html
Development paths:
- Working directory: ./tmp/HM_working/
- Reports: ./tmp/reports/
- Supporting files: ./supporting/
Production Environment
For deployed production systems (default):
# Unset or use production mode
unset HM_QC_ENV
# or
export HM_QC_ENV=production
# Run checks (uses /opt/QC paths)
python launchers/HM_launcher_CLI.py input.pdf /opt/QC/reports/
Production paths:
- Working directory: /tmp/HM_working/
- Reports: /opt/QC/reports/
- Supporting files: /opt/QC/supporting/
Configuration Module
The config.py module automatically handles environment detection:
import config
# Check current environment
print(config.ENVIRONMENT) # 'dev' or 'production'
# Get paths
print(config.WORKING_DIR)
print(config.REPORTS_DIR)
See DEV_SETUP.md for complete development environment guide.
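For orientation, environment detection of this kind might look like the sketch below. It is not the contents of config.py: the function name resolve_paths and the rule that any value other than "dev" falls back to production are assumptions, while the paths themselves are the ones listed above.

```python
import os


def resolve_paths(env_value=None, repo_root="."):
    """Return a dict of environment name and directories for the given mode."""
    if env_value is None:
        env_value = os.environ.get("HM_QC_ENV", "production")
    if env_value == "dev":
        return {
            "ENVIRONMENT": "dev",
            "WORKING_DIR": os.path.join(repo_root, "tmp", "HM_working"),
            "REPORTS_DIR": os.path.join(repo_root, "tmp", "reports"),
            "SUPPORTING_DIR": os.path.join(repo_root, "supporting"),
        }
    # Assumption: anything other than "dev" is treated as production.
    return {
        "ENVIRONMENT": "production",
        "WORKING_DIR": "/tmp/HM_working",
        "REPORTS_DIR": "/opt/QC/reports",
        "SUPPORTING_DIR": "/opt/QC/supporting",
    }


dev_paths = resolve_paths("dev")
prod_paths = resolve_paths("production")
```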
Configuration
Profile JSON Structure
Profiles define the sequence of checks and their parameters:
[
{
"id": "HM_parse",
"script": "checks.HM_parse",
"config": {
"description": "Parses PDF with LlamaParse",
"input_file": "supplied by launcher script",
"working_dir": "/tmp/HM_working"
}
},
{
"id": "HM_filename_parse",
"script": "checks.HM_filename_parse",
"config": {
"description": "Parses filename into components",
"working_dir": "/tmp/HM_working"
}
}
]
Key Fields:
- id: Unique check identifier (used for context storage)
- script: Python module path (e.g., checks.HM_parse)
- config: Check-specific parameters
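A profile can be sanity-checked before any check runs. The sketch below is one possible approach, not part of the existing code: load_profile and its error messages are hypothetical, and the only rules it enforces are the three documented fields plus unique ids.

```python
import json

REQUIRED_KEYS = ("id", "script", "config")


def load_profile(text: str) -> list:
    """Parse profile JSON and check each entry has the documented fields."""
    profile = json.loads(text)
    if not isinstance(profile, list):
        raise ValueError("profile must be a JSON array of check entries")
    seen_ids = set()
    for index, entry in enumerate(profile):
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            raise ValueError(f"entry {index} is missing: {', '.join(missing)}")
        if entry["id"] in seen_ids:
            raise ValueError(f"duplicate check id: {entry['id']}")
        seen_ids.add(entry["id"])
    return profile


sample_profile = load_profile("""
[
  {"id": "HM_parse", "script": "checks.HM_parse",
   "config": {"working_dir": "/tmp/HM_working"}},
  {"id": "HM_filename_parse", "script": "checks.HM_filename_parse",
   "config": {"working_dir": "/tmp/HM_working"}}
]
""")
```

Rejecting duplicate ids matters because the id doubles as the context key, so a repeated id would overwrite an earlier check's data.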
H&M Filename Format
The system expects H&M filenames to follow these patterns:
Pattern 1: dimensions_format_year_reference-number_language-country.pdf
- Example: 21.6x27.9cm_letter_2028_10062-01_en-us.pdf
Pattern 2: dimensions_format_reference-number_(GEN|CEN).pdf
- Example: 10.8x14cm_quarter_letter_1001D_10004-02_GEN.pdf
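The two patterns can be approximated with regular expressions, as sketched below. This is illustrative only: HM_filename_parse uses GPT rather than regexes, and the expressions here are assumptions fitted to the two examples above. Because the format segment may itself contain underscores, Pattern 2 keeps everything between the dimensions and the GEN/CEN suffix as one unsplit "body" field.

```python
import re

# Pattern 1: dimensions_format_year_reference-number_language-country.pdf
LOCALISED = re.compile(
    r"^(?P<dimensions>[\d.]+x[\d.]+cm)_(?P<format>.+?)_(?P<year>\d{4})_"
    r"(?P<reference>.+)_(?P<language>[a-z]{2}-[A-Za-z]{2})\.pdf$"
)
# Pattern 2: dimensions_format_reference-number_(GEN|CEN).pdf
GENERIC = re.compile(
    r"^(?P<dimensions>[\d.]+x[\d.]+cm)_(?P<body>.+)_(?P<market>GEN|CEN)\.pdf$"
)


def parse_pdf_filename(name: str):
    """Return the matched fields plus the pattern name, or None if neither fits."""
    for pattern_name, pattern in (("pattern_1", LOCALISED), ("pattern_2", GENERIC)):
        match = pattern.match(name)
        if match:
            return {"pattern": pattern_name, **match.groupdict()}
    return None


parsed_1 = parse_pdf_filename("21.6x27.9cm_letter_2028_10062-01_en-us.pdf")
parsed_2 = parse_pdf_filename("10.8x14cm_quarter_letter_1001D_10004-02_GEN.pdf")
```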
Usage
Basic CLI Usage
python launchers/HM_launcher_CLI.py input.pdf /path/to/reports/
This will:
- Execute all checks defined in /opt/QC/profiles/HM.json
- Generate an HTML report in the specified directory
- Print JSON results to stdout
Box Hotfolder Integration
python launchers/ford_qc_box_hotfolder_process.py
Configuration:
- Modify BOX_CLI_CONFIG_PATH to point to your Box JWT config
- Set SOURCE_FOLDER_ID and REPORT_FOLDER_ID
- Update PROFILE_PATH to your desired profile
Workflow:
- Script polls Box source folder for .zip files
- Downloads and processes each file
- Generates QC report
- Uploads report to Box reports folder
- Deletes processed files
Check Modules
The HMQC system includes separate check profiles for PDF documents and static images.
PDF Check Modules
HM_parse
Purpose: Parses PDF using LlamaParse API
Context Output:
- extracted_text: Full text content
- parsed_image: First page as PIL Image
- all_images: List of all page images
Dependencies: LlamaParse API
HM_filename_parse
Purpose: Extracts components from H&M PDF filename using GPT
Context Input: HM_parse.filename
Context Output:
{
"parsed": {
"dimensions": "21.6x27.9cm",
"format": "letter",
"year": "2028",
"reference": "9000_10107-06", # Full reference with prefix
"language": "en-us"
}
}
Updated: Now correctly extracts full reference codes including numeric prefixes (e.g., 9000_10107-06)
HM_imprint_check
Purpose: Verifies imprint/reference code matches filename including country code
Context Input:
- HM_parse.extracted_text
- HM_filename_parse.parsed.reference
- HM_filename_parse.parsed.language
Logic:
- Uses GPT to detect imprint in document
- Combines reference code with language/country code (e.g., 9000_10107-06_el-CY)
- Compares against detected imprint
- Skips for "OOH" (out-of-home) files
Updated: Now includes country code in validation to detect mismatches like CY vs GR
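The comparison step can be sketched as below. The helper names and the case-insensitive comparison are assumptions for illustration; the actual module detects the imprint with GPT and may normalize differently.

```python
def expected_imprint(reference: str, language: str) -> str:
    """Join the reference code and the language/country code with an underscore."""
    return f"{reference}_{language}"


def imprint_matches(reference: str, language: str, detected: str) -> bool:
    """Compare the expected and detected imprint, ignoring case and outer spaces."""
    # Assumption: filename codes may be lowercase (el-cy) while the printed
    # imprint uses an uppercase country (el-CY), so compare case-insensitively.
    return expected_imprint(reference, language).strip().lower() == detected.strip().lower()


same_country = imprint_matches("9000_10107-06", "el-cy", "9000_10107-06_el-CY")
wrong_country = imprint_matches("9000_10107-06", "el-cy", "9000_10107-06_el-GR")
```

Including the country code in the expected string is what lets a CY asset carrying a GR imprint be reported as a mismatch.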
Image Check Modules
HM_image_parse
Purpose: Loads and processes static image files (JPG, PNG, PSD)
Context Output:
- parsed_image: PIL Image object
- filename: Original filename
- image_format: Format (JPEG, PNG, PSD)
- image_size: (width, height) in pixels
- image_mode: Color mode (RGB, RGBA, etc.)
HM_image_filename_parse
Purpose: Parses image filename using pattern matching
Context Input: HM_image_parse.filename
Context Output:
{
"parsed": {
"language": "en-US",
"format_type": "DOOH",
"campaign_number": "4045",
"dimensions": "1080x1920"
}
}
Supported Formats:
- SOME STATIC: Market_Language_campaignnumber_...
- DOOH/OOH/Display: CampaignNumber_Type_Static_..._Size_Language-Market
- POS GEN: Size_Format_Campaign_POPNumber_GEN
- POS Country: Size_Format_Campaign_POPNumber_Language-Market
- DS: Campaign_Name_Index_BU_Resolution_language-COUNTRY
HM_image_dimension_check (NEW)
Purpose: Validates actual image dimensions match filename specification
Context Input:
- HM_image_parse.parsed_image
- HM_image_filename_parse.parsed.dimensions
Logic:
- Extracts expected dimensions from filename (e.g., 1200x400)
- Gets actual image dimensions from PIL Image object
- Compares width × height in pixels
Example:
- Filename: campaign_1200x400_en-US.jpg
- Expected: 1200px × 400px
- Actual: 1200px × 400px → ✅ PASS
- Actual: 1500px × 600px → ❌ ERROR
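The comparison in the example above can be sketched as follows. The function name check_dimensions is hypothetical. It takes the actual size as a (width, height) tuple, which is the shape of a PIL Image's size attribute, so the sketch runs without Pillow installed.

```python
import re


def check_dimensions(expected: str, actual_size: tuple) -> dict:
    """Compare a 'WIDTHxHEIGHT' string from the filename with the real pixel size."""
    match = re.fullmatch(r"(\d+)x(\d+)", expected.strip().lower())
    if match is None:
        return {"status": "skipped", "details": {"reason": f"unparseable dimensions: {expected!r}"}}
    expected_size = (int(match.group(1)), int(match.group(2)))
    details = {"expected": expected_size, "actual": tuple(actual_size)}
    if expected_size == tuple(actual_size):
        return {"status": "passed", "details": details}
    return {
        "status": "error",
        "details": details,
        "error_message": (
            f"Expected {expected_size[0]}x{expected_size[1]} px, "
            f"got {actual_size[0]}x{actual_size[1]} px"
        ),
    }


match_result = check_dimensions("1200x400", (1200, 400))
mismatch_result = check_dimensions("1200x400", (1500, 600))
```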
Shared Check Modules
These checks work for both PDFs and images:
HM_language_validate
Purpose: Validates document/image language matches filename code
Context Input:
- HM_parse.extracted_text OR HM_image_parse.parsed_image
- HM_filename_parse.parsed.language OR HM_image_filename_parse.parsed.language
Context Output:
- detected_language: ISO language code (e.g., "en-US")
- matches: Boolean
- isCensorshipRequired: True for CEN markets
Special Cases:
- CEN: Censored market (requires body coverage checks)
- GEN: General market (no censorship required)
HM_price_currency_check
Purpose: Validates currency matches regional expectations
Context Input:
- Content (text or image)
- Language/market code
Logic:
- Detects currency and price using multimodal GPT analysis
- Validates currency against region (e.g., USD for en-us)
- Skips for CEN/GEN markets
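The region validation step might be structured like the sketch below. The lookup table and the function name check_currency are assumptions for illustration. Only the en-us to USD pairing comes from this document, and the real check obtains the detected currency from GPT analysis.

```python
# Illustrative table only; the real mapping lives in the check module.
EXPECTED_CURRENCY = {
    "en-us": "USD",
    "en-gb": "GBP",
    "sv-se": "SEK",
    "de-de": "EUR",
}


def check_currency(market_code: str, detected_currency: str) -> dict:
    """Validate a detected currency code against the market's expected currency."""
    if market_code.upper() in ("CEN", "GEN"):
        return {"status": "skipped", "details": {"reason": "CEN/GEN assets are not price-checked"}}
    expected = EXPECTED_CURRENCY.get(market_code.lower())
    if expected is None:
        return {"status": "skipped", "details": {"reason": f"no currency rule for {market_code}"}}
    details = {"expected": expected, "detected": detected_currency.upper()}
    if details["detected"] == expected:
        return {"status": "passed", "details": details}
    return {
        "status": "error",
        "details": details,
        "error_message": f"Expected {expected} for {market_code}, found {details['detected']}",
    }


usd_ok = check_currency("en-us", "usd")
usd_wrong = check_currency("en-us", "EUR")
gen_skip = check_currency("GEN", "USD")
```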
HM_censorship
Purpose: Verifies proper clothing coverage for CEN (censored) markets ONLY
Context Input:
- parsed_image (from HM_parse or HM_image_parse)
- parsed.language (from filename parse)
Process:
- SKIPS for GEN files - GEN assets do not require censorship checks
- SKIPS for standard market files - Only CEN files are checked
- RUNS for CEN files only - Validates body coverage using AI
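The three rules above reduce to a single gate on the market code. The sketch below assumes the code reaches the check as a plain string such as "CEN", "GEN", or "en-us"; the function name censorship_gate is hypothetical.

```python
def censorship_gate(market_code: str) -> dict:
    """Decide whether the censorship check should run for this market code."""
    code = (market_code or "").strip().upper()
    if code == "CEN":
        return {"run": True, "reason": "CEN asset: body coverage must be validated"}
    if code == "GEN":
        return {"run": False, "reason": "GEN asset: censorship check not required"}
    return {"run": False, "reason": f"standard market '{market_code}': only CEN files are checked"}


cen_decision = censorship_gate("CEN")
gen_decision = censorship_gate("GEN")
market_decision = censorship_gate("en-us")
```

A check built on this gate would return a "skipped" status with the reason whenever run is False.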
Algorithm:
- Trains DSPy-based classifier using training images
- Analyzes document images for body coverage
- Validates against censorship requirements
Training Set: /supporting/censorship_trainset/
- Censored images: *-C.png
- Uncensored images: *-U.png
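Labels can be derived from this naming convention before training. The sketch below only shows the labelling step and does not involve DSPy; the function name and the example filenames are hypothetical.

```python
from pathlib import Path


def label_training_images(filenames) -> list:
    """Return (filename, label) pairs, ignoring files that break the convention."""
    labelled = []
    for name in filenames:
        path = Path(name)
        if path.suffix.lower() != ".png":
            continue
        if path.stem.endswith("-C"):
            labelled.append((name, "censored"))
        elif path.stem.endswith("-U"):
            labelled.append((name, "uncensored"))
    return labelled


labels = label_training_images(["look01-C.png", "look02-U.png", "notes.txt", "look03.png"])
```

In practice the filename list could come from Path(training_dir).glob("*.png").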
Technology: Uses DSPy with MIPROv2 optimization
Updated: Now correctly only runs checks on CEN files, not GEN files
API Integrations
1. LlamaParse
Purpose: PDF parsing and image extraction
Usage:
- Text extraction: result_type="text"
- Image generation: result_type="markdown" with multimodal model
Key in Code: checks/HM_parse.py:14
2. OpenAI GPT-4o
Purpose: Content analysis, language detection, validation
Usage:
- Synchronous client in analyze_with_gpt.py
- Multimodal support (text + images)
- JSON and text response modes
Models Used:
- gpt-4o: Primary analysis model
- gpt-4o-mini: Used in DSPy for censorship detection
Key in Code: checks/analyze_with_gpt.py:13
3. DSPy
Purpose: AI-powered censorship detection
Components:
- ImageDescription: Generates clothing coverage descriptions
- ImageCensorshipDetection: Classifies censorship status
- MIPROv2: Optimizes prompts based on training set
Key in Code: checks/HM_censorship.py:11
4. Box Python SDK
Purpose: Cloud storage integration for automated workflows
Authentication: JWT-based service account
Configuration File: ford_box_config.json
Security Considerations
CRITICAL ISSUES IDENTIFIED
⚠️ HARDCODED API KEYS FOUND ⚠️
The following files contain hardcoded API keys that should be immediately addressed:
- checks/HM_parse.py:14
  os.environ['LLAMA_CLOUD_API_KEY'] = '<REDACTED>'
- checks/analyze_with_gpt.py:13
  client = OpenAI(api_key="<REDACTED>")
- checks/HM_censorship.py:11
  os.environ["OPENAI_API_KEY"] = "<REDACTED>"
The key values are redacted here; treat all three keys as compromised.
Recommended Security Improvements
1. Use Environment Variables:
   # Replace hardcoded keys with:
   import os
   OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
   LLAMA_CLOUD_API_KEY = os.getenv('LLAMA_CLOUD_API_KEY')
2. Use .env Files (with python-dotenv):
   import os
   from dotenv import load_dotenv
   load_dotenv()
   OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
3. Secret Management Systems:
   - AWS Secrets Manager
   - HashiCorp Vault
   - Azure Key Vault
4. Rotate Exposed Keys:
   - All hardcoded keys should be rotated immediately
   - Monitor API usage for unauthorized access
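Because os.getenv returns None when a variable is unset, a missing key may only surface later as an authentication failure. One option is a small fail-fast helper, sketched below. The name require_env and the HMQC_DEMO_KEY variable are hypothetical.

```python
import os


def require_env(name: str) -> str:
    """Return the variable's value or raise a clear error if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running QC checks")
    return value


# Demo with a placeholder variable rather than a real API key.
os.environ["HMQC_DEMO_KEY"] = "placeholder-value"
demo_key = require_env("HMQC_DEMO_KEY")
```

A check module could then call require_env("OPENAI_API_KEY") once at import time in place of a hardcoded assignment.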
Additional Security Considerations
- File System Access: The system writes to /tmp/HM_working and /opt/QC/reports/ - ensure proper permissions
- Box Authentication: JWT config contains sensitive credentials - protect ford_box_config.json
- Working Directory Cleanup: Temporary files contain potentially sensitive document content
- Context Snapshots: Be cautious about logging context data (disabled by default in qc_module.py:117)
Development
Adding a New Check Module
1. Create check file in checks/ directory:
   # checks/my_new_check.py
   def run_check(config: dict, context: dict, check_id: str) -> dict:
       # 1. Get required data from context
       input_data = context.get("previous_check", {})
       # 2. Perform validation logic (validate_something is a placeholder for your own function)
       result = validate_something(input_data)
       # 3. Store results in context
       context[check_id] = {
           "validation_result": result
       }
       # 4. Return status
       if result:
           return {"status": "passed", "details": {...}}
       else:
           return {"status": "error", "error_message": "..."}
2. Add to profile JSON:
   {
       "id": "my_new_check",
       "script": "checks.my_new_check",
       "config": {
           "description": "Validates something important",
           "param1": "value1"
       }
   }
Testing
Manual Testing:
python launchers/HM_launcher_CLI.py test_file.pdf ./test_reports/
Context Debugging:
Uncomment line 117 in qc_module.py to enable context snapshots:
aggregated_results["context_snapshot"] = context
Troubleshooting
Common Issues
1. "Module not found" errors
- Ensure /opt/QC is in your Python path
- Check sys.path.append() statements in launchers
2. "Working directory not found"
- Create /tmp/HM_working manually: mkdir -p /tmp/HM_working
- Verify permissions on working directories
3. API authentication failures
- Verify API keys are valid and not expired
- Check API usage limits/quotas
- Ensure proper environment variable configuration
4. Box integration issues
- Validate JWT config file path
- Ensure service account has folder access
- Check Box folder IDs are correct
5. PDF parsing failures
- Verify LlamaParse API key and quota
- Check PDF is not corrupted or password-protected
- Ensure sufficient disk space in working directory
6. Censorship check failures
- Verify training set images exist in /opt/QC/supporting/censorship_trainset/
- Ensure images follow naming convention (*-C.png for censored, *-U.png for uncensored)
- Check DSPy model configuration
Logging
CLI Launcher: Uses Python logging module at INFO level
Box Hotfolder: Logs to log/ford_qc_script.log
Increasing Verbosity:
logging.basicConfig(level=logging.DEBUG)
Dependencies
Core Dependencies
- Python 3.8+
- llama-parse: PDF parsing and image extraction
- openai: GPT-4o API integration
- dspy: AI-powered decision systems
- PIL/Pillow: Image processing
- boxsdk: Box cloud storage integration
- nest_asyncio: Async event loop management
Full Dependency List
See requirements.txt for complete list including:
- fastapi, uvicorn, httpx (web framework components)
- pandas, numpy (data processing)
- nltk, tiktoken (text processing)
- APScheduler (task scheduling)
- datasets, optuna (ML/AI components)
System Requirements
Hardware
- CPU: Multi-core recommended for DSPy optimization
- RAM: 4GB minimum, 8GB+ recommended
- Storage: 500MB + space for temporary files
Operating System
- Linux (Ubuntu 20.04+)
- macOS (tested on Darwin 24.5.0)
- Windows (with path adjustments)
Python Version
- Python 3.8 - 3.11 (tested)
- Not tested with Python 3.12+
Roadmap / Known Limitations
Current Limitations
- ✅ Hardcoded file paths - FIXED with environment configuration (config.py)
- Hardcoded API keys in source code
- Limited error recovery in check chains
- No parallel check execution
- Box hotfolder requires manual startup
Recent Improvements
- ✅ Environment Configuration - Dev/production mode support
- ✅ Image QC Support - Full validation for JPG, PNG, PSD files
- ✅ Image Dimension Check - Validates pixel dimensions against filename
- ✅ Enhanced Imprint Check - Now includes country code validation
- ✅ Fixed Censorship Rules - CEN-only checking (GEN files skip)
Future Improvements
- Parallel check execution where possible
- Retry logic for API failures
- Real-time progress reporting
- Dashboard for QC history
- Support for additional file formats
- Externalize API keys to environment variables
License
[Specify license here]
Contact / Support
For issues or questions:
- Check existing documentation in CLAUDE.md
- Review code comments in individual modules
- Contact development team
Appendix: File Reference
Critical Files
| File | Purpose | Contains Sensitive Data |
|---|---|---|
| qc_module.py | Core orchestration engine | No |
| checks/analyze_with_gpt.py | OpenAI integration | ⚠️ API Key |
| checks/HM_parse.py | PDF parsing | ⚠️ API Key |
| checks/HM_censorship.py | Censorship detection | ⚠️ API Key |
| launchers/ford_qc_box_hotfolder_process.py | Box integration | ⚠️ Box Config Path |
| profiles/HM.json | H&M check configuration | No |
| ford_box_config.json | Box JWT credentials | ⚠️ Yes |
Last Updated: 2025-01-12
Version: 1.2
Maintainer: H&M QC Team
Recent Updates (v1.2)
- Added development environment configuration
- Implemented image QC checks (JPG, PNG, PSD support)
- Added image dimension validation
- Enhanced imprint check with country code validation
- Fixed censorship rules (CEN-only, GEN skip)
- Environment-aware path configuration