
H&M Quality Control (HMQC) System

Overview

The H&M Quality Control (HMQC) system is a modular Python application designed to perform automated quality control checks on both PDF documents and static images (JPG, PNG, PSD) for H&M marketing assets. The system validates assets against multiple criteria including filename formatting, image dimensions, imprint verification, language validation, pricing accuracy, and censorship requirements.


Architecture

The HMQC system follows a modular, profile-based architecture:

┌─────────────┐
│  Launcher   │ (CLI or Box Hotfolder)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  QC Module  │ (Core Engine)
└──────┬──────┘
       │
       ├──► Profile JSON (defines checks)
       │
       ├──► Check Modules (modular validation)
       │       ├─ HM_parse
       │       ├─ HM_filename_parse
       │       ├─ HM_imprint_check
       │       ├─ HM_language_validate
       │       ├─ HM_price_currency_check
       │       └─ HM_censorship
       │
       └──► HTML Reporter (generates reports)

Core Principles

  1. Modularity: Each check is an independent module implementing a standard interface
  2. Context Sharing: Checks share data via a context dictionary for inter-check dependencies
  3. Profile-Based Configuration: JSON profiles define which checks to run and their parameters
  4. Standardized Results: All checks return consistent status objects (passed, error, skipped)

Project Structure

hm_qc/
├── qc_module.py              # Core QC engine
├── config.py                 # Environment configuration (NEW)
├── requirements.txt          # Python dependencies
├── README.md                 # This file
├── CLAUDE.md                 # Project instructions for Claude
├── DEV_SETUP.md              # Development environment guide (NEW)
│
├── launchers/                # Execution entry points
│   ├── HM_launcher_CLI.py    # Command-line interface (environment-aware)
│   └── ford_qc_box_hotfolder_process.py  # Box integration
│
├── checks/                   # QC check modules
│   ├── HM_parse.py           # PDF parsing with LlamaParse
│   ├── HM_filename_parse.py  # PDF filename validation
│   ├── HM_imprint_check.py   # Imprint verification (includes country code)
│   ├── HM_image_parse.py     # Image loading and processing (NEW)
│   ├── HM_image_filename_parse.py  # Image filename parsing (NEW)
│   ├── HM_image_dimension_check.py # Image dimension validation (NEW)
│   ├── HM_language_validate.py     # Language detection
│   ├── HM_price_currency_check.py  # Currency validation
│   ├── HM_censorship.py      # Censorship compliance - CEN only (UPDATED)
│   ├── analyze_with_gpt.py   # OpenAI GPT integration
│   ├── html_reporter.py      # HTML report generation (environment-aware)
│   ├── business_data_check.py
│   ├── colour_existence_check.py
│   ├── file_size_check.py
│   ├── image_linking_check.py
│   ├── image_resolution_check.py
│   ├── missing_images_check.py
│   ├── special_requirements_mec_bau.py
│   └── unzip_and_verify_check.py
│
├── profiles/                 # Check configurations
│   ├── HM.json               # H&M PDF profile
│   ├── HM_image.json         # H&M image profile (NEW)
│   └── ford_bnp.json         # Ford BNP profile
│
├── supporting/               # Supporting assets
│   ├── censorship_trainset/  # Training images for censorship detection
│   └── HM_Pricing SLUSSEN_30-07-2024.pdf
│
├── tmp/                      # Development temporary files (gitignored)
│   ├── HM_working/           # Working directory (dev mode)
│   └── reports/              # Generated reports (dev mode)
│
└── input_bucket/             # Input file staging area

Key Components

1. QC Module (qc_module.py)

The core engine that orchestrates check execution.

Key Functions:

  • run_qc_profile(profile_path, input_file): Executes all checks defined in a profile
  • run_single_check(script, config, context, check_id): Runs an individual check module
  • run_qc_checks(profile_path, input_file, report_path): Full workflow including report generation

Context Management: Results from each check are stored in a shared context dictionary, enabling downstream checks to access data from upstream checks.
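The context-sharing flow described above can be pictured with a minimal sketch. This is not the code in qc_module.py: the function name `run_profile_sketch`, the in-memory `registry` (standing in for importing each `script` module), and the two toy checks are assumptions made purely for illustration.

```python
from typing import Callable, Dict, List

# Signature shared by all checks: (config, context, check_id) -> result dict
CheckFn = Callable[[dict, dict, str], dict]


def run_profile_sketch(profile: List[dict], registry: Dict[str, CheckFn]) -> dict:
    """Run each profile entry in order, sharing one context dict (hypothetical helper)."""
    context: dict = {}
    results: dict = {}
    for entry in profile:
        check_id = entry["id"]
        check_fn = registry[entry["script"]]  # a real engine would import the module instead
        try:
            results[check_id] = check_fn(entry.get("config", {}), context, check_id)
        except Exception as exc:  # one failing check should not abort the whole run
            results[check_id] = {"status": "error", "error_message": str(exc)}
    return results


def toy_parse(config: dict, context: dict, check_id: str) -> dict:
    # Upstream check: publishes data for later checks under its own id
    context[check_id] = {"extracted_text": "hello"}
    return {"status": "passed", "details": {}}


def toy_length(config: dict, context: dict, check_id: str) -> dict:
    # Downstream check: reads what the upstream check stored
    text = context.get("toy_parse", {}).get("extracted_text")
    if text is None:
        return {"status": "skipped", "details": {"reason": "no upstream text"}}
    return {"status": "passed", "details": {"length": len(text)}}


demo_profile = [
    {"id": "toy_parse", "script": "toy_parse", "config": {}},
    {"id": "toy_length", "script": "toy_length", "config": {}},
]
demo_results = run_profile_sketch(
    demo_profile, {"toy_parse": toy_parse, "toy_length": toy_length}
)
```

Because the downstream check looks its input up by the upstream check's id, the order of entries in the profile matters; a check that runs before its dependency would see an empty context and report itself as skipped.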

2. Launchers

CLI Launcher (HM_launcher_CLI.py)

Command-line interface for manual QC execution.

Usage:

python launchers/HM_launcher_CLI.py <path_to_input_file> <path_to_save_report_html>

Box Hotfolder Launcher (ford_qc_box_hotfolder_process.py)

Automated processing of files uploaded to Box folders.

Features:

  • Monitors Box folder for new .zip files
  • Downloads, processes, and deletes files
  • Uploads QC reports back to Box
  • File locking to prevent concurrent runs
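The README does not say how the launcher implements its file lock. One common approach, shown here only as a sketch, is an exclusive lock file created with `os.open`; the helper name `single_instance_lock` and the lock path are hypothetical.

```python
import os
from contextlib import contextmanager


@contextmanager
def single_instance_lock(lock_path: str):
    """Hold an exclusive lock file for the duration of the block (hypothetical helper)."""
    try:
        # O_EXCL makes creation fail if the file already exists
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError(f"another run appears to hold {lock_path}")
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)
```

A launcher could wrap its polling loop in `with single_instance_lock("/tmp/hmqc.lock"):` (the path is an example). If a run crashes hard, the stale lock file has to be removed by hand, which is the usual trade-off of this simple scheme.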

3. Check Module Pattern

All check modules implement a standard interface:

def run_check(config: dict, context: dict, check_id: str) -> dict:
    """
    Args:
        config: Configuration parameters from profile JSON
        context: Shared context dictionary for inter-check data
        check_id: Unique identifier for this check

    Returns:
        {
            "status": "passed" | "error" | "skipped",
            "details": {...},
            "error_message": "..." (if status == "error")
        }
    """

4. HTML Reporter (html_reporter.py)

Generates Bootstrap-based HTML reports with:

  • Collapsible accordion for each check
  • Color-coded status badges
  • Nested detail formatting
  • Error highlighting

Installation

Prerequisites

  • Python 3.8+
  • pip package manager
  • OpenAI API key
  • LlamaParse API key

Setup

  1. Clone the repository into your deployment directory, then change into it:

    cd /path/to/deployment
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Configure API keys (see Security Considerations):

    • OpenAI API key for GPT analysis
    • LlamaParse API key for PDF parsing
  4. Set up paths:

    Update the hardcoded paths in the launcher and core engine scripts:

    # HM_launcher_CLI.py
    PROFILE_PATH = "/opt/QC/profiles/HM.json"
    
    # qc_module.py
    REPORTS_DIR = os.path.join(os.path.dirname(__file__), "reports")
    
  5. Create working directories:

    mkdir -p /tmp/HM_working
    mkdir -p /opt/QC/reports
    

Environment Configuration

The HMQC system supports two environments to enable safe local development without affecting production:

Development Environment

For local testing and development:

# Set environment variable
export HM_QC_ENV=dev

# Run checks (uses local paths in ./tmp/)
python launchers/HM_launcher_CLI.py test.pdf ./tmp/reports/report.html

Development paths:

  • Working directory: ./tmp/HM_working/
  • Reports: ./tmp/reports/
  • Supporting files: ./supporting/

Production Environment

For deployed production systems (default):

# Unset or use production mode
unset HM_QC_ENV
# or
export HM_QC_ENV=production

# Run checks (uses /opt/QC paths)
python launchers/HM_launcher_CLI.py input.pdf /opt/QC/reports/

Production paths:

  • Working directory: /tmp/HM_working/
  • Reports: /opt/QC/reports/
  • Supporting files: /opt/QC/supporting/

Configuration Module

The config.py module automatically handles environment detection:

import config

# Check current environment
print(config.ENVIRONMENT)  # 'dev' or 'production'

# Get paths
print(config.WORKING_DIR)
print(config.REPORTS_DIR)

See DEV_SETUP.md for complete development environment guide.
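As a rough picture of what environment detection might look like, the sketch below maps an `HM_QC_ENV` value to the paths listed in the two sections above. It is not the contents of config.py: the function `resolve_paths`, the `SUPPORTING_DIR` key, and the fallback of unknown values to production are assumptions.

```python
import os


def resolve_paths(env_value, project_root="."):
    """Map an HM_QC_ENV value to the documented directories (hypothetical helper)."""
    if env_value == "dev":
        return {
            "ENVIRONMENT": "dev",
            "WORKING_DIR": os.path.join(project_root, "tmp", "HM_working"),
            "REPORTS_DIR": os.path.join(project_root, "tmp", "reports"),
            "SUPPORTING_DIR": os.path.join(project_root, "supporting"),
        }
    # Unset (None) or any other value is treated as production (assumption)
    return {
        "ENVIRONMENT": "production",
        "WORKING_DIR": "/tmp/HM_working",
        "REPORTS_DIR": "/opt/QC/reports",
        "SUPPORTING_DIR": "/opt/QC/supporting",
    }


paths = resolve_paths(os.environ.get("HM_QC_ENV"))
```

Defaulting to production when the variable is unset matches the behaviour described above, but it also means a developer who forgets to export `HM_QC_ENV=dev` will write to the production paths.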


Configuration

Profile JSON Structure

Profiles define the sequence of checks and their parameters:

[
  {
    "id": "HM_parse",
    "script": "checks.HM_parse",
    "config": {
      "description": "Parses PDF with LlamaParse",
      "input_file": "supplied by launcher script",
      "working_dir": "/tmp/HM_working"
    }
  },
  {
    "id": "HM_filename_parse",
    "script": "checks.HM_filename_parse",
    "config": {
      "description": "Parses filename into components",
      "working_dir": "/tmp/HM_working"
    }
  }
]

Key Fields:

  • id: Unique check identifier (used for context storage)
  • script: Python module path (e.g., checks.HM_parse)
  • config: Check-specific parameters
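A profile can be sanity-checked before it is run. The sketch below assumes only the three key fields listed above; `load_profile` and the duplicate-id rule are illustrative assumptions, not behaviour confirmed for qc_module.py.

```python
import json

REQUIRED_FIELDS = ("id", "script", "config")


def load_profile(text: str) -> list:
    """Parse profile JSON text and check the three key fields (hypothetical helper)."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        raise ValueError("profile must be a JSON array of check entries")
    seen_ids = set()
    for index, entry in enumerate(entries):
        missing = [field for field in REQUIRED_FIELDS if field not in entry]
        if missing:
            raise ValueError(f"entry {index} is missing fields: {missing}")
        # ids are used as context keys, so a duplicate would overwrite earlier data
        if entry["id"] in seen_ids:
            raise ValueError(f"duplicate check id: {entry['id']}")
        seen_ids.add(entry["id"])
    return entries
```

Usage would be along the lines of `load_profile(open("profiles/HM.json").read())`, failing early with a readable message instead of part-way through a QC run.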

H&M Filename Format

The system expects H&M filenames to follow these patterns:

Pattern 1: dimensions_format_year_reference-number_language-country.pdf

  • Example: 21.6x27.9cm_letter_2028_10062-01_en-us.pdf

Pattern 2: dimensions_format_reference-number_(GEN|CEN).pdf

  • Example: 10.8x14cm_quarter_letter_1001D_10004-02_GEN.pdf
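To make Pattern 1 concrete, here is a regular-expression sketch that splits the example filename into its parts. HM_filename_parse itself uses GPT rather than a regex, so this is only an illustration; the expression and the name `parse_pattern_1` are assumptions and cover just the shape of the Pattern 1 example.

```python
import re
from typing import Optional

# dimensions_format_year_reference-number_language-country.pdf
PATTERN_1 = re.compile(
    r"^(?P<dimensions>[\d.]+x[\d.]+cm)"
    r"_(?P<format>.+?)"
    r"_(?P<year>\d{4})"
    r"_(?P<reference>\d+-\d+)"
    r"_(?P<language>[A-Za-z]{2}-[A-Za-z]{2})"
    r"\.pdf$"
)


def parse_pattern_1(filename: str) -> Optional[dict]:
    """Return the named parts of a Pattern 1 filename, or None if it does not match."""
    match = PATTERN_1.match(filename)
    return match.groupdict() if match else None


parsed = parse_pattern_1("21.6x27.9cm_letter_2028_10062-01_en-us.pdf")
```

The Pattern 2 example (ending in `_GEN.pdf`) does not match this expression and returns `None`, so a real parser would need a second pattern or a fallback for GEN/CEN files.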

Usage

Basic CLI Usage

python launchers/HM_launcher_CLI.py input.pdf /path/to/reports/

This will:

  1. Execute all checks defined in /opt/QC/profiles/HM.json
  2. Generate an HTML report in the specified directory
  3. Print JSON results to stdout

Box Hotfolder Integration

python launchers/ford_qc_box_hotfolder_process.py

Configuration:

  • Modify BOX_CLI_CONFIG_PATH to point to your Box JWT config
  • Set SOURCE_FOLDER_ID and REPORT_FOLDER_ID
  • Update PROFILE_PATH to your desired profile

Workflow:

  1. Script polls Box source folder for .zip files
  2. Downloads and processes each file
  3. Generates QC report
  4. Uploads report to Box reports folder
  5. Deletes processed files

Check Modules

The HMQC system includes separate check profiles for PDF documents and static images.

PDF Check Modules

HM_parse

Purpose: Parses PDF using LlamaParse API

Context Output:

  • extracted_text: Full text content
  • parsed_image: First page as PIL Image
  • all_images: List of all page images

Dependencies: LlamaParse API


HM_filename_parse

Purpose: Extracts components from H&M PDF filename using GPT

Context Input: HM_parse.filename

Context Output:

{
    "parsed": {
        "dimensions": "21.6x27.9cm",
        "format": "letter",
        "year": "2028",
        "reference": "9000_10107-06",  # Full reference with prefix
        "language": "en-us"
    }
}

Updated: Now correctly extracts full reference codes including numeric prefixes (e.g., 9000_10107-06)


HM_imprint_check

Purpose: Verifies imprint/reference code matches filename including country code

Context Input:

  • HM_parse.extracted_text
  • HM_filename_parse.parsed.reference
  • HM_filename_parse.parsed.language

Logic:

  1. Uses GPT to detect imprint in document
  2. Combines reference code with language/country code (e.g., 9000_10107-06_el-CY)
  3. Compares against detected imprint
  4. Skips for "OOH" (out-of-home) files

Updated: Now includes country code in validation to detect mismatches like CY vs GR
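Steps 2 and 3 of the logic above amount to building an expected string and comparing it with what was detected. A minimal sketch follows; the function names and the case normalisation (lower-case language, upper-case country) are assumptions, and the GPT-based detection step is left out.

```python
def expected_imprint(reference: str, language: str) -> str:
    """Join reference and language/country code, e.g. 9000_10107-06_el-CY."""
    lang, _, country = language.partition("-")
    # Assumption: imprints print the language in lower case and the country in upper case
    code = f"{lang.lower()}-{country.upper()}" if country else lang
    return f"{reference}_{code}"


def imprint_matches(detected: str, reference: str, language: str) -> bool:
    """Compare a detected imprint string with the one expected from the filename."""
    return detected.strip() == expected_imprint(reference, language)
```

With this comparison, a document whose imprint reads `9000_10107-06_el-GR` fails against a filename ending in `el-CY`, which is the CY vs GR mismatch the updated check is meant to catch.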


Image Check Modules

HM_image_parse

Purpose: Loads and processes static image files (JPG, PNG, PSD)

Context Output:

  • parsed_image: PIL Image object
  • filename: Original filename
  • image_format: Format (JPEG, PNG, PSD)
  • image_size: (width, height) in pixels
  • image_mode: Color mode (RGB, RGBA, etc.)

HM_image_filename_parse

Purpose: Parses image filename using pattern matching

Context Input: HM_image_parse.filename

Context Output:

{
    "parsed": {
        "language": "en-US",
        "format_type": "DOOH",
        "campaign_number": "4045",
        "dimensions": "1080x1920"
    }
}

Supported Formats:

  • SOME STATIC: Market_Language_campaignnumber_...
  • DOOH/OOH/Display: CampaignNumber_Type_Static_..._Size_Language-Market
  • POS GEN: Size_Format_Campaign_POPNumber_GEN
  • POS Country: Size_Format_Campaign_POPNumber_Language-Market
  • DS: Campaign_Name_Index_BU_Resolution_language-COUNTRY

HM_image_dimension_check (NEW)

Purpose: Validates actual image dimensions match filename specification

Context Input:

  • HM_image_parse.parsed_image
  • HM_image_filename_parse.parsed.dimensions

Logic:

  1. Extracts expected dimensions from filename (e.g., 1200x400)
  2. Gets actual image dimensions from PIL Image object
  3. Compares width × height in pixels

Example:

  • Filename: campaign_1200x400_en-US.jpg
  • Expected: 1200px × 400px
  • Actual: 1200px × 400px → PASS
  • Actual: 1500px × 600px → ERROR
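The comparison in the example above can be sketched as follows. To keep the snippet free of third-party imports it takes the actual size as a `(width, height)` tuple, which is what a PIL image's `.size` attribute provides; the function name `check_dimensions` and the error wording are assumptions.

```python
import re


def check_dimensions(expected: str, actual_size: tuple) -> dict:
    """Compare a WIDTHxHEIGHT string from the filename with an actual (width, height)."""
    match = re.fullmatch(r"(\d+)x(\d+)", expected.strip().lower())
    if match is None:
        return {"status": "error", "error_message": f"cannot parse dimensions: {expected!r}"}
    expected_size = (int(match.group(1)), int(match.group(2)))
    details = {"expected": expected_size, "actual": tuple(actual_size)}
    if expected_size == tuple(actual_size):
        return {"status": "passed", "details": details}
    return {
        "status": "error",
        "details": details,
        "error_message": f"expected {expected_size}, got {tuple(actual_size)}",
    }
```

For the filename above, `check_dimensions("1200x400", (1200, 400))` passes and `check_dimensions("1200x400", (1500, 600))` returns an error result.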

Shared Check Modules

These checks work for both PDFs and images:

HM_language_validate

Purpose: Validates document/image language matches filename code

Context Input:

  • HM_parse.extracted_text OR HM_image_parse.parsed_image
  • HM_filename_parse.parsed.language OR HM_image_filename_parse.parsed.language

Context Output:

  • detected_language: ISO language code (e.g., "en-US")
  • matches: Boolean
  • isCensorshipRequired: True for CEN markets

Special Cases:

  • CEN: Censored market (requires body coverage checks)
  • GEN: General market (no censorship required)

HM_price_currency_check

Purpose: Validates currency matches regional expectations

Context Input:

  • Content (text or image)
  • Language/market code

Logic:

  1. Detects currency and price using multimodal GPT analysis
  2. Validates currency against region (e.g., USD for en-us)
  3. Skips for CEN/GEN markets
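Steps 2 and 3 can be pictured as a lookup plus a skip rule. In the sketch below only the `en-us` to USD pairing comes from this document; the other table entries, the function name, and the decision to skip markets with no configured expectation are illustrative assumptions. The GPT detection step is not shown.

```python
# Illustrative mapping; only "en-us" -> "USD" is taken from the README
EXPECTED_CURRENCY = {"en-us": "USD", "en-gb": "GBP", "de-de": "EUR"}


def check_currency(market_code: str, detected_currency: str) -> dict:
    """Validate a detected currency code against the market's expected currency."""
    if market_code.upper() in {"CEN", "GEN"}:
        return {"status": "skipped", "details": {"reason": "CEN/GEN market"}}
    expected = EXPECTED_CURRENCY.get(market_code.lower())
    if expected is None:
        # Assumption: unknown markets are skipped rather than failed
        return {"status": "skipped", "details": {"reason": "no expected currency configured"}}
    details = {"expected": expected, "detected": detected_currency.upper()}
    if detected_currency.upper() == expected:
        return {"status": "passed", "details": details}
    return {
        "status": "error",
        "details": details,
        "error_message": f"expected {expected} for {market_code}",
    }
```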

HM_censorship

Purpose: Verifies proper clothing coverage for CEN (censored) markets ONLY

Context Input:

  • parsed_image (from HM_parse or HM_image_parse)
  • parsed.language (from filename parse)

Process:

  1. SKIPS for GEN files - GEN assets do not require censorship checks
  2. SKIPS for standard market files - Only CEN files are checked
  3. RUNS for CEN files only - Validates body coverage using AI
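The three rules above reduce to a single gate on the market code. A minimal sketch, assuming the code taken from the filename is the literal string `CEN`, `GEN`, or a language-country pair; the function names are hypothetical.

```python
def censorship_required(market_code: str) -> bool:
    """Only CEN files are subject to the censorship check."""
    return market_code.strip().upper() == "CEN"


def censorship_gate(market_code: str):
    """Return a skipped result for non-CEN files, or None when the check should run."""
    if censorship_required(market_code):
        return None
    return {
        "status": "skipped",
        "details": {"reason": f"censorship check applies to CEN only, got {market_code}"},
    }
```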

Algorithm:

  1. Trains DSPy-based classifier using training images
  2. Analyzes document images for body coverage
  3. Validates against censorship requirements

Training Set: supporting/censorship_trainset/ (/opt/QC/supporting/censorship_trainset/ in production)

  • Censored images: *-C.png
  • Uncensored images: *-U.png
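The naming convention makes it straightforward to derive labels from filenames. The sketch below groups a list of names by suffix; it works on names only and does not touch DSPy or the images themselves, and the function name `split_trainset` is an assumption.

```python
from pathlib import PurePath


def split_trainset(filenames) -> dict:
    """Group training filenames by the -C / -U suffix convention."""
    groups = {"censored": [], "uncensored": [], "ignored": []}
    for name in filenames:
        path = PurePath(name)
        if path.suffix.lower() != ".png":
            groups["ignored"].append(name)
        elif path.stem.endswith("-C"):
            groups["censored"].append(name)
        elif path.stem.endswith("-U"):
            groups["uncensored"].append(name)
        else:
            groups["ignored"].append(name)
    return groups
```

Listing the `ignored` group before training is a cheap way to spot misnamed images that would otherwise be silently left out of the training set.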

Technology: Uses DSPy with MIPROv2 optimization

Updated: Now correctly only runs checks on CEN files, not GEN files


API Integrations

1. LlamaParse

Purpose: PDF parsing and image extraction

Usage:

  • Text extraction: result_type="text"
  • Image generation: result_type="markdown" with multimodal model

Key in Code: checks/HM_parse.py:14


2. OpenAI GPT-4o

Purpose: Content analysis, language detection, validation

Usage:

  • Synchronous client in analyze_with_gpt.py
  • Multimodal support (text + images)
  • JSON and text response modes

Models Used:

  • gpt-4o: Primary analysis model
  • gpt-4o-mini: Used in DSPy for censorship detection

Key in Code: checks/analyze_with_gpt.py:13


3. DSPy

Purpose: AI-powered censorship detection

Components:

  • ImageDescription: Generates clothing coverage descriptions
  • ImageCensorshipDetection: Classifies censorship status
  • MIPROv2: Optimizes prompts based on training set

Key in Code: checks/HM_censorship.py:11


4. Box Python SDK

Purpose: Cloud storage integration for automated workflows

Authentication: JWT-based service account

Configuration File: ford_box_config.json


Security Considerations

CRITICAL ISSUES IDENTIFIED

⚠️ HARDCODED API KEYS FOUND ⚠️

The following files contain hardcoded API keys that should be immediately addressed:

  1. checks/HM_parse.py:14

    os.environ['LLAMA_CLOUD_API_KEY'] = '<REDACTED>'
    
  2. checks/analyze_with_gpt.py:13

    client = OpenAI(api_key="<REDACTED>")
    
  3. checks/HM_censorship.py:11

    os.environ["OPENAI_API_KEY"] = "sk-proj-LaFeLI2v1p9TkGOIifAJT3BlbkFJk7SuBc0VkmmrRt5y9cQg"
    
Recommended Fixes

  1. Use Environment Variables:

    # Replace hardcoded keys with:
    import os
    OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
    LLAMA_CLOUD_API_KEY = os.getenv('LLAMA_CLOUD_API_KEY')
    
  2. Use .env Files (with python-dotenv):

    import os

    from dotenv import load_dotenv
    load_dotenv()  # loads variables from a local .env file into os.environ
    
    OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
    
  3. Secret Management Systems:

    • AWS Secrets Manager
    • HashiCorp Vault
    • Azure Key Vault
  4. Rotate Exposed Keys:

    • All hardcoded keys should be rotated immediately
    • Monitor API usage for unauthorized access
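When keys move to environment variables, it is worth failing fast if one is missing rather than sending an empty key to an API. A minimal sketch, assuming the variable names used above; `require_env` is a hypothetical helper.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

A check module could then call `require_env("OPENAI_API_KEY")` once at start-up, so a misconfigured deployment stops with a readable message instead of an authentication failure in the middle of a run.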

Additional Security Considerations

  • File System Access: The system writes to /tmp/HM_working and /opt/QC/reports/ - ensure proper permissions
  • Box Authentication: JWT config contains sensitive credentials - protect ford_box_config.json
  • Working Directory Cleanup: Temporary files contain potentially sensitive document content
  • Context Snapshots: Be cautious about logging context data (disabled by default in qc_module.py:117)

Development

Adding a New Check Module

  1. Create check file in checks/ directory:

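    # checks/my_new_check.py
    
    def validate_something(input_data: dict) -> bool:
        # Placeholder rule: replace with the real validation for this check
        return bool(input_data)
    
    def run_check(config: dict, context: dict, check_id: str) -> dict:
        # 1. Get required data from context ("previous_check" is the id of an upstream check)
        input_data = context.get("previous_check", {})
    
        # 2. Perform validation logic
        result = validate_something(input_data)
    
        # 3. Store results in context so downstream checks can read them
        context[check_id] = {
            "validation_result": result
        }
    
        # 4. Return status
        if result:
            return {"status": "passed", "details": {"validation_result": result}}
        return {
            "status": "error",
            "details": {"validation_result": result},
            "error_message": "Validation failed for data from previous_check",
        }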
    
  2. Add to profile JSON:

    {
        "id": "my_new_check",
        "script": "checks.my_new_check",
        "config": {
            "description": "Validates something important",
            "param1": "value1"
        }
    }
    

Testing

Manual Testing:

python launchers/HM_launcher_CLI.py test_file.pdf ./test_reports/

Context Debugging:

Uncomment line 117 in qc_module.py to enable context snapshots:

aggregated_results["context_snapshot"] = context

Troubleshooting

Common Issues

1. "Module not found" errors

  • Ensure /opt/QC is in your Python path
  • Check sys.path.append() statements in launchers

2. "Working directory not found"

  • Create /tmp/HM_working manually: mkdir -p /tmp/HM_working
  • Verify permissions on working directories

3. API authentication failures

  • Verify API keys are valid and not expired
  • Check API usage limits/quotas
  • Ensure proper environment variable configuration

4. Box integration issues

  • Validate JWT config file path
  • Ensure service account has folder access
  • Check Box folder IDs are correct

5. PDF parsing failures

  • Verify LlamaParse API key and quota
  • Check PDF is not corrupted or password-protected
  • Ensure sufficient disk space in working directory

6. Censorship check failures

  • Verify training set images exist in /opt/QC/supporting/censorship_trainset/
  • Ensure images follow naming convention (*-C.png for censored, *-U.png for uncensored)
  • Check DSPy model configuration

Logging

CLI Launcher: Uses Python logging module at INFO level

Box Hotfolder: Logs to log/ford_qc_script.log

Increasing Verbosity:

logging.basicConfig(level=logging.DEBUG)

Dependencies

Core Dependencies

  • Python 3.8+
  • llama-parse: PDF parsing and image extraction
  • openai: GPT-4o API integration
  • dspy: AI-powered decision systems
  • PIL/Pillow: Image processing
  • boxsdk: Box cloud storage integration
  • nest_asyncio: Async event loop management

Full Dependency List

See requirements.txt for complete list including:

  • fastapi, uvicorn, httpx (web framework components)
  • pandas, numpy (data processing)
  • nltk, tiktoken (text processing)
  • APScheduler (task scheduling)
  • datasets, optuna (ML/AI components)

System Requirements

Hardware

  • CPU: Multi-core recommended for DSPy optimization
  • RAM: 4GB minimum, 8GB+ recommended
  • Storage: 500MB + space for temporary files

Operating System

  • Linux (Ubuntu 20.04+)
  • macOS (tested on Darwin 24.5.0)
  • Windows (with path adjustments)

Python Version

  • Python 3.8 - 3.11 (tested)
  • Not tested with Python 3.12+

Roadmap / Known Limitations

Current Limitations

  1. Hardcoded file paths - FIXED with environment configuration (config.py)
  2. Hardcoded API keys in source code
  3. Limited error recovery in check chains
  4. No parallel check execution
  5. Box hotfolder requires manual startup

Recent Improvements

  • Environment Configuration - Dev/production mode support
  • Image QC Support - Full validation for JPG, PNG, PSD files
  • Image Dimension Check - Validates pixel dimensions against filename
  • Enhanced Imprint Check - Now includes country code validation
  • Fixed Censorship Rules - CEN-only checking (GEN files skip)

Future Improvements

  • Parallel check execution where possible
  • Retry logic for API failures
  • Real-time progress reporting
  • Dashboard for QC history
  • Support for additional file formats
  • Externalize API keys to environment variables

License

[Specify license here]


Contact / Support

For issues or questions:

  • Check existing documentation in CLAUDE.md
  • Review code comments in individual modules
  • Contact development team

Appendix: File Reference

Critical Files

| File | Purpose | Contains Sensitive Data |
|------|---------|-------------------------|
| qc_module.py | Core orchestration engine | No |
| checks/analyze_with_gpt.py | OpenAI integration | ⚠️ API Key |
| checks/HM_parse.py | PDF parsing | ⚠️ API Key |
| checks/HM_censorship.py | Censorship detection | ⚠️ API Key |
| launchers/ford_qc_box_hotfolder_process.py | Box integration | ⚠️ Box Config Path |
| profiles/HM.json | H&M check configuration | No |
| ford_box_config.json | Box JWT credentials | ⚠️ Yes |

Last Updated: 2025-01-12
Version: 1.2
Maintainer: H&M QC Team

Recent Updates (v1.2)

  • Added development environment configuration
  • Implemented image QC checks (JPG, PNG, PSD support)
  • Added image dimension validation
  • Enhanced imprint check with country code validation
  • Fixed censorship rules (CEN-only, GEN skip)
  • Environment-aware path configuration