H&M Quality Control (HMQC) System

Overview

The H&M Quality Control (HMQC) system is a modular Python application designed to perform automated quality control checks on PDF files for H&M marketing assets. The system validates assets against multiple criteria including filename formatting, imprint verification, language validation, pricing accuracy, and censorship requirements.

Architecture

The HMQC system follows a modular, profile-based architecture:

┌─────────────┐
│  Launcher   │ (CLI or Box Hotfolder)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  QC Module  │ (Core Engine)
└──────┬──────┘
       │
       ├──► Profile JSON (defines checks)
       │
       ├──► Check Modules (modular validation)
       │       ├─ HM_parse
       │       ├─ HM_filename_parse
       │       ├─ HM_imprint_check
       │       ├─ HM_language_validate
       │       ├─ HM_price_currency_check
       │       └─ HM_censorship
       │
       └──► HTML Reporter (generates reports)

Core Principles

  1. Modularity: Each check is an independent module implementing a standard interface
  2. Context Sharing: Checks share data via a context dictionary for inter-check dependencies
  3. Profile-Based Configuration: JSON profiles define which checks to run and their parameters
  4. Standardized Results: All checks return consistent status objects (passed, error, skipped)

Project Structure

deployed/
├── qc_module.py              # Core QC engine
├── requirements.txt          # Python dependencies
├── CLAUDE.md                 # Development documentation
│
├── launchers/                # Execution entry points
│   ├── HM_launcher_CLI.py    # Command-line interface
│   └── ford_qc_box_hotfolder_process.py  # Box integration
│
├── checks/                   # QC check modules
│   ├── HM_parse.py           # PDF parsing with LlamaParse
│   ├── HM_filename_parse.py  # Filename validation
│   ├── HM_imprint_check.py   # Imprint verification
│   ├── HM_language_validate.py  # Language detection
│   ├── HM_price_currency_check.py  # Currency validation
│   ├── HM_censorship.py      # Censorship compliance (DSPy)
│   ├── analyze_with_gpt.py   # OpenAI GPT integration
│   ├── html_reporter.py      # HTML report generation
│   ├── business_data_check.py
│   ├── colour_existence_check.py
│   ├── file_size_check.py
│   ├── image_linking_check.py
│   ├── image_resolution_check.py
│   ├── missing_images_check.py
│   ├── special_requirements_mec_bau.py
│   └── unzip_and_verify_check.py
│
├── profiles/                 # Check configurations
│   ├── HM.json               # H&M standard profile
│   └── ford_bnp.json         # Ford BNP profile
│
├── supporting/               # Supporting assets
│   ├── censorship_trainset/  # Training images for censorship detection
│   └── HM_Pricing SLUSSEN_30-07-2024.pdf
│
├── utils/                    # Utilities
│   ├── report.py
│   ├── input_report.json
│   └── qc_report.html
│
└── input_bucket/             # Input file staging area

Key Components

1. QC Module (qc_module.py)

The core engine that orchestrates check execution.

Key Functions:

  • run_qc_profile(profile_path, input_file): Executes all checks defined in a profile
  • run_single_check(script, config, context, check_id): Runs an individual check module
  • run_qc_checks(profile_path, input_file, report_path): Full workflow including report generation

Context Management: Results from each check are stored in a shared context dictionary, enabling downstream checks to access data from upstream checks.
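The context-sharing flow described above can be sketched as follows. This is a minimal illustration, not the code in qc_module.py: the `run_checks` function, the in-memory `registry` (standing in for importing each `script` module), and the two toy checks are all assumptions made for the example.

```python
from typing import Callable, Dict, List

CheckFn = Callable[[dict, dict, str], dict]


def run_checks(checks: List[dict], registry: Dict[str, CheckFn]) -> dict:
    """Run checks in profile order, passing one shared context dict to each."""
    context: dict = {}
    results: dict = {}
    for entry in checks:
        check_id = entry["id"]
        run_check = registry[entry["script"]]  # hypothetical stand-in for a module import
        try:
            results[check_id] = run_check(entry.get("config", {}), context, check_id)
        except Exception as exc:  # a crashing check is reported as an "error" result
            results[check_id] = {"status": "error", "error_message": str(exc)}
    return results


def parse_check(config: dict, context: dict, check_id: str) -> dict:
    # Upstream check: stores its output in the context under its own id
    context[check_id] = {"extracted_text": "SALE 19.99 USD"}
    return {"status": "passed", "details": {}}


def currency_check(config: dict, context: dict, check_id: str) -> dict:
    # Downstream check: reads what the upstream check stored
    text = context.get("HM_parse", {}).get("extracted_text", "")
    found = "USD" in text
    context[check_id] = {"currency_found": found}
    return {"status": "passed" if found else "error", "details": {"currency_found": found}}


profile = [
    {"id": "HM_parse", "script": "checks.HM_parse", "config": {}},
    {"id": "HM_price_currency_check", "script": "checks.HM_price_currency_check", "config": {}},
]
registry = {
    "checks.HM_parse": parse_check,
    "checks.HM_price_currency_check": currency_check,
}
results = run_checks(profile, registry)
```

Because the context is keyed by check id, the order of entries in the profile determines which upstream data is available to each check.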

2. Launchers

CLI Launcher (HM_launcher_CLI.py)

Command-line interface for manual QC execution.

Usage:

python launchers/HM_launcher_CLI.py <path_to_input_file> <path_to_save_report_html>

Box Hotfolder Launcher (ford_qc_box_hotfolder_process.py)

Automated processing of files uploaded to Box folders.

Features:

  • Monitors Box folder for new .zip files
  • Downloads, processes, and deletes files
  • Uploads QC reports back to Box
  • File locking to prevent concurrent runs
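One common way to prevent concurrent runs on POSIX systems is an advisory lock on a lock file. The sketch below shows that general technique; the `single_instance` helper and the lock path are hypothetical, and the actual launcher may implement locking differently. The `fcntl` module is not available on Windows.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path: str):
    """Hold an exclusive, non-blocking lock on lock_path while the block runs."""
    handle = open(lock_path, "w")
    try:
        # Raises BlockingIOError immediately if another holder has the lock
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield handle
    finally:
        handle.close()  # closing the file releases the lock


# Hypothetical lock file location used only for this demonstration
LOCK_PATH = os.path.join(tempfile.gettempdir(), "qc_hotfolder_example.lock")

with single_instance(LOCK_PATH):
    try:
        with single_instance(LOCK_PATH):  # simulates a second run starting
            second_run_started = True
    except BlockingIOError:
        second_run_started = False
```

The lock is released automatically if the process exits or crashes, which avoids stale lock files blocking later runs.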

3. Check Module Pattern

All check modules implement a standard interface:

def run_check(config: dict, context: dict, check_id: str) -> dict:
    """
    Args:
        config: Configuration parameters from profile JSON
        context: Shared context dictionary for inter-check data
        check_id: Unique identifier for this check

    Returns:
        {
            "status": "passed" | "error" | "skipped",
            "details": {...},
            "error_message": "..." (if status == "error")
        }
    """
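
A small helper can confirm that a check's return value follows this contract. The sketch below is illustrative only; `validate_result` is a hypothetical name and is not part of the repository as described.

```python
VALID_STATUSES = {"passed", "error", "skipped"}


def validate_result(result) -> list:
    """Return a list of problems found in a check result (empty if it is well formed)."""
    if not isinstance(result, dict):
        return ["result is not a dict"]
    problems = []
    status = result.get("status")
    if status not in VALID_STATUSES:
        problems.append(f"unknown status: {status!r}")
    if status == "error" and not result.get("error_message"):
        problems.append("error result is missing error_message")
    if "details" in result and not isinstance(result["details"], dict):
        problems.append("details must be a dict")
    return problems
```

Running such a helper on each result before it reaches the reporter makes a malformed check easier to spot during development.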

4. HTML Reporter (html_reporter.py)

Generates Bootstrap-based HTML reports with:

  • Collapsible accordion for each check
  • Color-coded status badges
  • Nested detail formatting
  • Error highlighting
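
As an illustration of color-coded status badges, the sketch below maps each status to a Bootstrap badge class. It assumes Bootstrap 5 class names (`bg-success`, `bg-danger`, `bg-secondary`); the `status_badge` function is hypothetical and the markup produced by html_reporter.py may differ.

```python
from html import escape

# Assumed mapping from check status to Bootstrap 5 background classes
BADGE_CLASS = {
    "passed": "bg-success",
    "error": "bg-danger",
    "skipped": "bg-secondary",
}


def status_badge(status: str) -> str:
    """Render a status as a Bootstrap badge, escaping the label for safe HTML output."""
    css = BADGE_CLASS.get(status, "bg-warning")  # unknown statuses get a warning color
    return f'<span class="badge {css}">{escape(status)}</span>'
```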

Installation

Prerequisites

  • Python 3.8+
  • pip package manager
  • OpenAI API key
  • LlamaParse API key

Setup

  1. Change into the deployment directory (after cloning or copying the repository there):

    cd /path/to/deployment
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Configure API keys (see Security Considerations):

    • OpenAI API key for GPT analysis
    • LlamaParse API key for PDF parsing
  4. Set up paths:

    Update the hardcoded paths in the launcher script and in qc_module.py:

    # HM_launcher_CLI.py
    PROFILE_PATH = "/opt/QC/profiles/HM.json"
    
    # qc_module.py
    REPORTS_DIR = os.path.join(os.path.dirname(__file__), "reports")
    
  5. Create working directories:

    mkdir -p /tmp/HM_working
    mkdir -p /opt/QC/reports
    

Configuration

Profile JSON Structure

Profiles define the sequence of checks and their parameters:

[
  {
    "id": "HM_parse",
    "script": "checks.HM_parse",
    "config": {
      "description": "Parses PDF with LlamaParse",
      "input_file": "supplied by launcher script",
      "working_dir": "/tmp/HM_working"
    }
  },
  {
    "id": "HM_filename_parse",
    "script": "checks.HM_filename_parse",
    "config": {
      "description": "Parses filename into components",
      "working_dir": "/tmp/HM_working"
    }
  }
]

Key Fields:

  • id: Unique check identifier (used for context storage)
  • script: Python module path (e.g., checks.HM_parse)
  • config: Check-specific parameters
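
Since the check id doubles as the context key, it can help to validate a profile before running it. The following is a minimal sketch under that assumption; `load_profile` is a hypothetical helper, not a function from qc_module.py.

```python
import json

REQUIRED_FIELDS = ("id", "script", "config")


def load_profile(text: str) -> list:
    """Parse profile JSON and verify required fields and unique check ids."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        raise ValueError("profile must be a JSON array of check entries")
    seen = set()
    for index, entry in enumerate(entries):
        missing = [field for field in REQUIRED_FIELDS if field not in entry]
        if missing:
            raise ValueError(f"entry {index} is missing fields: {missing}")
        if entry["id"] in seen:
            # Duplicate ids would overwrite each other in the shared context
            raise ValueError(f"duplicate check id: {entry['id']}")
        seen.add(entry["id"])
    return entries


example = """
[
  {"id": "HM_parse", "script": "checks.HM_parse", "config": {"working_dir": "/tmp/HM_working"}},
  {"id": "HM_filename_parse", "script": "checks.HM_filename_parse", "config": {}}
]
"""
checks = load_profile(example)
```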

H&M Filename Format

The system expects H&M filenames to follow these patterns:

Pattern 1: dimensions_format_year_reference-number_language-country.pdf

  • Example: 21.6x27.9cm_letter_2028_10062-01_en-us.pdf

Pattern 2: dimensions_format_reference-number_(GEN|CEN).pdf

  • Example: 10.8x14cm_quarter_letter_1001D_10004-02_GEN.pdf
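
For quick local validation, Pattern 1 can be approximated with a regular expression, and the GEN/CEN marker can be detected from the filename suffix. This is only a sketch: the HM_filename_parse check is described as using GPT rather than a regex, and the exact character classes below are assumptions inferred from the two examples.

```python
import re
from typing import Optional

# Pattern 1: dimensions_format_year_reference-number_language-country.pdf
PATTERN_1 = re.compile(
    r"^(?P<dimensions>\d+(?:\.\d+)?x\d+(?:\.\d+)?cm)"
    r"_(?P<format>[A-Za-z]+)"
    r"_(?P<year>\d{4})"
    r"_(?P<reference>\d+-\d+)"
    r"_(?P<language>[a-z]{2}-[a-z]{2})\.pdf$"
)

# Pattern 2 files end in a market marker instead of a language-country code
MARKET_SUFFIX = re.compile(r"_(GEN|CEN)\.pdf$")


def parse_pattern_1(filename: str) -> Optional[dict]:
    """Return the named components of a Pattern 1 filename, or None if it does not match."""
    match = PATTERN_1.match(filename)
    return match.groupdict() if match else None


def market_marker(filename: str) -> Optional[str]:
    """Return "GEN" or "CEN" if the filename ends with a market marker, else None."""
    match = MARKET_SUFFIX.search(filename)
    return match.group(1) if match else None
```

Note that the Pattern 2 example contains more underscore-separated parts than the stated pattern, so the sketch only inspects its suffix instead of splitting every component.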

Usage

Basic CLI Usage

python launchers/HM_launcher_CLI.py input.pdf /path/to/reports/

This will:

  1. Execute all checks defined in /opt/QC/profiles/HM.json
  2. Generate an HTML report in the specified directory
  3. Print JSON results to stdout

Box Hotfolder Integration

python launchers/ford_qc_box_hotfolder_process.py

Configuration:

  • Modify BOX_CLI_CONFIG_PATH to point to your Box JWT config
  • Set SOURCE_FOLDER_ID and REPORT_FOLDER_ID
  • Update PROFILE_PATH to your desired profile

Workflow:

  1. Script polls Box source folder for .zip files
  2. Downloads and processes each file
  3. Generates QC report
  4. Uploads report to Box reports folder
  5. Deletes processed files

Check Modules

HM_parse

Purpose: Parses PDF using LlamaParse API

Context Output:

  • extracted_text: Full text content
  • parsed_image: First page as PIL Image
  • all_images: List of all page images

Dependencies: LlamaParse API


HM_filename_parse

Purpose: Extracts components from H&M filename using GPT

Context Input: HM_parse.filename

Context Output:

{
    "parsed": {
        "dimensions": "21.6x27.9cm",
        "format": "letter",
        "year": "2028",
        "reference": "10062-01",
        "language": "en-us"
    }
}

HM_imprint_check

Purpose: Verifies imprint/reference code matches filename

Context Input:

  • HM_parse.extracted_text
  • HM_filename_parse.parsed.reference

Logic:

  1. Uses GPT to detect imprint in document
  2. Compares against expected reference from filename
  3. Skips for "OOH" (out-of-home) files

HM_language_validate

Purpose: Validates document language matches filename code

Context Input:

  • HM_parse.extracted_text
  • HM_parse.parsed_image
  • HM_filename_parse.parsed.language

Context Output:

  • detected_language: Language-region code (e.g., "en-US")
  • matches: Boolean
  • isCensorshipRequired: True for CEN markets

Special Cases:

  • CEN: Censored market (requires body coverage checks)
  • GEN: General market (no censorship required)
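
The comparison and the special cases can be sketched as below. This is an assumption-laden illustration: `language_result` is a hypothetical helper, the real check detects the language with GPT, and leaving `matches` as `None` for CEN/GEN files is a choice made for this example only.

```python
def normalize_code(code: str) -> str:
    """Normalize codes such as "en-US", "en_us", or " EN-us " to "en-us"."""
    return code.strip().replace("_", "-").lower()


def language_result(detected: str, filename_code: str) -> dict:
    """Compare a detected language with the filename code, handling CEN/GEN markers."""
    marker = filename_code.strip().upper()
    if marker in ("CEN", "GEN"):
        # No language code in the filename to compare against (assumed behavior)
        return {
            "detected_language": detected,
            "matches": None,
            "isCensorshipRequired": marker == "CEN",
        }
    return {
        "detected_language": detected,
        "matches": normalize_code(detected) == normalize_code(filename_code),
        "isCensorshipRequired": False,
    }
```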

HM_price_currency_check

Purpose: Validates currency matches regional expectations

Context Input:

  • HM_parse.extracted_text
  • HM_parse.parsed_image
  • HM_filename_parse.parsed.language
  • HM_language_validate.isCensorshipRequired

Logic:

  1. Detects currency and price using multimodal GPT analysis
  2. Validates currency against region (e.g., USD for en-us)
  3. Skips for CEN/GEN markets
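
Steps 2 and 3 can be sketched as a lookup from language code to expected currency. Only the en-us to USD pairing comes from this document; the other entries in the table, and the `currency_status` helper itself, are illustrative assumptions. Currency detection (step 1) is not shown.

```python
# Illustrative mapping; only "en-us" -> "USD" is taken from the documentation
EXPECTED_CURRENCY = {
    "en-us": "USD",
    "en-gb": "GBP",
    "de-de": "EUR",
}


def currency_status(language_code: str, detected_currency: str) -> dict:
    """Validate a detected currency against the region implied by the language code."""
    code = language_code.strip().lower()
    if code in ("cen", "gen"):
        return {"status": "skipped", "details": {"reason": "CEN/GEN market"}}
    expected = EXPECTED_CURRENCY.get(code)
    if expected is None:
        return {"status": "error", "error_message": f"no expected currency configured for {code}"}
    details = {"expected": expected, "detected": detected_currency.upper()}
    if details["detected"] == expected:
        return {"status": "passed", "details": details}
    return {
        "status": "error",
        "details": details,
        "error_message": f"expected {expected}, found {details['detected']}",
    }
```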

HM_censorship

Purpose: Verifies proper clothing coverage for censored markets

Context Input:

  • HM_parse.parsed_image
  • HM_filename_parse.parsed.language
  • HM_language_validate.isCensorshipRequired

Process:

  1. Builds a DSPy-based classifier, with prompts optimized by MIPROv2 against the labeled images in /opt/QC/supporting/censorship_trainset/
  2. Analyzes document images for body coverage
  3. Validates against censorship requirements

Training Set Naming:

  • Censored images: *-C.png
  • Uncensored images: *-U.png
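
Under this naming convention, labels can be derived from filenames alone. The sketch below shows one way to collect labeled examples; `load_labeled_images` is a hypothetical helper and does not reproduce how HM_censorship.py builds its DSPy training set.

```python
from pathlib import Path


def load_labeled_images(trainset_dir: str) -> list:
    """Return (path, label) pairs for PNG files named *-C.png or *-U.png."""
    labeled = []
    for path in sorted(Path(trainset_dir).glob("*.png")):
        if path.stem.endswith("-C"):
            labeled.append((path, "censored"))
        elif path.stem.endswith("-U"):
            labeled.append((path, "uncensored"))
        # Files that follow neither convention are ignored
    return labeled
```

Ignoring unlabeled files silently is a design choice for this sketch; logging them may be preferable when diagnosing a poorly performing classifier.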

Technology: Uses DSPy with MIPROv2 optimization


API Integrations

1. LlamaParse

Purpose: PDF parsing and image extraction

Usage:

  • Text extraction: result_type="text"
  • Image generation: result_type="markdown" with multimodal model

Key in Code: checks/HM_parse.py:14


2. OpenAI GPT-4o

Purpose: Content analysis, language detection, validation

Usage:

  • Synchronous client in analyze_with_gpt.py
  • Multimodal support (text + images)
  • JSON and text response modes

Models Used:

  • gpt-4o: Primary analysis model
  • gpt-4o-mini: Used in DSPy for censorship detection

Key in Code: checks/analyze_with_gpt.py:13


3. DSPy

Purpose: AI-powered censorship detection

Components:

  • ImageDescription: Generates clothing coverage descriptions
  • ImageCensorshipDetection: Classifies censorship status
  • MIPROv2: Optimizes prompts based on training set

Key in Code: checks/HM_censorship.py:11


4. Box Python SDK

Purpose: Cloud storage integration for automated workflows

Authentication: JWT-based service account

Configuration File: ford_box_config.json


Security Considerations

CRITICAL ISSUES IDENTIFIED

⚠️ HARDCODED API KEYS FOUND ⚠️

The following files contain hardcoded API keys that should be immediately addressed:

  1. checks/HM_parse.py:14

    os.environ['LLAMA_CLOUD_API_KEY'] = '<redacted>'
    
  2. checks/analyze_with_gpt.py:13

    client = OpenAI(api_key="<redacted>")
    
  3. checks/HM_censorship.py:11

    os.environ["OPENAI_API_KEY"] = "<redacted>"
    
Recommended Actions

  1. Use Environment Variables:

    # Replace hardcoded keys with:
    import os
    OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
    LLAMA_CLOUD_API_KEY = os.getenv('LLAMA_CLOUD_API_KEY')
    
  2. Use .env Files (with python-dotenv):

    import os
    from dotenv import load_dotenv

    load_dotenv()  # loads key=value pairs from a local .env file into the environment

    OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
    
  3. Secret Management Systems:

    • AWS Secrets Manager
    • HashiCorp Vault
    • Azure Key Vault
  4. Rotate Exposed Keys:

    • All hardcoded keys should be rotated immediately
    • Monitor API usage for unauthorized access
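
When moving keys into the environment, it helps to fail fast if a variable is missing, since `os.getenv` returns `None` silently. The helper below is a minimal sketch; `require_env` is a hypothetical name, not an existing function in this codebase.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

A check module could then call `require_env("OPENAI_API_KEY")` at startup, so a missing key surfaces as a clear configuration error and not as an authentication failure deep inside an API call.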

Additional Security Considerations

  • File System Access: The system writes to /tmp/HM_working and /opt/QC/reports/ - ensure proper permissions
  • Box Authentication: JWT config contains sensitive credentials - protect ford_box_config.json
  • Working Directory Cleanup: Temporary files contain potentially sensitive document content
  • Context Snapshots: Be cautious about logging context data (disabled by default in qc_module.py:117)
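
For the working directory cleanup point, one option is to give each run its own temporary directory that is deleted automatically. This is a sketch of that approach, assuming a per-run directory is acceptable; the current system uses the fixed path /tmp/HM_working, and `working_dir` is a hypothetical helper.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def working_dir(prefix: str = "HM_working_"):
    """Yield a fresh temporary directory and delete it, with its contents, on exit."""
    with tempfile.TemporaryDirectory(prefix=prefix) as tmp:
        yield Path(tmp)


with working_dir() as wd:
    (wd / "page_1.txt").write_text("extracted document text")
    existed_during_run = wd.exists()
removed_after_run = not wd.exists()
```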

Development

Adding a New Check Module

  1. Create check file in checks/ directory:

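    # checks/my_new_check.py
    
    def validate_something(input_data: dict) -> bool:
        # Placeholder: replace with the real validation logic
        return bool(input_data)
    
    def run_check(config: dict, context: dict, check_id: str) -> dict:
        # 1. Get required data from context ("previous_check" is the id of an upstream check)
        input_data = context.get("previous_check", {})
    
        # 2. Perform validation logic
        result = validate_something(input_data)
    
        # 3. Store results in context so downstream checks can read them
        context[check_id] = {
            "validation_result": result
        }
    
        # 4. Return status in the standard result format
        details = {"validation_result": result}
        if result:
            return {"status": "passed", "details": details}
        return {
            "status": "error",
            "details": details,
            "error_message": "Validation failed",
        }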
    
  2. Add to profile JSON:

    {
        "id": "my_new_check",
        "script": "checks.my_new_check",
        "config": {
            "description": "Validates something important",
            "param1": "value1"
        }
    }
    

Testing

Manual Testing:

python launchers/HM_launcher_CLI.py test_file.pdf ./test_reports/

Context Debugging:

Uncomment line 117 in qc_module.py to enable context snapshots:

aggregated_results["context_snapshot"] = context

Troubleshooting

Common Issues

1. "Module not found" errors

  • Ensure /opt/QC is in your Python path
  • Check sys.path.append() statements in launchers
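
As an alternative to a hardcoded `sys.path.append("/opt/QC")`, the project root can be derived from the launcher's own location. The sketch below assumes launchers live one directory below the project root, as in the project structure shown earlier; `add_project_root` is a hypothetical helper.

```python
import sys
from pathlib import Path


def add_project_root(launcher_file: str) -> str:
    """Put the directory above launchers/ on sys.path so `checks.*` modules import."""
    root = str(Path(launcher_file).resolve().parent.parent)
    if root not in sys.path:
        sys.path.insert(0, root)
    return root
```

Inside a launcher this would be called as `add_project_root(__file__)` before importing `qc_module`.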

2. "Working directory not found"

  • Create /tmp/HM_working manually: mkdir -p /tmp/HM_working
  • Verify permissions on working directories

3. API authentication failures

  • Verify API keys are valid and not expired
  • Check API usage limits/quotas
  • Ensure proper environment variable configuration

4. Box integration issues

  • Validate JWT config file path
  • Ensure service account has folder access
  • Check Box folder IDs are correct

5. PDF parsing failures

  • Verify LlamaParse API key and quota
  • Check PDF is not corrupted or password-protected
  • Ensure sufficient disk space in working directory

6. Censorship check failures

  • Verify training set images exist in /opt/QC/supporting/censorship_trainset/
  • Ensure images follow naming convention (*-C.png for censored, *-U.png for uncensored)
  • Check DSPy model configuration

Logging

CLI Launcher: Uses Python logging module at INFO level

Box Hotfolder: Logs to log/ford_qc_script.log

Increasing Verbosity:

logging.basicConfig(level=logging.DEBUG)

Dependencies

Core Dependencies

  • Python 3.8+
  • llama-parse: PDF parsing and image extraction
  • openai: GPT-4o API integration
  • dspy: Framework for building and optimizing language-model pipelines (used for censorship detection)
  • PIL/Pillow: Image processing
  • boxsdk: Box cloud storage integration
  • nest_asyncio: Async event loop management

Full Dependency List

See requirements.txt for complete list including:

  • fastapi, uvicorn, httpx (web framework, ASGI server, and HTTP client)
  • pandas, numpy (data processing)
  • nltk, tiktoken (text processing)
  • APScheduler (task scheduling)
  • datasets, optuna (ML/AI components)

System Requirements

Hardware

  • CPU: Multi-core recommended for DSPy optimization
  • RAM: 4GB minimum, 8GB+ recommended
  • Storage: 500MB + space for temporary files

Operating System

  • Linux (Ubuntu 20.04+)
  • macOS (tested on Darwin 24.5.0)
  • Windows (with path adjustments)

Python Version

  • Python 3.8 - 3.11 (tested)
  • Not tested with Python 3.12+

Roadmap / Known Limitations

Current Limitations

  1. Hardcoded file paths (/opt/QC, /tmp/HM_working)
  2. Hardcoded API keys in source code
  3. Limited error recovery in check chains
  4. No parallel check execution
  5. Box hotfolder requires manual startup

Future Improvements

  • Configurable paths via environment variables
  • Parallel check execution where possible
  • Retry logic for API failures
  • Real-time progress reporting
  • Dashboard for QC history
  • Support for additional file formats
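
The planned retry logic for API failures could take the form of exponential backoff. The following is a sketch of that general technique, not an existing part of the system; `with_retries` and its parameters are hypothetical.

```python
import time


def with_retries(func, attempts: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Call func(), retrying on any exception with delays of base_delay * 2**n seconds."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter exists so tests can substitute a recorder for real waiting. In practice, retrying only on transient errors (timeouts, rate limits) is safer than retrying on every exception.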

License

[Specify license here]


Contact / Support

For issues or questions:

  • Check existing documentation in CLAUDE.md
  • Review code comments in individual modules
  • Contact development team

Appendix: File Reference

Critical Files

File                                       | Purpose                   | Contains Sensitive Data
qc_module.py                               | Core orchestration engine | No
checks/analyze_with_gpt.py                 | OpenAI integration        | ⚠️ API Key
checks/HM_parse.py                         | PDF parsing               | ⚠️ API Key
checks/HM_censorship.py                    | Censorship detection      | ⚠️ API Key
launchers/ford_qc_box_hotfolder_process.py | Box integration           | ⚠️ Box Config Path
profiles/HM.json                           | H&M check configuration   | No
ford_box_config.json                       | Box JWT credentials       | ⚠️ Yes

Last Updated: 2025-09-30
Version: 1.0
Maintainer: H&M QC Team