H&M Quality Control (HMQC) System

Overview

The H&M Quality Control (HMQC) system is a modular Python application designed to perform automated quality control checks on both PDF documents and static images (JPG, PNG, PSD) for H&M marketing assets. The system validates assets against multiple criteria including filename formatting, image dimensions, imprint verification, language validation, pricing accuracy, and censorship requirements.

Quick Links

Development Setup Guide - Setting up local dev environment
Project Instructions - Development guidelines and architecture
Check Modules - Detailed check documentation

Table of Contents

Architecture
Project Structure
Key Components
Installation
Environment Configuration
Configuration
Usage
Check Modules
- PDF Check Modules
- Image Check Modules
API Integrations
Security Considerations
Development
Troubleshooting

Architecture

The HMQC system follows a modular, profile-based architecture:

┌─────────────┐
│  Launcher   │ (CLI or Box Hotfolder)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  QC Module  │ (Core Engine)
└──────┬──────┘
       │
       ├──► Profile JSON (defines checks)
       │
       ├──► Check Modules (modular validation)
       │       ├─ HM_parse
       │       ├─ HM_filename_parse
       │       ├─ HM_imprint_check
       │       ├─ HM_language_validate
       │       ├─ HM_price_currency_check
       │       └─ HM_censorship
       │
       └──► HTML Reporter (generates reports)

Core Principles

Modularity: Each check is an independent module implementing a standard interface
Context Sharing: Checks share data via a context dictionary for inter-check dependencies
Profile-Based Configuration: JSON profiles define which checks to run and their parameters
Standardized Results: All checks return consistent status objects (passed, error, skipped)
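
To illustrate the standardized result shape, a launcher or test could validate what a check returns before reporting it. This is a minimal sketch: `validate_result` and `ALLOWED_STATUSES` are hypothetical names, not part of the repository, and the rule that error results need an `error_message` is inferred from the interface described in this README.

```python
ALLOWED_STATUSES = {"passed", "error", "skipped"}


def validate_result(result: dict) -> list:
    """Return a list of problems found in a check result (empty if well-formed)."""
    if not isinstance(result, dict):
        return ["result is not a dict"]
    problems = []
    status = result.get("status")
    if status not in ALLOWED_STATUSES:
        problems.append(f"unknown status: {status!r}")
    # Assumed rule: an error result should explain itself.
    if status == "error" and not result.get("error_message"):
        problems.append("error results should carry an error_message")
    return problems
```

Returning a list of problems rather than raising lets a caller report every issue in one pass.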

Project Structure

hm_qc/
├── qc_module.py              # Core QC engine
├── config.py                 # Environment configuration (NEW)
├── requirements.txt          # Python dependencies
├── README.md                 # This file
├── CLAUDE.md                 # Project instructions for Claude
├── DEV_SETUP.md              # Development environment guide (NEW)
│
├── launchers/                # Execution entry points
│   ├── HM_launcher_CLI.py    # Command-line interface (environment-aware)
│   └── ford_qc_box_hotfolder_process.py  # Box integration
│
├── checks/                   # QC check modules
│   ├── HM_parse.py           # PDF parsing with LlamaParse
│   ├── HM_filename_parse.py  # PDF filename validation
│   ├── HM_imprint_check.py   # Imprint verification (includes country code)
│   ├── HM_image_parse.py     # Image loading and processing (NEW)
│   ├── HM_image_filename_parse.py  # Image filename parsing (NEW)
│   ├── HM_image_dimension_check.py # Image dimension validation (NEW)
│   ├── HM_language_validate.py     # Language detection
│   ├── HM_price_currency_check.py  # Currency validation
│   ├── HM_censorship.py      # Censorship compliance - CEN only (UPDATED)
│   ├── analyze_with_gpt.py   # OpenAI GPT integration
│   ├── html_reporter.py      # HTML report generation (environment-aware)
│   ├── business_data_check.py
│   ├── colour_existence_check.py
│   ├── file_size_check.py
│   ├── image_linking_check.py
│   ├── image_resolution_check.py
│   ├── missing_images_check.py
│   ├── special_requirements_mec_bau.py
│   └── unzip_and_verify_check.py
│
├── profiles/                 # Check configurations
│   ├── HM.json               # H&M PDF profile
│   ├── HM_image.json         # H&M image profile (NEW)
│   └── ford_bnp.json         # Ford BNP profile
│
├── supporting/               # Supporting assets
│   ├── censorship_trainset/  # Training images for censorship detection
│   └── HM_Pricing SLUSSEN_30-07-2024.pdf
│
├── tmp/                      # Development temporary files (gitignored)
│   ├── HM_working/           # Working directory (dev mode)
│   └── reports/              # Generated reports (dev mode)
│
└── input_bucket/             # Input file staging area

Key Components

1. QC Module (`qc_module.py`)

The core engine that orchestrates check execution.

Key Functions:

run_qc_profile(profile_path, input_file): Executes all checks defined in a profile
run_single_check(script, config, context, check_id): Runs an individual check module
run_qc_checks(profile_path, input_file, report_path): Full workflow including report generation

Context Management: Results from each check are stored in a shared context dictionary, enabling downstream checks to access data from upstream checks.
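
The orchestration and context sharing described above can be sketched roughly as follows. This is not the code in `qc_module.py`: `run_profile`, the `demo_*` modules, and the `input_file` context key are hypothetical, and catching exceptions per check is a choice made for this sketch only.

```python
import importlib
import sys
import types


def run_profile(profile: list, input_file: str) -> dict:
    """Run each profile entry in order, sharing one context dict between checks."""
    context = {"input_file": input_file}  # assumed key name, for illustration only
    results = {}
    for entry in profile:
        check_id = entry["id"]
        try:
            module = importlib.import_module(entry["script"])
            results[check_id] = module.run_check(entry.get("config", {}), context, check_id)
        except Exception as exc:  # sketch choice: record the failure and continue
            results[check_id] = {"status": "error", "error_message": str(exc)}
    return results


# Demo: register two throwaway in-memory modules instead of real files in checks/.
def _make_module(name, func):
    module = types.ModuleType(name)
    module.run_check = func
    sys.modules[name] = module


def _parse(config, context, check_id):
    context[check_id] = {"extracted_text": "hello"}  # upstream check publishes data
    return {"status": "passed", "details": {}}


def _length(config, context, check_id):
    text = context["demo_parse"]["extracted_text"]  # downstream check reads it
    return {"status": "passed", "details": {"length": len(text)}}


_make_module("demo_parse_mod", _parse)
_make_module("demo_length_mod", _length)

demo_profile = [
    {"id": "demo_parse", "script": "demo_parse_mod", "config": {}},
    {"id": "demo_length", "script": "demo_length_mod", "config": {}},
]
demo_results = run_profile(demo_profile, "input.pdf")
```

The second demo check only works because the first one ran earlier and wrote to the context, which is why the order of entries in a profile matters.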

2. Launchers

CLI Launcher (`HM_launcher_CLI.py`)

Command-line interface for manual QC execution.

Usage:

python launchers/HM_launcher_CLI.py <path_to_input_file> <path_to_save_report_html>

Box Hotfolder Launcher (`ford_qc_box_hotfolder_process.py`)

Automated processing of files uploaded to Box folders.

Features:

Monitors Box folder for new .zip files
Downloads, processes, and deletes files
Uploads QC reports back to Box
File locking to prevent concurrent runs
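
One common way to prevent concurrent runs is an exclusive lock file. The sketch below shows that general technique, not the launcher's actual mechanism, which this README does not describe. `single_instance` and `LOCK_PATH` are hypothetical names.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path: str):
    """Hold an exclusive lock file for the duration of the block."""
    # O_EXCL makes creation fail with FileExistsError if the file already
    # exists, which is taken to mean that another run is active.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


# Hypothetical lock location; a real launcher may use a different path.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "hmqc_hotfolder_demo.lock")
```

A second process that enters `with single_instance(LOCK_PATH):` while the first is still running gets `FileExistsError` and can exit quietly. A lock file left behind by a crashed process has to be removed manually with this approach.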

3. Check Module Pattern

All check modules implement a standard interface:

def run_check(config: dict, context: dict, check_id: str) -> dict:
    """
    Args:
        config: Configuration parameters from profile JSON
        context: Shared context dictionary for inter-check data
        check_id: Unique identifier for this check

    Returns:
        {
            "status": "passed" | "error" | "skipped",
            "details": {...},
            "error_message": "..." (if status == "error")
        }
    """

4. HTML Reporter (`html_reporter.py`)

Generates Bootstrap-based HTML reports with:

Collapsible accordion for each check
Color-coded status badges
Nested detail formatting
Error highlighting

Installation

Prerequisites

Python 3.8+
pip package manager
OpenAI API key
LlamaParse API key

Setup

Clone the repository, then change into the deployment directory:
```
cd /path/to/deployment
```
Install dependencies:
```
pip install -r requirements.txt
```
Configure API keys (see Security Considerations):
- OpenAI API key for GPT analysis
- LlamaParse API key for PDF parsing

Set up paths:

Update any remaining hardcoded paths in the launcher script and core module:

# HM_launcher_CLI.py
PROFILE_PATH = "/opt/QC/profiles/HM.json"

# qc_module.py
REPORTS_DIR = os.path.join(os.path.dirname(__file__), "reports")

Create working directories:

mkdir -p /tmp/HM_working
mkdir -p /opt/QC/reports

Environment Configuration

The HMQC system supports two environments to enable safe local development without affecting production:

Development Environment

For local testing and development:

# Set environment variable
export HM_QC_ENV=dev

# Run checks (uses local paths in ./tmp/)
python launchers/HM_launcher_CLI.py test.pdf ./tmp/reports/report.html

Development paths:

Working directory: ./tmp/HM_working/
Reports: ./tmp/reports/
Supporting files: ./supporting/

Production Environment

For deployed production systems (default):

# Unset or use production mode
unset HM_QC_ENV
# or
export HM_QC_ENV=production

# Run checks (uses /opt/QC paths)
python launchers/HM_launcher_CLI.py input.pdf /opt/QC/reports/

Production paths:

Working directory: /tmp/HM_working/
Reports: /opt/QC/reports/
Supporting files: /opt/QC/supporting/

Configuration Module

The config.py module automatically handles environment detection:

import config

# Check current environment
print(config.ENVIRONMENT)  # 'dev' or 'production'

# Get paths
print(config.WORKING_DIR)
print(config.REPORTS_DIR)

See DEV_SETUP.md for complete development environment guide.
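
For orientation, environment detection of this kind can be sketched as below. This is an assumption about how such a module might work, not the contents of `config.py`: `load_settings`, `_PATHS`, and `SUPPORTING_DIR` are hypothetical names, and rejecting unknown values is a choice made for the sketch. The paths are the ones listed for each environment above.

```python
import os

# Path sets copied from the two environment descriptions above.
_PATHS = {
    "dev": {
        "WORKING_DIR": "./tmp/HM_working/",
        "REPORTS_DIR": "./tmp/reports/",
        "SUPPORTING_DIR": "./supporting/",
    },
    "production": {
        "WORKING_DIR": "/tmp/HM_working/",
        "REPORTS_DIR": "/opt/QC/reports/",
        "SUPPORTING_DIR": "/opt/QC/supporting/",
    },
}


def load_settings(environ=None) -> dict:
    """Pick the path set for HM_QC_ENV, defaulting to production when unset."""
    environ = os.environ if environ is None else environ
    env = environ.get("HM_QC_ENV", "production").strip().lower()
    if env not in _PATHS:
        raise ValueError(f"Unsupported HM_QC_ENV value: {env!r}")
    return {"ENVIRONMENT": env, **_PATHS[env]}
```

Passing the environment mapping as a parameter makes the selection logic testable without touching the real process environment.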

Configuration

Profile JSON Structure

Profiles define the sequence of checks and their parameters:

[
  {
    "id": "HM_parse",
    "script": "checks.HM_parse",
    "config": {
      "description": "Parses PDF with LlamaParse",
      "input_file": "supplied by launcher script",
      "working_dir": "/tmp/HM_working"
    }
  },
  {
    "id": "HM_filename_parse",
    "script": "checks.HM_filename_parse",
    "config": {
      "description": "Parses filename into components",
      "working_dir": "/tmp/HM_working"
    }
  }
]

Key Fields:

id: Unique check identifier (used for context storage)
script: Python module path (e.g., checks.HM_parse)
config: Check-specific parameters
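
A profile can be sanity-checked before any check runs. The sketch below assumes only the three key fields listed above. `load_profile` and `REQUIRED_FIELDS` are hypothetical names, and the unique-id rule is inferred from the statement that `id` is used for context storage.

```python
import json

REQUIRED_FIELDS = ("id", "script", "config")


def load_profile(text: str) -> list:
    """Parse profile JSON and check that every entry has the three key fields."""
    profile = json.loads(text)
    if not isinstance(profile, list):
        raise ValueError("profile must be a JSON array of check entries")
    seen_ids = set()
    for index, entry in enumerate(profile):
        missing = [field for field in REQUIRED_FIELDS if field not in entry]
        if missing:
            raise ValueError(f"entry {index} is missing fields: {missing}")
        if entry["id"] in seen_ids:
            # Two entries with the same id would overwrite each other's context data.
            raise ValueError(f"duplicate check id: {entry['id']}")
        seen_ids.add(entry["id"])
    return profile


example = '[{"id": "HM_parse", "script": "checks.HM_parse", "config": {}}]'
checks = load_profile(example)
```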

H&M Filename Format

The system expects H&M filenames to follow these patterns:

Pattern 1: dimensions_format_year_reference-number_language-country.pdf

Example: 21.6x27.9cm_letter_2028_10062-01_en-us.pdf

Pattern 2: dimensions_format_reference-number_(GEN|CEN).pdf

Example: 10.8x14cm_quarter_letter_1001D_10004-02_GEN.pdf
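
The two patterns can be approximated with regular expressions, which is useful for quick local validation. Note that this swaps in a different technique: per this README, `HM_filename_parse` extracts the components with GPT. The expressions are inferred from the two examples only, and treating `1001D_10004-02` as the reference in Pattern 2 is an assumption. `parse_pdf_filename` is a hypothetical helper.

```python
import re

_DIM = r"(?P<dimensions>\d+(?:\.\d+)?x\d+(?:\.\d+)?cm)"

# Pattern 1: dimensions_format_year_reference-number_language-country.pdf
PATTERN_1 = re.compile(
    "^" + _DIM + r"_(?P<format>[A-Za-z_]+?)_(?P<year>\d{4})"
    r"_(?P<reference>[\w-]+)_(?P<language>[A-Za-z]{2}-[A-Za-z]{2})\.pdf$"
)

# Pattern 2: dimensions_format_reference-number_(GEN|CEN).pdf
PATTERN_2 = re.compile(
    "^" + _DIM + r"_(?P<format>[A-Za-z_]+?)_(?P<reference>\d[\w-]*)"
    r"_(?P<market>GEN|CEN)\.pdf$"
)


def parse_pdf_filename(name: str):
    """Return the named components of a filename, or None if no pattern matches."""
    for pattern in (PATTERN_1, PATTERN_2):
        match = pattern.match(name)
        if match:
            return match.groupdict()
    return None
```

Filenames that match neither expression return `None`, which a caller could map to an `error` status.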

Usage

Basic CLI Usage

python launchers/HM_launcher_CLI.py input.pdf /path/to/reports/

This will:

Execute all checks defined in /opt/QC/profiles/HM.json
Generate an HTML report in the specified directory
Print JSON results to stdout

Box Hotfolder Integration

python launchers/ford_qc_box_hotfolder_process.py

Configuration:

Modify BOX_CLI_CONFIG_PATH to point to your Box JWT config
Set SOURCE_FOLDER_ID and REPORT_FOLDER_ID
Update PROFILE_PATH to your desired profile

Workflow:

Script polls Box source folder for .zip files
Downloads and processes each file
Generates QC report
Uploads report to Box reports folder
Deletes processed files

Check Modules

The HMQC system includes separate check profiles for PDF documents and static images.

PDF Check Modules

HM_parse

Purpose: Parses PDF using LlamaParse API

Context Output:

extracted_text: Full text content
parsed_image: First page as PIL Image
all_images: List of all page images

Dependencies: LlamaParse API

HM_filename_parse

Purpose: Extracts components from H&M PDF filename using GPT

Context Input: HM_parse.filename

Context Output:

{
    "parsed": {
        "dimensions": "21.6x27.9cm",
        "format": "letter",
        "year": "2028",
        "reference": "9000_10107-06",  # Full reference with prefix
        "language": "en-us"
    }
}

Updated: Now correctly extracts full reference codes including numeric prefixes (e.g., 9000_10107-06)

HM_imprint_check

Purpose: Verifies imprint/reference code matches filename including country code

Context Input:

HM_parse.extracted_text
HM_filename_parse.parsed.reference
HM_filename_parse.parsed.language

Logic:

Uses GPT to detect imprint in document
Combines reference code with language/country code (e.g., 9000_10107-06_el-CY)
Compares against detected imprint
Skips for "OOH" (out-of-home) files
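
The comparison step can be sketched as below, leaving out the GPT-based imprint detection and the OOH skip. `expected_imprint` and `imprint_matches` are hypothetical helpers. Upper-casing the country part is an assumption based on the `el-CY` example, and the real check may compare more leniently.

```python
def expected_imprint(reference: str, language: str) -> str:
    """Join the reference and language code, upper-casing the country part."""
    lang, _, country = language.partition("-")
    code = f"{lang.lower()}-{country.upper()}" if country else lang
    return f"{reference}_{code}"


def imprint_matches(detected: str, reference: str, language: str) -> bool:
    # Exact comparison after trimming whitespace, so CY vs GR is a mismatch.
    return detected.strip() == expected_imprint(reference, language)
```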

Updated: Now includes country code in validation to detect mismatches like CY vs GR

Image Check Modules

HM_image_parse

Purpose: Loads and processes static image files (JPG, PNG, PSD)

Context Output:

parsed_image: PIL Image object
filename: Original filename
image_format: Format (JPEG, PNG, PSD)
image_size: (width, height) in pixels
image_mode: Color mode (RGB, RGBA, etc.)

HM_image_filename_parse

Purpose: Parses image filename using pattern matching

Context Input: HM_image_parse.filename

Context Output:

{
    "parsed": {
        "language": "en-US",
        "format_type": "DOOH",
        "campaign_number": "4045",
        "dimensions": "1080x1920"
    }
}

Supported Formats:

SOME STATIC: Market_Language_campaignnumber_...
DOOH/OOH/Display: CampaignNumber_Type_Static_..._Size_Language-Market
POS GEN: Size_Format_Campaign_POPNumber_GEN
POS Country: Size_Format_Campaign_POPNumber_Language-Market
DS: Campaign_Name_Index_BU_Resolution_language-COUNTRY

HM_image_dimension_check (NEW)

Purpose: Validates actual image dimensions match filename specification

Context Input:

HM_image_parse.parsed_image
HM_image_filename_parse.parsed.dimensions

Logic:

Extracts expected dimensions from filename (e.g., 1200x400)
Gets actual image dimensions from PIL Image object
Compares width × height in pixels

Example:

Filename: campaign_1200x400_en-US.jpg
Expected: 1200px × 400px
Actual: 1200px × 400px → ✅ PASS
Actual: 1500px × 600px → ❌ ERROR
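
The comparison in this example can be sketched as follows. `check_dimensions` is a hypothetical helper, not the module's code. It takes the actual size as a `(width, height)` tuple, which is what a Pillow image exposes as `image.size`, so the sketch itself needs no imaging library.

```python
import re


def check_dimensions(expected: str, actual_size: tuple) -> dict:
    """Compare a 'WIDTHxHEIGHT' string with an actual (width, height) in pixels."""
    match = re.fullmatch(r"(\d+)x(\d+)", expected.strip().lower())
    if not match:
        return {"status": "error", "error_message": f"Unparseable dimensions: {expected!r}"}
    expected_size = (int(match.group(1)), int(match.group(2)))
    actual = tuple(actual_size)
    if expected_size == actual:
        return {"status": "passed", "details": {"expected": expected_size, "actual": actual}}
    return {
        "status": "error",
        "error_message": (
            f"Expected {expected_size[0]}x{expected_size[1]}, "
            f"got {actual[0]}x{actual[1]}"
        ),
    }
```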

Shared Check Modules

These checks work for both PDFs and images:

HM_language_validate

Purpose: Validates document/image language matches filename code

Context Input:

HM_parse.extracted_text OR HM_image_parse.parsed_image
HM_filename_parse.parsed.language OR HM_image_filename_parse.parsed.language

Context Output:

detected_language: ISO language code (e.g., "en-US")
matches: Boolean
isCensorshipRequired: True for CEN markets

Special Cases:

CEN: Censored market (requires body coverage checks)
GEN: General market (no censorship required)

HM_price_currency_check

Purpose: Validates currency matches regional expectations

Context Input:

Content (text or image)
Language/market code

Logic:

Detects currency and price using multimodal GPT analysis
Validates currency against region (e.g., USD for en-us)
Skips for CEN/GEN markets
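
The region rule and the CEN/GEN skip can be sketched as below, with currency detection left out, since the module does that through multimodal GPT analysis. `EXPECTED_CURRENCY` is an illustrative subset, not the mapping used by the module, and `check_currency` is a hypothetical helper. Only the `en-us` to USD pairing comes from this README.

```python
# Illustrative subset only; the real check may use a different or larger mapping.
EXPECTED_CURRENCY = {"en-us": "USD", "en-gb": "GBP", "de-de": "EUR", "sv-se": "SEK"}


def check_currency(market_code: str, detected_currency: str) -> dict:
    """Validate a detected currency code against the market's expected currency."""
    code = market_code.strip()
    if code.upper() in ("CEN", "GEN"):
        return {"status": "skipped", "details": {"reason": f"{code.upper()} market"}}
    expected = EXPECTED_CURRENCY.get(code.lower())
    if expected is None:
        # Sketch choice: unknown markets are skipped rather than failed.
        return {"status": "skipped", "details": {"reason": f"no currency rule for {code}"}}
    if detected_currency.strip().upper() == expected:
        return {"status": "passed", "details": {"currency": expected}}
    return {"status": "error", "error_message": f"Expected {expected}, found {detected_currency}"}
```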

HM_censorship

Purpose: Verifies proper clothing coverage for CEN (censored) markets ONLY

Context Input:

parsed_image (from HM_parse or HM_image_parse)
parsed.language (from filename parse)

Process:

SKIPS for GEN files - GEN assets do not require censorship checks
SKIPS for standard market files - Only CEN files are checked
RUNS for CEN files only - Validates body coverage using AI
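
These three rules amount to a simple gate in front of the classifier. A minimal sketch, assuming the market code has already been parsed from the filename (`censorship_gate` is a hypothetical helper):

```python
def censorship_gate(market_code: str):
    """Return a 'skipped' result for non-CEN files, or None when the check should run."""
    code = market_code.strip().upper()
    if code == "CEN":
        return None  # caller proceeds to the body coverage analysis
    reason = "GEN asset" if code == "GEN" else "standard market file"
    return {"status": "skipped", "details": {"reason": reason}}
```

Returning `None` for the "run" case lets the caller write `skip = censorship_gate(code)` followed by `if skip: return skip`.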

Algorithm:

Trains DSPy-based classifier using training images
Analyzes document images for body coverage
Validates against censorship requirements

Training Set: /supporting/censorship_trainset/

Censored images: *-C.png
Uncensored images: *-U.png
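
Given this naming convention, labels for the training images can be derived from the filenames alone. The sketch below shows only that labelling step, under the assumption that the suffix is the only label source. `label_training_images` is a hypothetical helper, and how the module actually feeds examples to DSPy is not shown in this README.

```python
from pathlib import Path


def label_training_images(folder: str) -> list:
    """Return (path, is_censored) pairs based on the -C / -U filename suffix."""
    labelled = []
    for path in sorted(Path(folder).glob("*.png")):
        if path.stem.endswith("-C"):
            labelled.append((path, True))
        elif path.stem.endswith("-U"):
            labelled.append((path, False))
        # Files with neither suffix are ignored in this sketch.
    return labelled
```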

Technology: Uses DSPy with MIPROv2 optimization

Updated: Now correctly only runs checks on CEN files, not GEN files

API Integrations

1. LlamaParse

Purpose: PDF parsing and image extraction

Usage:

Text extraction: result_type="text"
Image generation: result_type="markdown" with multimodal model

API key location in code: checks/HM_parse.py:14

2. OpenAI GPT-4o

Purpose: Content analysis, language detection, validation

Usage:

Synchronous client in analyze_with_gpt.py
Multimodal support (text + images)
JSON and text response modes

Models Used:

gpt-4o: Primary analysis model
gpt-4o-mini: Used in DSPy for censorship detection

API key location in code: checks/analyze_with_gpt.py:13

3. DSPy

Purpose: AI-powered censorship detection

Components:

ImageDescription: Generates clothing coverage descriptions
ImageCensorshipDetection: Classifies censorship status
MIPROv2: Optimizes prompts based on training set

API key location in code: checks/HM_censorship.py:11

4. Box Python SDK

Purpose: Cloud storage integration for automated workflows

Authentication: JWT-based service account

Configuration File: ford_box_config.json

Security Considerations

CRITICAL ISSUES IDENTIFIED

⚠️ HARDCODED API KEYS FOUND ⚠️

The following files contain hardcoded API keys that should be immediately addressed:

checks/HM_parse.py:14

os.environ['LLAMA_CLOUD_API_KEY'] = '<REDACTED>'

checks/analyze_with_gpt.py:13

client = OpenAI(api_key="<REDACTED>")

checks/HM_censorship.py:11

os.environ["OPENAI_API_KEY"] = "<REDACTED>"

Recommended Security Improvements

Use Environment Variables:

# Replace hardcoded keys with:
import os
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
LLAMA_CLOUD_API_KEY = os.getenv('LLAMA_CLOUD_API_KEY')

Use .env Files (with python-dotenv):

import os

from dotenv import load_dotenv

# Load variables from a local .env file into the process environment
load_dotenv()

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

Secret Management Systems:
- AWS Secrets Manager
- HashiCorp Vault
- Azure Key Vault
Rotate Exposed Keys:
- All hardcoded keys should be rotated immediately
- Monitor API usage for unauthorized access
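
Once keys come from the environment, failing fast when one is missing gives a clearer error than an authentication failure deep inside a check. A minimal sketch, with `require_env` as a hypothetical helper:

```python
import os


def require_env(name: str, environ=None) -> str:
    """Return the named environment variable or fail with a clear message."""
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the QC checks")
    return value
```

A module would then call `require_env("OPENAI_API_KEY")` at startup instead of assigning a literal key.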

Additional Security Considerations

File System Access: The system writes to /tmp/HM_working and /opt/QC/reports/ - ensure proper permissions
Box Authentication: JWT config contains sensitive credentials - protect ford_box_config.json
Working Directory Cleanup: Temporary files contain potentially sensitive document content
Context Snapshots: Be cautious about logging context data (disabled by default in qc_module.py:117)

Development

Adding a New Check Module

Create check file in checks/ directory:

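# checks/my_new_check.py

def validate_something(input_data: dict) -> bool:
    # Placeholder rule: replace with the real validation logic
    return bool(input_data)


def run_check(config: dict, context: dict, check_id: str) -> dict:
    # 1. Get required data from context ("previous_check" is the id of an upstream check)
    input_data = context.get("previous_check", {})

    # 2. Perform validation logic
    result = validate_something(input_data)

    # 3. Store results in context so downstream checks can read them
    context[check_id] = {
        "validation_result": result
    }

    # 4. Return status
    if result:
        return {"status": "passed", "details": {"validation_result": result}}
    else:
        return {"status": "error", "error_message": "Validation failed for previous_check data"}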

Add to profile JSON:

{
    "id": "my_new_check",
    "script": "checks.my_new_check",
    "config": {
        "description": "Validates something important",
        "param1": "value1"
    }
}

Testing

Manual Testing:

python launchers/HM_launcher_CLI.py test_file.pdf ./test_reports/

Context Debugging:

Uncomment line 117 in qc_module.py to enable context snapshots:

aggregated_results["context_snapshot"] = context

Troubleshooting

Common Issues

1. "Module not found" errors

Ensure /opt/QC is in your Python path
Check sys.path.append() statements in launchers

2. "Working directory not found"

Create /tmp/HM_working manually: mkdir -p /tmp/HM_working
Verify permissions on working directories

3. API authentication failures

Verify API keys are valid and not expired
Check API usage limits/quotas
Ensure proper environment variable configuration

4. Box integration issues

Validate JWT config file path
Ensure service account has folder access
Check Box folder IDs are correct

5. PDF parsing failures

Verify LlamaParse API key and quota
Check PDF is not corrupted or password-protected
Ensure sufficient disk space in working directory

6. Censorship check failures

Verify training set images exist in /opt/QC/supporting/censorship_trainset/
Ensure images follow naming convention (*-C.png for censored, *-U.png for uncensored)
Check DSPy model configuration

Logging

CLI Launcher: Uses Python logging module at INFO level

Box Hotfolder: Logs to log/ford_qc_script.log

Increasing Verbosity:

logging.basicConfig(level=logging.DEBUG)

Dependencies

Core Dependencies

Python 3.8+
llama-parse: PDF parsing and image extraction
openai: GPT-4o API integration
dspy: AI-powered decision systems
PIL/Pillow: Image processing
boxsdk: Box cloud storage integration
nest_asyncio: Async event loop management

Full Dependency List

See requirements.txt for complete list including:

fastapi, uvicorn, httpx (web framework components)
pandas, numpy (data processing)
nltk, tiktoken (text processing)
APScheduler (task scheduling)
datasets, optuna (ML/AI components)

System Requirements

Hardware

CPU: Multi-core recommended for DSPy optimization
RAM: 4GB minimum, 8GB+ recommended
Storage: 500MB + space for temporary files

Operating System

Linux (Ubuntu 20.04+)
macOS (tested on Darwin 24.5.0)
Windows (with path adjustments)

Python Version

Python 3.8 - 3.11 (tested)
Not tested with Python 3.12+

Roadmap / Known Limitations

Current Limitations

✅ ~~Hardcoded file paths~~ - FIXED with environment configuration (config.py)
Hardcoded API keys in source code
Limited error recovery in check chains
No parallel check execution
Box hotfolder requires manual startup

Recent Improvements

✅ Environment Configuration - Dev/production mode support
✅ Image QC Support - Full validation for JPG, PNG, PSD files
✅ Image Dimension Check - Validates pixel dimensions against filename
✅ Enhanced Imprint Check - Now includes country code validation
✅ Fixed Censorship Rules - CEN-only checking (GEN files skip)

Future Improvements

Parallel check execution where possible
Retry logic for API failures
Real-time progress reporting
Dashboard for QC history
Support for additional file formats
Externalize API keys to environment variables

License

[Specify license here]

Contact / Support

For issues or questions:

Check existing documentation in CLAUDE.md
Review code comments in individual modules
Contact development team

Appendix: File Reference

Critical Files

| File | Purpose | Contains Sensitive Data |
|------|---------|-------------------------|
| `qc_module.py` | Core orchestration engine | No |
| `checks/analyze_with_gpt.py` | OpenAI integration | ⚠️ API Key |
| `checks/HM_parse.py` | PDF parsing | ⚠️ API Key |
| `checks/HM_censorship.py` | Censorship detection | ⚠️ API Key |
| `launchers/ford_qc_box_hotfolder_process.py` | Box integration | ⚠️ Box Config Path |
| `profiles/HM.json` | H&M check configuration | No |
| `ford_box_config.json` | Box JWT credentials | ⚠️ Yes |

Last Updated: 2025-01-12
Version: 1.2
Maintainer: H&M QC Team

Recent Updates (v1.2)

Added development environment configuration
Implemented image QC checks (JPG, PNG, PSD support)
Added image dimension validation
Enhanced imprint check with country code validation
Fixed censorship rules (CEN-only, GEN skip)
Environment-aware path configuration
