H&M Quality Control (HMQC) System

Overview

The H&M Quality Control (HMQC) system is a modular Python application designed to perform automated quality control checks on PDF files for H&M marketing assets. The system validates assets against multiple criteria including filename formatting, imprint verification, language validation, pricing accuracy, and censorship requirements.

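Table of Contents

Architecture
Project Structure
Key Components
Installation
Configuration
Usage
Check Modules
API Integrations
Security Considerations
Development
Troubleshooting
Dependencies
System Requirements
Roadmap / Known Limitations
License
Contact / Support
Appendix: File Reference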

Architecture

The HMQC system follows a modular, profile-based architecture:

┌─────────────┐
│  Launcher   │ (CLI or Box Hotfolder)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  QC Module  │ (Core Engine)
└──────┬──────┘
       │
       ├──► Profile JSON (defines checks)
       │
       ├──► Check Modules (modular validation)
       │       ├─ HM_parse
       │       ├─ HM_filename_parse
       │       ├─ HM_imprint_check
       │       ├─ HM_language_validate
       │       ├─ HM_price_currency_check
       │       └─ HM_censorship
       │
       └──► HTML Reporter (generates reports)

Core Principles

Modularity: Each check is an independent module implementing a standard interface
Context Sharing: Checks share data via a context dictionary for inter-check dependencies
Profile-Based Configuration: JSON profiles define which checks to run and their parameters
Standardized Results: All checks return consistent status objects (passed, error, skipped)
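
As an illustration of the standardized result shape, a caller could coerce whatever a check returns into that shape before reporting. This is a minimal sketch, not code from the repository: the helper name `normalize_result` and the fallback messages are assumptions, and only the three status values and the `details` / `error_message` keys come from this README.

```python
from typing import Any, Dict

# Status values documented for check results.
ALLOWED_STATUSES = ("passed", "error", "skipped")


def normalize_result(result: Any) -> Dict[str, Any]:
    """Coerce a check's return value into the documented status object."""
    if not isinstance(result, dict) or result.get("status") not in ALLOWED_STATUSES:
        # Anything unexpected is reported as an error rather than silently passed on.
        return {
            "status": "error",
            "details": {},
            "error_message": f"Malformed check result: {result!r}",
        }
    normalized = {"status": result["status"], "details": result.get("details", {})}
    if result["status"] == "error":
        # Hypothetical default text for checks that omit the message.
        normalized["error_message"] = result.get("error_message", "Unknown error")
    return normalized
```

Normalizing in one place would let the reporter assume every entry has a `status` and a `details` key.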

Project Structure

deployed/
├── qc_module.py              # Core QC engine
├── requirements.txt          # Python dependencies
├── CLAUDE.md                 # Development documentation
│
├── launchers/                # Execution entry points
│   ├── HM_launcher_CLI.py    # Command-line interface
│   └── ford_qc_box_hotfolder_process.py  # Box integration
│
├── checks/                   # QC check modules
│   ├── HM_parse.py           # PDF parsing with LlamaParse
│   ├── HM_filename_parse.py  # Filename validation
│   ├── HM_imprint_check.py   # Imprint verification
│   ├── HM_language_validate.py  # Language detection
│   ├── HM_price_currency_check.py  # Currency validation
│   ├── HM_censorship.py      # Censorship compliance (DSPy)
│   ├── analyze_with_gpt.py   # OpenAI GPT integration
│   ├── html_reporter.py      # HTML report generation
│   ├── business_data_check.py
│   ├── colour_existence_check.py
│   ├── file_size_check.py
│   ├── image_linking_check.py
│   ├── image_resolution_check.py
│   ├── missing_images_check.py
│   ├── special_requirements_mec_bau.py
│   └── unzip_and_verify_check.py
│
├── profiles/                 # Check configurations
│   ├── HM.json               # H&M standard profile
│   └── ford_bnp.json         # Ford BNP profile
│
├── supporting/               # Supporting assets
│   ├── censorship_trainset/  # Training images for censorship detection
│   └── HM_Pricing SLUSSEN_30-07-2024.pdf
│
├── utils/                    # Utilities
│   ├── report.py
│   ├── input_report.json
│   └── qc_report.html
│
└── input_bucket/             # Input file staging area

Key Components

1. QC Module (`qc_module.py`)

The core engine that orchestrates check execution.

Key Functions:

run_qc_profile(profile_path, input_file): Executes all checks defined in a profile
run_single_check(script, config, context, check_id): Runs an individual check module
run_qc_checks(profile_path, input_file, report_path): Full workflow including report generation

Context Management: Results from each check are stored in a shared context dictionary, enabling downstream checks to access data from upstream checks.
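
The context-sharing flow can be pictured with a minimal sketch. It is not the actual `qc_module.py` code: the function name `run_profile_sketch`, the error handling, and the fallback of storing a check's `details` under its `id` are assumptions made for illustration. Only the `run_check(config, context, check_id)` interface and the profile fields (`id`, `script`, `config`) come from this README.

```python
import importlib
from typing import Any, Dict, List


def run_profile_sketch(profile: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Run each profile entry in order, sharing one context dict (illustrative only)."""
    context: Dict[str, Any] = {}   # shared across all checks
    results: Dict[str, Any] = {}   # per-check status objects

    for entry in profile:
        check_id = entry["id"]
        try:
            # "script" is a dotted module path such as "checks.HM_parse"
            module = importlib.import_module(entry["script"])
            result = module.run_check(entry.get("config", {}), context, check_id)
        except Exception as exc:
            # Assumption: a failing check is recorded and the run continues.
            result = {"status": "error", "details": {}, "error_message": str(exc)}

        results[check_id] = result
        # Assumption: keep whatever the check wrote to the context itself;
        # otherwise fall back to its "details" so downstream checks can read them.
        context.setdefault(check_id, result.get("details", {}))

    return results
```

Because checks run in profile order, a downstream check can read `context["HM_parse"]` only if `HM_parse` appears earlier in the profile.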

2. Launchers

CLI Launcher (`HM_launcher_CLI.py`)

Command-line interface for manual QC execution.

Usage:

python launchers/HM_launcher_CLI.py <path_to_input_file> <path_to_save_report_html>

Box Hotfolder Launcher (`ford_qc_box_hotfolder_process.py`)

Automated processing of files uploaded to Box folders.

Features:

Monitors Box folder for new .zip files
Downloads, processes, and deletes files
Uploads QC reports back to Box
File locking to prevent concurrent runs
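
One common way to prevent concurrent runs is an exclusive lock file. The sketch below shows that general technique and is not taken from `ford_qc_box_hotfolder_process.py`; the helper name `single_instance_lock` and the lock-file approach itself are assumptions.

```python
import os
from contextlib import contextmanager


@contextmanager
def single_instance_lock(lock_path: str):
    """Yield True if this process acquired the lock, False if another run holds it."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        yield False
        return
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield True
    finally:
        os.close(fd)
        os.remove(lock_path)
```

A caller would wrap one polling pass in `with single_instance_lock(path) as acquired:` and return early when `acquired` is False. A lock file left behind by a crashed run has to be removed manually with this approach.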

3. Check Module Pattern

All check modules implement a standard interface:

def run_check(config: dict, context: dict, check_id: str) -> dict:
    """
    Args:
        config: Configuration parameters from profile JSON
        context: Shared context dictionary for inter-check data
        check_id: Unique identifier for this check

    Returns:
        {
            "status": "passed" | "error" | "skipped",
            "details": {...},
            "error_message": "..." (if status == "error")
        }
    """

4. HTML Reporter (`html_reporter.py`)

Generates Bootstrap-based HTML reports with:

Collapsible accordion for each check
Color-coded status badges
Nested detail formatting
Error highlighting
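
To make the report structure concrete, here is a small sketch of rendering one check as a collapsible block with a color-coded badge. It is not the code in `html_reporter.py`: the function name `render_check` is hypothetical, the plain `<details>` element stands in for the Bootstrap accordion, and the badge classes assume Bootstrap 5 naming.

```python
import html
import json
from typing import Any, Dict

# Assumed Bootstrap 5 background classes for each documented status.
BADGE_CLASS = {"passed": "bg-success", "error": "bg-danger", "skipped": "bg-secondary"}


def render_check(check_id: str, result: Dict[str, Any]) -> str:
    """Render one check result as a collapsible HTML fragment."""
    status = str(result.get("status", "error"))
    badge = BADGE_CLASS.get(status, "bg-warning")
    # default=str keeps non-JSON values (paths, images) from raising.
    details = json.dumps(result.get("details", {}), indent=2, default=str)
    return (
        "<details>"
        f"<summary>{html.escape(check_id)} "
        f'<span class="badge {badge}">{html.escape(status)}</span></summary>'
        f"<pre>{html.escape(details)}</pre>"
        "</details>"
    )
```

Escaping every value matters here because the details can contain text extracted from the PDF.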

Installation

Prerequisites

Python 3.8+
pip package manager
OpenAI API key
LlamaParse API key

Setup

Change into the deployment directory (where the repository was cloned or copied):
```
cd /path/to/deployment
```
Install dependencies:
```
pip install -r requirements.txt
```
Configure API keys (see Security Considerations):
- OpenAI API key for GPT analysis
- LlamaParse API key for PDF parsing

Set up paths:

Update the hardcoded paths in the CLI launcher and the core module:

# HM_launcher_CLI.py
PROFILE_PATH = "/opt/QC/profiles/HM.json"

# qc_module.py
REPORTS_DIR = os.path.join(os.path.dirname(__file__), "reports")

Create working directories:

mkdir -p /tmp/HM_working
mkdir -p /opt/QC/reports

Configuration

Profile JSON Structure

Profiles define the sequence of checks and their parameters:

[
  {
    "id": "HM_parse",
    "script": "checks.HM_parse",
    "config": {
      "description": "Parses PDF with LlamaParse",
      "input_file": "supplied by launcher script",
      "working_dir": "/tmp/HM_working"
    }
  },
  {
    "id": "HM_filename_parse",
    "script": "checks.HM_filename_parse",
    "config": {
      "description": "Parses filename into components",
      "working_dir": "/tmp/HM_working"
    }
  }
]

Key Fields:

id: Unique check identifier (used for context storage)
script: Python module path (e.g., checks.HM_parse)
config: Check-specific parameters
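
A profile can be sanity-checked before any check runs. The following is a minimal sketch under stated assumptions: the function name `load_profile` and the specific error messages are hypothetical, and the repository may validate profiles differently or not at all. The required fields are the three described above.

```python
import json
from typing import Any, Dict, List

REQUIRED_FIELDS = ("id", "script", "config")


def load_profile(path: str) -> List[Dict[str, Any]]:
    """Load a profile JSON file and verify the documented fields."""
    with open(path, encoding="utf-8") as handle:
        profile = json.load(handle)

    if not isinstance(profile, list):
        raise ValueError("Profile must be a JSON array of check entries")

    seen_ids = set()
    for index, entry in enumerate(profile):
        if not isinstance(entry, dict):
            raise ValueError(f"Entry {index} is not a JSON object")
        missing = [field for field in REQUIRED_FIELDS if field not in entry]
        if missing:
            raise ValueError(f"Entry {index} is missing fields: {missing}")
        # ids key the shared context, so duplicates would overwrite each other
        if entry["id"] in seen_ids:
            raise ValueError(f"Duplicate check id: {entry['id']}")
        seen_ids.add(entry["id"])
    return profile
```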

H&M Filename Format

The system expects H&M filenames to follow these patterns:

Pattern 1: dimensions_format_year_reference-number_language-country.pdf

Example: 21.6x27.9cm_letter_2028_10062-01_en-us.pdf

Pattern 2: dimensions_format_reference-number_(GEN|CEN).pdf

Example: 10.8x14cm_quarter_letter_1001D_10004-02_GEN.pdf
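
The two patterns can be approximated with regular expressions. This is only an illustrative sketch: as described under Check Modules, the system itself extracts filename components using GPT, and the function name `parse_hm_filename` and the exact character classes are assumptions. Note that the Pattern 2 example contains a segment (`1001D`) that the stated pattern has no separate slot for, so this sketch leaves it inside `format`.

```python
import re
from typing import Dict, Optional

_DIMENSIONS = r"(?P<dimensions>\d+(?:\.\d+)?x\d+(?:\.\d+)?cm)"
_REFERENCE = r"(?P<reference>\d+-\d+)"

# Pattern 1: dimensions_format_year_reference-number_language-country.pdf
PATTERN_1 = re.compile(
    "^" + _DIMENSIONS + r"_(?P<format>.+?)_(?P<year>\d{4})_" + _REFERENCE
    + r"_(?P<language>[a-z]{2}-[a-z]{2})\.pdf$",
    re.IGNORECASE,
)

# Pattern 2: dimensions_format_reference-number_(GEN|CEN).pdf
PATTERN_2 = re.compile(
    "^" + _DIMENSIONS + r"_(?P<format>.+?)_" + _REFERENCE
    + r"_(?P<market>GEN|CEN)\.pdf$",
    re.IGNORECASE,
)


def parse_hm_filename(filename: str) -> Optional[Dict[str, str]]:
    """Return the named components, or None if neither pattern matches."""
    for pattern in (PATTERN_1, PATTERN_2):
        match = pattern.match(filename)
        if match:
            return match.groupdict()
    return None
```

A `None` result is a cheap first signal that a filename needs attention before any API call is made.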

Usage

Basic CLI Usage

python launchers/HM_launcher_CLI.py input.pdf /path/to/reports/

This will:

Execute all checks defined in /opt/QC/profiles/HM.json
Generate an HTML report in the specified directory
Print JSON results to stdout

Box Hotfolder Integration

python launchers/ford_qc_box_hotfolder_process.py

Configuration:

Modify BOX_CLI_CONFIG_PATH to point to your Box JWT config
Set SOURCE_FOLDER_ID and REPORT_FOLDER_ID
Update PROFILE_PATH to your desired profile

Workflow:

Script polls Box source folder for .zip files
Downloads and processes each file
Generates QC report
Uploads report to Box reports folder
Deletes processed files

Check Modules

HM_parse

Purpose: Parses PDF using LlamaParse API

Context Output:

extracted_text: Full text content
parsed_image: First page as PIL Image
all_images: List of all page images

Dependencies: LlamaParse API

HM_filename_parse

Purpose: Extracts components from H&M filename using GPT

Context Input: HM_parse.filename

Context Output:

{
    "parsed": {
        "dimensions": "21.6x27.9cm",
        "format": "letter",
        "year": "2028",
        "reference": "10062-01",
        "language": "en-us"
    }
}

HM_imprint_check

Purpose: Verifies imprint/reference code matches filename

Context Input:

HM_parse.extracted_text
HM_filename_parse.parsed.reference

Logic:

Uses GPT to detect imprint in document
Compares against expected reference from filename
Skips for "OOH" (out-of-home) files
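
The comparison step can be illustrated without GPT. The sketch below swaps in a plain regular-expression search for the reference code, which is a different technique from the GPT-based detection the check actually uses; the function name `imprint_present` and the tolerance for spaces and dash variants are assumptions. The OOH skip is not modeled because this README does not say how OOH files are identified.

```python
import re


def imprint_present(extracted_text: str, reference: str) -> bool:
    """Return True if the reference code (e.g. "10062-01") appears in the text."""
    reference = reference.strip()
    if not reference:
        return False
    parts = [re.escape(part) for part in reference.split("-")]
    # Allow optional spaces and common dash variants between the parts.
    body = r"\s*[-\u2010-\u2013]\s*".join(parts)
    # Lookarounds stop "10062-01" from matching inside "110062-012".
    pattern = r"(?<![\w-])" + body + r"(?![\w-])"
    return re.search(pattern, extracted_text) is not None
```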

HM_language_validate

Purpose: Validates document language matches filename code

Context Input:

HM_parse.extracted_text
HM_parse.parsed_image
HM_filename_parse.parsed.language

Context Output:

detected_language: Language tag combining language and region codes (e.g., "en-US")
matches: Boolean
isCensorshipRequired: True for CEN markets

Special Cases:

CEN: Censored market (requires body coverage checks)
GEN: General market (no censorship required)
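
How the three output fields relate can be sketched as follows. This is an assumption-laden illustration rather than the module's code: the function name `compare_language` is hypothetical, the language itself is detected by GPT in the real check, and returning `matches` as `None` for CEN/GEN files is a choice made here because those filenames carry no language code.

```python
from typing import Any, Dict

# Market code -> whether censorship checks are required.
MARKET_CODES = {"CEN": True, "GEN": False}


def compare_language(expected_code: str, detected_language: str) -> Dict[str, Any]:
    """Compare the filename's code with the detected language tag."""
    code = expected_code.strip()
    if code.upper() in MARKET_CODES:
        # Assumption: no language comparison is possible for CEN/GEN files.
        return {
            "detected_language": detected_language,
            "matches": None,
            "isCensorshipRequired": MARKET_CODES[code.upper()],
        }
    return {
        "detected_language": detected_language,
        # Case-insensitive so "en-us" (filename) equals "en-US" (detected).
        "matches": code.lower() == detected_language.strip().lower(),
        "isCensorshipRequired": False,
    }
```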

HM_price_currency_check

Purpose: Validates currency matches regional expectations

Context Input:

HM_parse.extracted_text
HM_parse.parsed_image
HM_filename_parse.parsed.language
HM_language_validate.isCensorshipRequired

Logic:

Detects currency and price using multimodal GPT analysis
Validates currency against region (e.g., USD for en-us)
Skips for CEN/GEN markets
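
The validation and skip rules can be sketched like this. The region-to-currency table is an illustrative subset (only the USD/en-us pairing is stated in this README), the function name `check_currency` is hypothetical, and the detected currency is assumed to arrive as a three-letter code from the GPT analysis step.

```python
from typing import Any, Dict, Optional

# Illustrative subset only; the real mapping is not shown in this README.
EXPECTED_CURRENCY = {"en-us": "USD", "en-gb": "GBP", "de-de": "EUR", "sv-se": "SEK"}


def check_currency(language_code: str, detected_currency: Optional[str]) -> Dict[str, Any]:
    """Return a status object comparing detected and expected currency."""
    code = language_code.strip().lower()
    if code in ("cen", "gen"):
        return {"status": "skipped", "details": {"reason": "CEN/GEN market"}}

    expected = EXPECTED_CURRENCY.get(code)
    if expected is None:
        return {
            "status": "error",
            "details": {},
            "error_message": f"No currency mapping for {code!r}",
        }

    details = {"expected": expected, "detected": detected_currency}
    if detected_currency and detected_currency.strip().upper() == expected:
        return {"status": "passed", "details": details}
    return {
        "status": "error",
        "details": details,
        "error_message": f"Expected {expected}, found {detected_currency}",
    }
```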

HM_censorship

Purpose: Verifies proper clothing coverage for censored markets

Context Input:

HM_parse.parsed_image
HM_filename_parse.parsed.language
HM_language_validate.isCensorshipRequired

Process:

Optimizes a DSPy-based classifier against the labeled images in /opt/QC/supporting/censorship_trainset/
Analyzes document images for body coverage
Validates against censorship requirements

Training Set Naming:

Censored images: *-C.png
Uncensored images: *-U.png
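
Labels can be derived from the naming convention alone. The helper below is a sketch, not the loader in `HM_censorship.py`: the function name `load_labeled_images` and the label strings are assumptions, and turning the pairs into DSPy examples is left out.

```python
from pathlib import Path
from typing import List, Tuple


def load_labeled_images(trainset_dir: str) -> List[Tuple[Path, str]]:
    """Return (path, label) pairs based on the -C / -U filename suffix."""
    examples: List[Tuple[Path, str]] = []
    for path in sorted(Path(trainset_dir).glob("*.png")):
        if path.stem.endswith("-C"):
            examples.append((path, "censored"))
        elif path.stem.endswith("-U"):
            examples.append((path, "uncensored"))
        # Files without a recognised suffix are ignored rather than guessed at.
    return examples
```

An empty result from this kind of scan is a quick way to spot a missing or misnamed training set.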

Technology: Uses DSPy with MIPROv2 optimization

API Integrations

1. LlamaParse

Purpose: PDF parsing and image extraction

Usage:

Text extraction: result_type="text"
Image generation: result_type="markdown" with multimodal model

Key in Code: checks/HM_parse.py:14

2. OpenAI GPT-4o

Purpose: Content analysis, language detection, validation

Usage:

Synchronous client in analyze_with_gpt.py
Multimodal support (text + images)
JSON and text response modes

Models Used:

gpt-4o: Primary analysis model
gpt-4o-mini: Used in DSPy for censorship detection

Key in Code: checks/analyze_with_gpt.py:13

3. DSPy

Purpose: AI-powered censorship detection

Components:

ImageDescription: Generates clothing coverage descriptions
ImageCensorshipDetection: Classifies censorship status
MIPROv2: Optimizes prompts based on training set

Key in Code: checks/HM_censorship.py:11

4. Box Python SDK

Purpose: Cloud storage integration for automated workflows

Authentication: JWT-based service account

Configuration File: ford_box_config.json

Security Considerations

CRITICAL ISSUES IDENTIFIED

⚠️ HARDCODED API KEYS FOUND ⚠️

The following files contain hardcoded API keys that should be immediately addressed:

checks/HM_parse.py:14

os.environ['LLAMA_CLOUD_API_KEY'] = 'llx-<redacted>'

checks/analyze_with_gpt.py:13

client = OpenAI(api_key="sk-svcacct-<redacted>")

checks/HM_censorship.py:11

os.environ["OPENAI_API_KEY"] = "sk-proj-<redacted>"

Recommended Security Improvements

Use Environment Variables:

# Replace hardcoded keys with:
import os
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
LLAMA_CLOUD_API_KEY = os.getenv('LLAMA_CLOUD_API_KEY')

Use .env Files (with python-dotenv):

import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from a .env file into the environment

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

Secret Management Systems:
- AWS Secrets Manager
- HashiCorp Vault
- Azure Key Vault

Rotate Exposed Keys:
- All hardcoded keys should be rotated immediately
- Monitor API usage for unauthorized access
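
When keys come from the environment, it helps to fail fast with a clear message if one is missing, since `os.getenv` returns `None` silently. A minimal sketch follows; the helper name `require_env` is hypothetical and not part of the codebase.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

For example, `client = OpenAI(api_key=require_env("OPENAI_API_KEY"))` would replace the hardcoded key in `checks/analyze_with_gpt.py`.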

Additional Security Considerations

File System Access: The system writes to /tmp/HM_working and /opt/QC/reports/ - ensure proper permissions
Box Authentication: JWT config contains sensitive credentials - protect ford_box_config.json
Working Directory Cleanup: Temporary files contain potentially sensitive document content
Context Snapshots: Be cautious about logging context data (disabled by default in qc_module.py:117)
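
For working directory cleanup, one option is a per-run scratch directory that is removed when processing ends. This sketch is an assumption about how cleanup could be done, not a description of current behavior (the system uses the fixed path /tmp/HM_working); the helper name `working_directory` is hypothetical.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def working_directory(prefix: str = "HM_working_"):
    """Create a private scratch directory and remove it when processing ends."""
    # mkdtemp creates a directory accessible only by the creating user.
    path = Path(tempfile.mkdtemp(prefix=prefix))
    try:
        yield path
    finally:
        # Remove extracted pages and text even if a check raised.
        shutil.rmtree(path, ignore_errors=True)
```

The yielded path could be passed to checks as their `working_dir` config value.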

Development

Adding a New Check Module

Create check file in checks/ directory:

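# checks/my_new_check.py

def validate_something(input_data: dict) -> bool:
    """Placeholder: replace with the real validation logic."""
    return bool(input_data)


def run_check(config: dict, context: dict, check_id: str) -> dict:
    # 1. Get required data from context ("previous_check" is the id of an upstream check)
    input_data = context.get("previous_check", {})

    # 2. Perform validation logic
    result = validate_something(input_data)

    # 3. Store results in context so downstream checks can read them
    context[check_id] = {
        "validation_result": result
    }

    # 4. Return status
    if result:
        return {"status": "passed", "details": {"validation_result": result}}
    return {"status": "error", "details": {}, "error_message": "Validation failed"}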

Add to profile JSON:

{
    "id": "my_new_check",
    "script": "checks.my_new_check",
    "config": {
        "description": "Validates something important",
        "param1": "value1"
    }
}

Testing

Manual Testing:

python launchers/HM_launcher_CLI.py test_file.pdf ./test_reports/

Context Debugging:

Uncomment line 117 in qc_module.py to enable context snapshots:

aggregated_results["context_snapshot"] = context

Troubleshooting

Common Issues

1. "Module not found" errors

Ensure /opt/QC is in your Python path
Check sys.path.append() statements in launchers

2. "Working directory not found"

Create /tmp/HM_working manually: mkdir -p /tmp/HM_working
Verify permissions on working directories

3. API authentication failures

Verify API keys are valid and not expired
Check API usage limits/quotas
Ensure proper environment variable configuration

4. Box integration issues

Validate JWT config file path
Ensure service account has folder access
Check Box folder IDs are correct

5. PDF parsing failures

Verify LlamaParse API key and quota
Check PDF is not corrupted or password-protected
Ensure sufficient disk space in working directory

6. Censorship check failures

Verify training set images exist in /opt/QC/supporting/censorship_trainset/
Ensure images follow naming convention (*-C.png for censored, *-U.png for uncensored)
Check DSPy model configuration

Logging

CLI Launcher: Uses Python logging module at INFO level

Box Hotfolder: Logs to log/ford_qc_script.log

Increasing Verbosity:

logging.basicConfig(level=logging.DEBUG)

Dependencies

Core Dependencies

Python 3.8+
llama-parse: PDF parsing and image extraction
openai: GPT-4o API integration
dspy: Framework for programming and optimizing language model prompts (used for censorship detection)
PIL/Pillow: Image processing
boxsdk: Box cloud storage integration
nest_asyncio: Async event loop management

Full Dependency List

See requirements.txt for complete list including:

fastapi, uvicorn, httpx (web framework components)
pandas, numpy (data processing)
nltk, tiktoken (text processing)
APScheduler (task scheduling)
datasets, optuna (ML/AI components)

System Requirements

Hardware

CPU: Multi-core recommended for DSPy optimization
RAM: 4GB minimum, 8GB+ recommended
Storage: 500MB + space for temporary files

Operating System

Linux (Ubuntu 20.04+)
macOS (tested on Darwin 24.5.0)
Windows (with path adjustments)

Python Version

Python 3.8 - 3.11 (tested)
Not tested with Python 3.12+

Roadmap / Known Limitations

Current Limitations

Hardcoded file paths (/opt/QC, /tmp/HM_working)
Hardcoded API keys in source code
Limited error recovery in check chains
No parallel check execution
Box hotfolder requires manual startup

Future Improvements

Configurable paths via environment variables
Parallel check execution where possible
Retry logic for API failures
Real-time progress reporting
Dashboard for QC history
Support for additional file formats

License

[Specify license here]

Contact / Support

For issues or questions:

Check existing documentation in CLAUDE.md
Review code comments in individual modules
Contact development team

Appendix: File Reference

Critical Files

| File | Purpose | Contains Sensitive Data |
|------|---------|-------------------------|
| `qc_module.py` | Core orchestration engine | No |
| `checks/analyze_with_gpt.py` | OpenAI integration | ⚠️ API Key |
| `checks/HM_parse.py` | PDF parsing | ⚠️ API Key |
| `checks/HM_censorship.py` | Censorship detection | ⚠️ API Key |
| `launchers/ford_qc_box_hotfolder_process.py` | Box integration | ⚠️ Box Config Path |
| `profiles/HM.json` | H&M check configuration | No |
| `ford_box_config.json` | Box JWT credentials | ⚠️ Yes |

Last Updated: 2025-09-30
Version: 1.0
Maintainer: H&M QC Team
