# H&M Quality Control (HMQC) System

## Overview

The H&M Quality Control (HMQC) system is a modular Python application designed to perform automated quality control checks on both PDF documents and static images (JPG, PNG, PSD) for H&M marketing assets. The system validates assets against multiple criteria including filename formatting, image dimensions, imprint verification, language validation, pricing accuracy, and censorship requirements.

## Quick Links

- **[Development Setup Guide](DEV_SETUP.md)** - Setting up local dev environment
- **[Project Instructions](CLAUDE.md)** - Development guidelines and architecture
- **[Check Modules](#check-modules)** - Detailed check documentation

## Table of Contents

- [Architecture](#architecture)
- [Project Structure](#project-structure)
- [Key Components](#key-components)
- [Installation](#installation)
- [Environment Configuration](#environment-configuration)
- [Configuration](#configuration)
- [Usage](#usage)
- [Check Modules](#check-modules)
  - [PDF Check Modules](#pdf-check-modules)
  - [Image Check Modules](#image-check-modules)
  - [Shared Check Modules](#shared-check-modules)
- [API Integrations](#api-integrations)
- [Security Considerations](#security-considerations)
- [Development](#development)
- [Troubleshooting](#troubleshooting)
- [Dependencies](#dependencies)
- [System Requirements](#system-requirements)

---

## Architecture

The HMQC system follows a modular, profile-based architecture:

```
┌─────────────┐
│  Launcher   │ (CLI or Box Hotfolder)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  QC Module  │ (Core Engine)
└──────┬──────┘
       │
       ├──► Profile JSON (defines checks)
       │
       ├──► Check Modules (modular validation)
       │       ├─ HM_parse
       │       ├─ HM_filename_parse
       │       ├─ HM_imprint_check
       │       ├─ HM_language_validate
       │       ├─ HM_price_currency_check
       │       └─ HM_censorship
       │
       └──► HTML Reporter (generates reports)
```

### Core Principles

1. **Modularity**: Each check is an independent module implementing a standard interface
2. **Context Sharing**: Checks share data via a context dictionary for inter-check dependencies
3. **Profile-Based Configuration**: JSON profiles define which checks to run and their parameters
4. **Standardized Results**: All checks return consistent status objects (`passed`, `error`, `skipped`)

---

## Project Structure

```
hm_qc/
├── qc_module.py              # Core QC engine
├── config.py                 # Environment configuration (NEW)
├── requirements.txt          # Python dependencies
├── README.md                 # This file
├── CLAUDE.md                 # Project instructions for Claude
├── DEV_SETUP.md              # Development environment guide (NEW)
│
├── launchers/                # Execution entry points
│   ├── HM_launcher_CLI.py    # Command-line interface (environment-aware)
│   └── ford_qc_box_hotfolder_process.py  # Box integration
│
├── checks/                   # QC check modules
│   ├── HM_parse.py           # PDF parsing with LlamaParse
│   ├── HM_filename_parse.py  # PDF filename validation
│   ├── HM_imprint_check.py   # Imprint verification (includes country code)
│   ├── HM_image_parse.py     # Image loading and processing (NEW)
│   ├── HM_image_filename_parse.py  # Image filename parsing (NEW)
│   ├── HM_image_dimension_check.py # Image dimension validation (NEW)
│   ├── HM_language_validate.py     # Language detection
│   ├── HM_price_currency_check.py  # Currency validation
│   ├── HM_censorship.py      # Censorship compliance - CEN only (UPDATED)
│   ├── analyze_with_gpt.py   # OpenAI GPT integration
│   ├── html_reporter.py      # HTML report generation (environment-aware)
│   ├── business_data_check.py
│   ├── colour_existence_check.py
│   ├── file_size_check.py
│   ├── image_linking_check.py
│   ├── image_resolution_check.py
│   ├── missing_images_check.py
│   ├── special_requirements_mec_bau.py
│   └── unzip_and_verify_check.py
│
├── profiles/                 # Check configurations
│   ├── HM.json               # H&M PDF profile
│   ├── HM_image.json         # H&M image profile (NEW)
│   └── ford_bnp.json         # Ford BNP profile
│
├── supporting/               # Supporting assets
│   ├── censorship_trainset/  # Training images for censorship detection
│   └── HM_Pricing SLUSSEN_30-07-2024.pdf
│
├── tmp/                      # Development temporary files (gitignored)
│   ├── HM_working/           # Working directory (dev mode)
│   └── reports/              # Generated reports (dev mode)
│
└── input_bucket/             # Input file staging area
```

---

## Key Components

### 1. QC Module (`qc_module.py`)

The core engine that orchestrates check execution.

**Key Functions:**
- `run_qc_profile(profile_path, input_file)`: Executes all checks defined in a profile
- `run_single_check(script, config, context, check_id)`: Runs an individual check module
- `run_qc_checks(profile_path, input_file, report_path)`: Full workflow including report generation

**Context Management:**
Results from each check are stored in a shared `context` dictionary, enabling downstream checks to access data from upstream checks.
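
The context flow can be illustrated with a minimal sketch. It is not the actual `qc_module.py` code: `run_profile`, `demo_upstream`, and `demo_downstream` are hypothetical names, and the two demo checks are registered in memory only so the example runs on its own.

```python
import importlib
import sys
import types


def run_profile(profile: list, input_file: str) -> dict:
    """Run each profile entry in order, sharing one context dict between checks."""
    context = {"input_file": input_file}
    results = {}
    for entry in profile:
        check_id = entry["id"]
        try:
            module = importlib.import_module(entry["script"])
            results[check_id] = module.run_check(entry.get("config", {}), context, check_id)
        except Exception as exc:  # one failing check should not abort the whole run
            results[check_id] = {"status": "error", "error_message": str(exc)}
    return results


def _register_demo_check(name, func):
    """Expose func as <name>.run_check so importlib can find it (demo only)."""
    module = types.ModuleType(name)
    module.run_check = func
    sys.modules[name] = module


def _upstream(config, context, check_id):
    # Upstream check stores its output under its own check id.
    context[check_id] = {"extracted_text": "SALE 19.99"}
    return {"status": "passed", "details": {}}


def _downstream(config, context, check_id):
    # Downstream check reads what the upstream check stored.
    text = context.get("demo_upstream", {}).get("extracted_text")
    if text is None:
        return {"status": "skipped", "details": {"reason": "no upstream text"}}
    return {"status": "passed", "details": {"length": len(text)}}


_register_demo_check("demo_upstream", _upstream)
_register_demo_check("demo_downstream", _downstream)

demo_profile = [
    {"id": "demo_upstream", "script": "demo_upstream", "config": {}},
    {"id": "demo_downstream", "script": "demo_downstream", "config": {}},
]
results = run_profile(demo_profile, "example.pdf")
```

Because each check writes under its own `id`, the order of entries in the profile determines which data is available to later checks.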

### 2. Launchers

#### CLI Launcher (`HM_launcher_CLI.py`)
Command-line interface for manual QC execution.

**Usage:**
```bash
python launchers/HM_launcher_CLI.py <path_to_input_file> <path_to_save_report_html>
```

#### Box Hotfolder Launcher (`ford_qc_box_hotfolder_process.py`)
Automated processing of files uploaded to Box folders.

**Features:**
- Monitors Box folder for new `.zip` files
- Downloads, processes, and deletes files
- Uploads QC reports back to Box
- File locking to prevent concurrent runs
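
One common way to prevent concurrent runs is an exclusive lock file. The sketch below shows that technique under assumptions: the launcher's real locking mechanism is not documented here, and `single_instance`, `AlreadyRunning`, and the lock path are hypothetical.

```python
import os
import tempfile
from contextlib import contextmanager


class AlreadyRunning(RuntimeError):
    """Raised when another run already holds the lock file."""


@contextmanager
def single_instance(lock_path: str):
    """Hold an exclusive lock file for the duration of the with-block."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise AlreadyRunning(f"lock held: {lock_path}") from None
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


# Demo: the lock path is an assumption, not the launcher's real location.
lock_path = os.path.join(tempfile.mkdtemp(), "hmqc.lock")
with single_instance(lock_path):
    print("processing Box folder...")  # real work would go here
```

A lock file left behind by a crashed process blocks later runs until it is removed, so a production version would likely need a staleness check.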

### 3. Check Module Pattern

All check modules implement a standard interface:

```python
def run_check(config: dict, context: dict, check_id: str) -> dict:
    """
    Args:
        config: Configuration parameters from profile JSON
        context: Shared context dictionary for inter-check data
        check_id: Unique identifier for this check

    Returns:
        {
            "status": "passed" | "error" | "skipped",
            "details": {...},
            "error_message": "..." (if status == "error")
        }
    """
```

### 4. HTML Reporter (`html_reporter.py`)

Generates Bootstrap-based HTML reports with:
- Collapsible accordion for each check
- Color-coded status badges
- Nested detail formatting
- Error highlighting
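
As an illustration of the color-coded badges, the following sketch maps each status to a badge class. It assumes Bootstrap 5 class names (`bg-success`, `bg-danger`, `bg-secondary`); the classes and the `status_badge` helper are assumptions, not taken from `html_reporter.py`.

```python
import html

# Assumed Bootstrap 5 background classes, one per documented status.
BADGE_CLASSES = {
    "passed": "bg-success",
    "error": "bg-danger",
    "skipped": "bg-secondary",
}


def status_badge(status: str) -> str:
    """Return a badge <span> for a check status, escaping the label text."""
    css = BADGE_CLASSES.get(status, "bg-warning")  # unknown statuses stand out
    return f'<span class="badge {css}">{html.escape(status)}</span>'
```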

---

## Installation

### Prerequisites

- Python 3.8+
- pip package manager
- OpenAI API key
- LlamaParse API key

### Setup

1. **Change to the deployment directory** (where the repository has been cloned or copied):
   ```bash
   cd /path/to/deployment
   ```

2. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

3. **Configure API keys** (see [Security Considerations](#security-considerations)):
   - OpenAI API key for GPT analysis
   - LlamaParse API key for PDF parsing

4. **Set up paths:**

   Update the hardcoded paths in the launcher script and the core module to match your deployment:
   ```python
   # HM_launcher_CLI.py
   PROFILE_PATH = "/opt/QC/profiles/HM.json"

   # qc_module.py
   REPORTS_DIR = os.path.join(os.path.dirname(__file__), "reports")
   ```

5. **Create working directories:**
   ```bash
   mkdir -p /tmp/HM_working
   mkdir -p /opt/QC/reports
   ```

---

## Environment Configuration

The HMQC system supports two environments to enable safe local development without affecting production:

### Development Environment

For local testing and development:

```bash
# Set environment variable
export HM_QC_ENV=dev

# Run checks (uses local paths in ./tmp/)
python launchers/HM_launcher_CLI.py test.pdf ./tmp/reports/report.html
```

**Development paths:**
- Working directory: `./tmp/HM_working/`
- Reports: `./tmp/reports/`
- Supporting files: `./supporting/`

### Production Environment

For deployed production systems (default):

```bash
# Unset or use production mode
unset HM_QC_ENV
# or
export HM_QC_ENV=production

# Run checks (uses /opt/QC paths)
python launchers/HM_launcher_CLI.py input.pdf /opt/QC/reports/
```

**Production paths:**
- Working directory: `/tmp/HM_working/`
- Reports: `/opt/QC/reports/`
- Supporting files: `/opt/QC/supporting/`

### Configuration Module

The `config.py` module automatically handles environment detection:

```python
import config

# Check current environment
print(config.ENVIRONMENT)  # 'dev' or 'production'

# Get paths
print(config.WORKING_DIR)
print(config.REPORTS_DIR)
```
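
For orientation, environment detection of this kind could look roughly like the sketch below. It uses only the paths and the `HM_QC_ENV` variable listed above; the function names and the `SUPPORTING_DIR` key are hypothetical and may differ from the real `config.py`.

```python
import os

# Paths copied from the development and production lists above.
_PATHS = {
    "dev": {
        "WORKING_DIR": "./tmp/HM_working/",
        "REPORTS_DIR": "./tmp/reports/",
        "SUPPORTING_DIR": "./supporting/",
    },
    "production": {
        "WORKING_DIR": "/tmp/HM_working/",
        "REPORTS_DIR": "/opt/QC/reports/",
        "SUPPORTING_DIR": "/opt/QC/supporting/",
    },
}


def resolve_environment(environ=None) -> str:
    """Return 'dev' only when HM_QC_ENV is explicitly 'dev'; otherwise 'production'."""
    environ = os.environ if environ is None else environ
    value = environ.get("HM_QC_ENV", "production").strip().lower()
    return "dev" if value == "dev" else "production"


def resolve_paths(environ=None) -> dict:
    """Return a copy of the path set for the detected environment."""
    return dict(_PATHS[resolve_environment(environ)])
```

Defaulting to production when the variable is unset or unrecognized mirrors the documented behavior that production is the default mode.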

**See [DEV_SETUP.md](DEV_SETUP.md) for complete development environment guide.**

---

## Configuration

### Profile JSON Structure

Profiles define the sequence of checks and their parameters:

```json
[
  {
    "id": "HM_parse",
    "script": "checks.HM_parse",
    "config": {
      "description": "Parses PDF with LlamaParse",
      "input_file": "supplied by launcher script",
      "working_dir": "/tmp/HM_working"
    }
  },
  {
    "id": "HM_filename_parse",
    "script": "checks.HM_filename_parse",
    "config": {
      "description": "Parses filename into components",
      "working_dir": "/tmp/HM_working"
    }
  }
]
```

**Key Fields:**
- `id`: Unique check identifier (used for context storage)
- `script`: Python module path (e.g., `checks.HM_parse`)
- `config`: Check-specific parameters
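
A profile can be sanity-checked before it is run. The sketch below is one possible approach, not part of the documented system: `validate_profile` is a hypothetical helper that reports missing key fields and duplicate `id` values.

```python
import json

REQUIRED_KEYS = ("id", "script", "config")


def validate_profile(entries: list) -> list:
    """Return a list of human-readable problems; an empty list means the profile looks valid."""
    problems = []
    seen = set()
    for index, entry in enumerate(entries):
        for key in REQUIRED_KEYS:
            if key not in entry:
                problems.append(f"entry {index}: missing '{key}'")
        check_id = entry.get("id")
        if check_id in seen:
            # Duplicate ids would overwrite each other's context entries.
            problems.append(f"entry {index}: duplicate id '{check_id}'")
        elif check_id is not None:
            seen.add(check_id)
    return problems


profile = json.loads("""
[
  {"id": "HM_parse", "script": "checks.HM_parse",
   "config": {"working_dir": "/tmp/HM_working"}},
  {"id": "HM_filename_parse", "script": "checks.HM_filename_parse",
   "config": {"working_dir": "/tmp/HM_working"}}
]
""")
problems = validate_profile(profile)
```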

### H&M Filename Format

The system expects H&M filenames to follow these patterns:

**Pattern 1:** `dimensions_format_year_reference-number_language-country.pdf`
- Example: `21.6x27.9cm_letter_2028_10062-01_en-us.pdf`

**Pattern 2:** `dimensions_format_reference-number_(GEN|CEN).pdf`
- Example: `10.8x14cm_quarter_letter_1001D_10004-02_GEN.pdf`
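
For illustration only, Pattern 1 can be approximated with a regular expression. The system itself parses filenames with GPT (see `HM_filename_parse`), so the expression and the `parse_pattern_1` helper below are assumptions, and Pattern 2 is deliberately not covered.

```python
import re
from typing import Optional

# dimensions_format_year_reference-number_language-country.pdf
PATTERN_1 = re.compile(
    r"^(?P<dimensions>\d+(?:\.\d+)?x\d+(?:\.\d+)?cm)"
    r"_(?P<format>.+?)"
    r"_(?P<year>\d{4})"
    r"_(?P<reference>(?:\d+_)?\d+-\d+)"  # optional numeric prefix, e.g. 9000_
    r"_(?P<language>[a-z]{2}-[a-z]{2})\.pdf$",
    re.IGNORECASE,
)


def parse_pattern_1(filename: str) -> Optional[dict]:
    """Return the named components of a Pattern 1 filename, or None if it does not match."""
    match = PATTERN_1.match(filename)
    return match.groupdict() if match else None


parsed = parse_pattern_1("21.6x27.9cm_letter_2028_10062-01_en-us.pdf")
```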

---

## Usage

### Basic CLI Usage

```bash
python launchers/HM_launcher_CLI.py input.pdf /path/to/reports/
```

This will:
1. Execute all checks defined in `/opt/QC/profiles/HM.json`
2. Generate an HTML report in the specified directory
3. Print JSON results to stdout

### Box Hotfolder Integration

```bash
python launchers/ford_qc_box_hotfolder_process.py
```

**Configuration:**
- Modify `BOX_CLI_CONFIG_PATH` to point to your Box JWT config
- Set `SOURCE_FOLDER_ID` and `REPORT_FOLDER_ID`
- Update `PROFILE_PATH` to your desired profile

**Workflow:**
1. Script polls Box source folder for `.zip` files
2. Downloads and processes each file
3. Generates QC report
4. Uploads report to Box reports folder
5. Deletes processed files

---

## Check Modules

The HMQC system includes separate check profiles for PDF documents and static images.

### PDF Check Modules

#### HM_parse
**Purpose:** Parses PDF using LlamaParse API

**Context Output:**
- `extracted_text`: Full text content
- `parsed_image`: First page as PIL Image
- `all_images`: List of all page images

**Dependencies:** LlamaParse API

---

#### HM_filename_parse
**Purpose:** Extracts components from H&M PDF filename using GPT

**Context Input:** `HM_parse.filename`

**Context Output:**
```python
{
    "parsed": {
        "dimensions": "21.6x27.9cm",
        "format": "letter",
        "year": "2028",
        "reference": "9000_10107-06",  # Full reference with prefix
        "language": "en-us"
    }
}
```

**Updated:** Now correctly extracts full reference codes including numeric prefixes (e.g., `9000_10107-06`)

---

#### HM_imprint_check
**Purpose:** Verifies imprint/reference code matches filename including country code

**Context Input:**
- `HM_parse.extracted_text`
- `HM_filename_parse.parsed.reference`
- `HM_filename_parse.parsed.language`

**Logic:**
1. Uses GPT to detect imprint in document
2. Combines reference code with language/country code (e.g., `9000_10107-06_el-CY`)
3. Compares against detected imprint
4. Skips for "OOH" (out-of-home) files

**Updated:** Now includes country code in validation to detect mismatches like CY vs GR
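
The comparison in steps 2 and 3 can be sketched as follows. This assumes the imprint has already been detected (by GPT in the real check) and that matching is case-insensitive; both helper names are hypothetical.

```python
def expected_imprint(reference: str, language: str) -> str:
    """Combine the reference code with the language/country code, e.g. 9000_10107-06_el-CY."""
    lang, _, country = language.partition("-")
    code = f"{lang.lower()}-{country.upper()}" if country else lang
    return f"{reference}_{code}"


def imprint_matches(detected: str, reference: str, language: str) -> bool:
    """Compare the detected imprint with the expected one, ignoring case and outer whitespace."""
    return detected.strip().lower() == expected_imprint(reference, language).lower()
```

With this comparison, a document carrying `9000_10107-06_el-GR` fails against a filename ending in `el-cy`, which is the CY vs GR mismatch described above.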

---

### Image Check Modules

#### HM_image_parse
**Purpose:** Loads and processes static image files (JPG, PNG, PSD)

**Context Output:**
- `parsed_image`: PIL Image object
- `filename`: Original filename
- `image_format`: Format (JPEG, PNG, PSD)
- `image_size`: (width, height) in pixels
- `image_mode`: Color mode (RGB, RGBA, etc.)

---

#### HM_image_filename_parse
**Purpose:** Parses image filename using pattern matching

**Context Input:** `HM_image_parse.filename`

**Context Output:**
```python
{
    "parsed": {
        "language": "en-US",
        "format_type": "DOOH",
        "campaign_number": "4045",
        "dimensions": "1080x1920"
    }
}
```

**Supported Formats:**
- SOME STATIC: `Market_Language_campaignnumber_...`
- DOOH/OOH/Display: `CampaignNumber_Type_Static_..._Size_Language-Market`
- POS GEN: `Size_Format_Campaign_POPNumber_GEN`
- POS Country: `Size_Format_Campaign_POPNumber_Language-Market`
- DS: `Campaign_Name_Index_BU_Resolution_language-COUNTRY`

---

#### HM_image_dimension_check *(NEW)*
**Purpose:** Validates actual image dimensions match filename specification

**Context Input:**
- `HM_image_parse.parsed_image`
- `HM_image_filename_parse.parsed.dimensions`

**Logic:**
1. Extracts expected dimensions from filename (e.g., `1200x400`)
2. Gets actual image dimensions from PIL Image object
3. Compares width × height in pixels

**Example:**
- Filename: `campaign_1200x400_en-US.jpg`
- Expected: 1200px × 400px
- Actual: 1200px × 400px → ✅ PASS
- Actual: 1500px × 600px → ❌ ERROR
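
A minimal sketch of this comparison, assuming the actual size arrives as the `(width, height)` tuple that a PIL `Image.size` attribute provides. `check_dimensions` is a hypothetical helper, not the module's real function.

```python
def check_dimensions(expected: str, actual_size: tuple) -> dict:
    """Compare a 'WIDTHxHEIGHT' string from the filename with the real pixel size."""
    try:
        width_text, height_text = expected.lower().split("x")
        expected_size = (int(width_text), int(height_text))
    except ValueError:
        return {
            "status": "error",
            "details": {},
            "error_message": f"Cannot parse dimensions: {expected!r}",
        }

    details = {"expected": expected_size, "actual": tuple(actual_size)}
    if tuple(actual_size) == expected_size:
        return {"status": "passed", "details": details}
    return {
        "status": "error",
        "details": details,
        "error_message": "Image dimensions do not match the filename",
    }
```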

---

### Shared Check Modules

These checks work for both PDFs and images:

#### HM_language_validate
**Purpose:** Validates document/image language matches filename code

**Context Input:**
- `HM_parse.extracted_text` OR `HM_image_parse.parsed_image`
- `HM_filename_parse.parsed.language` OR `HM_image_filename_parse.parsed.language`

**Context Output:**
- `detected_language`: ISO language code (e.g., "en-US")
- `matches`: Boolean
- `isCensorshipRequired`: True for CEN markets

**Special Cases:**
- `CEN`: Censored market (requires body coverage checks)
- `GEN`: General market (no censorship required)

---

#### HM_price_currency_check
**Purpose:** Validates currency matches regional expectations

**Context Input:**
- Content (text or image)
- Language/market code

**Logic:**
1. Detects currency and price using multimodal GPT analysis
2. Validates currency against region (e.g., USD for en-us)
3. Skips for CEN/GEN markets
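
Steps 2 and 3 can be sketched as a lookup against an expected-currency table. Only the `en-us` to USD pairing comes from this document; the other table entries are illustrative, the real rule set is not shown here, and `check_currency` is a hypothetical helper that assumes the currency has already been detected.

```python
# Illustrative mapping; only en-us -> USD is taken from the documentation.
EXPECTED_CURRENCY = {
    "en-us": "USD",
    "en-gb": "GBP",
    "de-de": "EUR",
    "sv-se": "SEK",
}
SKIPPED_MARKETS = {"CEN", "GEN"}


def check_currency(market: str, detected_currency: str) -> dict:
    """Validate a detected currency code against the market from the filename."""
    if market.upper() in SKIPPED_MARKETS:
        return {"status": "skipped", "details": {"reason": f"{market.upper()} market"}}

    expected = EXPECTED_CURRENCY.get(market.lower())
    if expected is None:
        return {"status": "skipped", "details": {"reason": f"no currency rule for {market}"}}

    details = {"expected": expected, "detected": detected_currency.upper()}
    if details["detected"] == expected:
        return {"status": "passed", "details": details}
    return {
        "status": "error",
        "details": details,
        "error_message": f"Expected {expected} for {market}",
    }
```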

---

#### HM_censorship
**Purpose:** Verifies proper clothing coverage for CEN (censored) markets ONLY

**Context Input:**
- `parsed_image` (from HM_parse or HM_image_parse)
- `parsed.language` (from filename parse)

**Process:**
1. **SKIPS for GEN files** - GEN assets do not require censorship checks
2. **SKIPS for standard market files** - Only CEN files are checked
3. **RUNS for CEN files only** - Validates body coverage using AI
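
The gating rule above amounts to a small decision function. The sketch below assumes the market code arrives as the `language` value from the filename parse; `censorship_decision` is a hypothetical name.

```python
def censorship_decision(market_code: str) -> dict:
    """Decide whether the censorship check should run for a given market code."""
    code = market_code.strip().upper()
    if code == "CEN":
        return {"run": True, "reason": "CEN asset: body coverage must be validated"}
    if code == "GEN":
        return {"run": False, "reason": "GEN asset: no censorship check required"}
    return {"run": False, "reason": "standard market asset: only CEN files are checked"}
```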

**Algorithm:**
1. Trains DSPy-based classifier using training images
2. Analyzes document images for body coverage
3. Validates against censorship requirements

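**Training Set:** `supporting/censorship_trainset/` (resolved to `/opt/QC/supporting/censorship_trainset/` in production and `./supporting/censorship_trainset/` in development)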
- Censored images: `*-C.png`
- Uncensored images: `*-U.png`
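
The naming convention makes it straightforward to derive labels from filenames. The following is a sketch of that step only, under the assumption that labels are read from the `-C` / `-U` suffix; `label_training_images` is hypothetical and no DSPy calls are shown.

```python
from pathlib import PurePath


def label_training_images(filenames) -> list:
    """Return (filename, label) pairs for files following the *-C.png / *-U.png convention."""
    labelled = []
    for name in sorted(filenames):
        path = PurePath(name)
        if path.suffix.lower() != ".png":
            continue  # ignore anything that is not a PNG training image
        if path.stem.endswith("-C"):
            labelled.append((path.name, "censored"))
        elif path.stem.endswith("-U"):
            labelled.append((path.name, "uncensored"))
    return labelled
```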

**Technology:** Uses DSPy with MIPROv2 optimization

**Updated:** Now correctly only runs checks on CEN files, not GEN files

---

## API Integrations

### 1. LlamaParse
**Purpose:** PDF parsing and image extraction

**Usage:**
- Text extraction: `result_type="text"`
- Image generation: `result_type="markdown"` with multimodal model

**Key in Code:** `checks/HM_parse.py:14`

---

### 2. OpenAI GPT-4o
**Purpose:** Content analysis, language detection, validation

**Usage:**
- Synchronous client in `analyze_with_gpt.py`
- Multimodal support (text + images)
- JSON and text response modes

**Models Used:**
- `gpt-4o`: Primary analysis model
- `gpt-4o-mini`: Used in DSPy for censorship detection

**Key in Code:** `checks/analyze_with_gpt.py:13`

---

### 3. DSPy
**Purpose:** AI-powered censorship detection

**Components:**
- `ImageDescription`: Generates clothing coverage descriptions
- `ImageCensorshipDetection`: Classifies censorship status
- `MIPROv2`: Optimizes prompts based on training set

**Key in Code:** `checks/HM_censorship.py:11`

---

### 4. Box Python SDK
**Purpose:** Cloud storage integration for automated workflows

**Authentication:** JWT-based service account

**Configuration File:** `ford_box_config.json`

---

## Security Considerations

### CRITICAL ISSUES IDENTIFIED

⚠️ **HARDCODED API KEYS FOUND** ⚠️

The following files contain hardcoded API keys. Remove the keys from source control and rotate them immediately. The key values are redacted here and must not be copied into documentation:

1. **`checks/HM_parse.py:14`**
   ```python
   os.environ['LLAMA_CLOUD_API_KEY'] = '<REDACTED>'
   ```

2. **`checks/analyze_with_gpt.py:13`**
   ```python
   client = OpenAI(api_key="<REDACTED>")
   ```

3. **`checks/HM_censorship.py:11`**
   ```python
   os.environ["OPENAI_API_KEY"] = "<REDACTED>"
   ```

### Recommended Security Improvements

1. **Use Environment Variables:**
   ```python
   # Replace hardcoded keys with:
   import os
   OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
   LLAMA_CLOUD_API_KEY = os.getenv('LLAMA_CLOUD_API_KEY')
   ```

2. **Use .env Files (with python-dotenv):**
   ```python
   from dotenv import load_dotenv
   load_dotenv()

   OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
   ```

3. **Secret Management Systems:**
   - AWS Secrets Manager
   - HashiCorp Vault
   - Azure Key Vault

4. **Rotate Exposed Keys:**
   - All hardcoded keys should be rotated immediately
   - Monitor API usage for unauthorized access
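
When moving keys to environment variables, it helps to fail early with a clear message if one is missing. A minimal sketch, assuming a hypothetical `require_env` helper that is not part of the current codebase:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

A check module could then call `require_env("OPENAI_API_KEY")` once at startup instead of embedding the key in source.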

### Additional Security Considerations

- **File System Access**: The system writes to `/tmp/HM_working` and `/opt/QC/reports/`. Restrict both directories to the account that runs the QC process.
- **Box Authentication**: The JWT config contains sensitive credentials. Keep `ford_box_config.json` out of version control and limit read access to it.
- **Working Directory Cleanup**: Temporary files can contain sensitive document content, so remove them once a run has finished.
- **Context Snapshots**: Be cautious about logging context data (disabled by default in `qc_module.py:117`)

---

## Development

### Adding a New Check Module

1. **Create check file in `checks/` directory:**
   ```python
   # checks/my_new_check.py

   def validate_something(data: dict) -> bool:
       """Placeholder rule: replace with the real validation logic."""
       return bool(data)

   def run_check(config: dict, context: dict, check_id: str) -> dict:
       # 1. Get required data from context
       input_data = context.get("previous_check", {})

       # 2. Perform validation logic
       result = validate_something(input_data)

       # 3. Store results in context
       context[check_id] = {
           "validation_result": result
       }

       # 4. Return status
       if result:
           return {"status": "passed", "details": {"validation_result": result}}
       return {
           "status": "error",
           "details": {},
           "error_message": "Validation failed for data from previous_check",
       }
   ```

2. **Add to profile JSON:**
   ```json
   {
       "id": "my_new_check",
       "script": "checks.my_new_check",
       "config": {
           "description": "Validates something important",
           "param1": "value1"
       }
   }
   ```

### Testing

**Manual Testing:**
```bash
python launchers/HM_launcher_CLI.py test_file.pdf ./test_reports/
```

**Context Debugging:**

Uncomment line 117 in `qc_module.py` to enable context snapshots:
```python
aggregated_results["context_snapshot"] = context
```

---

## Troubleshooting

### Common Issues

**1. "Module not found" errors**
- Ensure `/opt/QC` is in your Python path
- Check `sys.path.append()` statements in launchers

**2. "Working directory not found"**
- Create `/tmp/HM_working` manually: `mkdir -p /tmp/HM_working`
- Verify permissions on working directories

**3. API authentication failures**
- Verify API keys are valid and not expired
- Check API usage limits/quotas
- Ensure proper environment variable configuration

**4. Box integration issues**
- Validate JWT config file path
- Ensure service account has folder access
- Check Box folder IDs are correct

**5. PDF parsing failures**
- Verify LlamaParse API key and quota
- Check PDF is not corrupted or password-protected
- Ensure sufficient disk space in working directory

**6. Censorship check failures**
- Verify training set images exist in `/opt/QC/supporting/censorship_trainset/`
- Ensure images follow naming convention (`*-C.png` for censored, `*-U.png` for uncensored)
- Check DSPy model configuration

### Logging

**CLI Launcher:** Uses Python `logging` module at INFO level

**Box Hotfolder:** Logs to `log/ford_qc_script.log`

**Increasing Verbosity:**
```python
logging.basicConfig(level=logging.DEBUG)
```

---

## Dependencies

### Core Dependencies
- **Python 3.8+**
- **llama-parse**: PDF parsing and image extraction
- **openai**: OpenAI API client (GPT-4o and GPT-4o-mini)
- **dspy**: AI-powered decision systems
- **PIL/Pillow**: Image processing
- **boxsdk**: Box cloud storage integration
- **nest_asyncio**: Async event loop management

### Full Dependency List
See `requirements.txt` for complete list including:
- `fastapi`, `uvicorn`, `httpx` (web framework components)
- `pandas`, `numpy` (data processing)
- `nltk`, `tiktoken` (text processing)
- `APScheduler` (task scheduling)
- `datasets`, `optuna` (ML/AI components)

---

## System Requirements

### Hardware
- **CPU**: Multi-core recommended for DSPy optimization
- **RAM**: 4GB minimum, 8GB+ recommended
- **Storage**: 500MB + space for temporary files

### Operating System
- Linux (Ubuntu 20.04+)
- macOS (tested on Darwin 24.5.0)
- Windows (with path adjustments)

### Python Version
- Python 3.8 - 3.11 (tested)
- Not tested with Python 3.12+

---

## Roadmap / Known Limitations

### Current Limitations
1. ✅ ~~Hardcoded file paths~~ - **FIXED** with environment configuration (`config.py`)
2. Hardcoded API keys in source code
3. Limited error recovery in check chains
4. No parallel check execution
5. Box hotfolder requires manual startup

### Recent Improvements
- ✅ **Environment Configuration** - Dev/production mode support
- ✅ **Image QC Support** - Full validation for JPG, PNG, PSD files
- ✅ **Image Dimension Check** - Validates pixel dimensions against filename
- ✅ **Enhanced Imprint Check** - Now includes country code validation
- ✅ **Fixed Censorship Rules** - CEN-only checking (GEN files skip)

### Future Improvements
- Parallel check execution where possible
- Retry logic for API failures
- Real-time progress reporting
- Dashboard for QC history
- Support for additional file formats
- Externalize API keys to environment variables

---

## License

[Specify license here]

---

## Contact / Support

For issues or questions:
- Check existing documentation in `CLAUDE.md`
- Review code comments in individual modules
- Contact development team

---

## Appendix: File Reference

### Critical Files

| File | Purpose | Contains Sensitive Data |
|------|---------|------------------------|
| `qc_module.py` | Core orchestration engine | No |
| `checks/analyze_with_gpt.py` | OpenAI integration | ⚠️ API Key |
| `checks/HM_parse.py` | PDF parsing | ⚠️ API Key |
| `checks/HM_censorship.py` | Censorship detection | ⚠️ API Key |
| `launchers/ford_qc_box_hotfolder_process.py` | Box integration | ⚠️ Box Config Path |
| `profiles/HM.json` | H&M check configuration | No |
| `ford_box_config.json` | Box JWT credentials | ⚠️ Yes |

---

- **Last Updated:** 2025-01-12
- **Version:** 1.2
- **Maintainer:** H&M QC Team

## Recent Updates (v1.2)

- Added development environment configuration
- Implemented image QC checks (JPG, PNG, PSD support)
- Added image dimension validation
- Enhanced imprint check with country code validation
- Fixed censorship rules (CEN-only, GEN skip)
- Environment-aware path configuration