nickviljoen 1242af363f Updated PDF QC and addition to Image QC

2025-11-13 13:41:31 +02:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

The H&M Quality Control (HMQC) system is a modular Python application that performs quality control checks on PDF documents and static images (JPG, PNG, PSD) for H&M marketing assets. Specialized check modules validate assets against criteria such as filename formatting, image dimensions, imprint verification, language validation, pricing, and censorship requirements.

The system supports both development and production environments through environment configuration.

Key Components

Core Module: qc_module.py - The main engine that loads and runs QC checks based on profiles.
Environment Config: config.py - Handles dev/production environment paths and configuration.
Launchers: Scripts in /launchers that execute the QC process, including CLI and Box hotfolder integration.
Check Modules: Individual validation components in /checks that implement specific QC criteria.
- PDF checks: Parse, filename parse, imprint check
- Image checks: Image parse, filename parse, dimension check
- Shared checks: Language validation, currency check, censorship (CEN only)
Profiles: JSON configuration files in /profiles that define which checks to run and their parameters.
- HM.json - PDF document checks
- HM_image.json - Static image checks
HTML Reporting: Generated reports showing check results for each processed file.
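
As a rough illustration of how the core engine, profiles, and check modules could fit together, here is a minimal sketch. The profile shape (`checks`, `id`, `module`, `config`), the `registry` of callables, and the choice to convert exceptions into `error` results are all assumptions made for this example; the real qc_module.py and profile schema may differ.

```python
import json

# Hypothetical profile shape; the real HM.json / HM_image.json schema may differ.
EXAMPLE_PROFILE = json.loads("""
{
  "checks": [
    {"id": "filename_parse", "module": "filename_parse", "config": {}},
    {"id": "dimension_check", "module": "dimension_check", "config": {}}
  ]
}
""")


def run_profile(profile, registry, context=None):
    """Run each check listed in the profile and collect results in a shared context.

    `registry` maps a module name to a run_check-style callable. This lookup
    is a stand-in for however the real engine locates check modules.
    """
    context = {} if context is None else context
    for entry in profile["checks"]:
        check_id = entry["id"]
        run_check = registry[entry["module"]]
        try:
            result = run_check(entry.get("config", {}), context, check_id)
        except Exception as exc:
            # Assumption: an unexpected exception is reported as an error result.
            result = {"status": "error", "message": str(exc)}
        # Store the result so later checks can read it.
        context[check_id] = result
    return context
```

Passing the callables in explicitly keeps the sketch self-contained; an engine that loads modules from /checks by name would replace the `registry` lookup.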

Development Commands

Environment Setup

Set up development environment (uses local ./tmp/ paths):

export HM_QC_ENV=dev

For production (the default when HM_QC_ENV is unset; reports are written under /opt/QC):

unset HM_QC_ENV
# or
export HM_QC_ENV=production

See DEV_SETUP.md for complete setup guide.
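
For orientation, the environment switch could be implemented along these lines. This is a minimal sketch, not the contents of config.py: the function name `resolve_paths` and the dictionary keys are hypothetical, and the directory values are the ones listed under Important Implementation Details.

```python
import os
from pathlib import Path


def resolve_paths(env=None):
    """Return working and report directories for the given environment.

    Hypothetical helper; the real config.py may expose paths differently.
    """
    if env is None:
        # An unset HM_QC_ENV is treated as production.
        env = os.environ.get("HM_QC_ENV", "production")
    if env == "dev":
        return {
            "working_dir": Path("./tmp/HM_working"),
            "reports_dir": Path("./tmp/reports"),
        }
    return {
        "working_dir": Path("/tmp/HM_working"),
        "reports_dir": Path("/opt/QC/reports"),
    }
```

Keeping the lookup in one function means check modules and reporters never need to test the environment variable themselves.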

Running QC Checks

Run QC checks on a PDF file:

export HM_QC_ENV=dev
python launchers/HM_launcher_CLI.py <path_to_pdf> ./tmp/reports/report.html

Run QC checks on an image file (JPG, PNG, PSD):

export HM_QC_ENV=dev
python launchers/HM_launcher_CLI.py <path_to_image> ./tmp/reports/report.html

The launcher automatically detects file type and uses the appropriate profile.
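
The detection step can be pictured as a lookup from file extension to profile name. The sketch below is an assumption about how the launcher might do this, not a copy of HM_launcher_CLI.py; `select_profile` is a hypothetical name, and treating extensions case-insensitively is a choice made for this example.

```python
from pathlib import Path

# Extension-to-profile mapping taken from the file types named in this document.
PROFILE_BY_EXTENSION = {
    ".pdf": "HM.json",
    ".jpg": "HM_image.json",
    ".jpeg": "HM_image.json",
    ".png": "HM_image.json",
    ".psd": "HM_image.json",
}


def select_profile(file_path):
    """Return the profile filename for a given asset path (hypothetical helper)."""
    # Lower-casing the suffix is an assumption; the real launcher may be stricter.
    suffix = Path(file_path).suffix.lower()
    if suffix not in PROFILE_BY_EXTENSION:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return PROFILE_BY_EXTENSION[suffix]
```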

Box Integration

Run the Box hotfolder integration (polls for files and processes them):

python launchers/ford_qc_box_hotfolder_process.py

Architecture Notes

Check Module Pattern

All check modules must implement the standard run_check(config, context, check_id) function:

config: Dict with the check's configuration parameters from the profile JSON.
context: Shared context dictionary between checks containing results from previous checks.
check_id: String identifier for the specific check being run.

Check modules should return a dictionary with at least a status key that can be:

passed: Check succeeded
error: Check failed with an error
skipped: Check was intentionally skipped

Results from each check are stored in a shared context dictionary, allowing subsequent checks to build on prior results.
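
A minimal check module following this pattern might look like the sketch below. The check itself (a filename length limit) is invented for illustration, and the `filename` context key, the `max_length` config parameter, and the `message` result key are hypothetical names rather than part of the existing codebase.

```python
def run_check(config, context, check_id):
    """Hypothetical check: the asset filename must not exceed a maximum length."""
    # "filename" is an assumed context key set by an earlier step.
    filename = context.get("filename")
    if filename is None:
        return {"status": "skipped", "message": f"{check_id}: no filename in context"}

    # "max_length" is an assumed profile parameter with an arbitrary default.
    max_length = config.get("max_length", 100)
    if len(filename) > max_length:
        return {
            "status": "error",
            "message": f"{check_id}: filename has {len(filename)} characters, limit is {max_length}",
        }
    return {"status": "passed", "message": f"{check_id}: filename length OK"}
```

Each branch returns a dictionary with a `status` key, so the caller can handle every check the same way regardless of what it validates.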

API Dependencies

LlamaParse: Used for PDF parsing and text extraction
DSPy: Used for AI-based image analysis and content validation
BoxSDK: Used for Box integration in the hotfolder processor

Important Implementation Details

Environment Configuration: The system now uses config.py to manage paths based on environment:
- Development (HM_QC_ENV=dev): Uses ./tmp/HM_working/ and ./tmp/reports/
- Production (default): Uses /tmp/HM_working/ and /opt/QC/reports/
- All check modules and reporters use environment-aware paths
API Keys: The code contains hardcoded API keys for OpenAI and LlamaParse. These should be moved out of the source, for example into environment variables.
File Type Detection: The CLI launcher automatically detects file type:
- .pdf files → Uses HM.json profile (PDF checks)
- .jpg, .jpeg, .png, .psd → Uses HM_image.json profile (image checks)
Report Generation: The HTML reporter creates reports in environment-aware directories:
- Development: ./tmp/reports/
- Production: /opt/QC/reports/
Error Handling: The system uses a standardized error reporting format in check results, which should be maintained.
AI Integration: Several checks use OpenAI's GPT models for complex validation tasks through the DSPy framework.

Check-Specific Details

Imprint Check (`HM_imprint_check.py`)

Now validates reference code INCLUDING country code
Example: Expected 9000_10107-06_el-CY vs Detected 9000_10107-06_el-GR → ERROR
Properly extracts full reference codes with numeric prefixes (e.g., 9000_10107-06)
Skips OOH (out-of-home) files
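
The comparison rule can be sketched as follows. This is an illustration only: the function names are hypothetical, and it assumes the locale (such as el-CY) is always the part after the last underscore, which matches the example above but may not hold for every H&M reference code.

```python
def split_reference(code):
    """Split a reference code into (base, locale) at the last underscore."""
    base, _, locale = code.strip().rpartition("_")
    return base, locale


def compare_reference_codes(expected, detected):
    """Compare full reference codes, including the language-country suffix."""
    exp_base, exp_locale = split_reference(expected)
    det_base, det_locale = split_reference(detected)

    problems = []
    if exp_base != det_base:
        problems.append(f"reference mismatch: expected {exp_base}, detected {det_base}")
    if exp_locale != det_locale:
        problems.append(f"locale mismatch: expected {exp_locale}, detected {det_locale}")

    if problems:
        return {"status": "error", "message": "; ".join(problems)}
    return {"status": "passed"}
```

With the example values, `compare_reference_codes("9000_10107-06_el-CY", "9000_10107-06_el-GR")` returns an `error` result because only the locale differs.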

Censorship Check (`HM_censorship.py`)

CRITICAL RULE: Only runs on CEN files
GEN files are SKIPPED (no censorship check required)
Standard market files are SKIPPED
Uses DSPy with training images from ./supporting/censorship_trainset/
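
The gating rule can be expressed as a small guard that runs before any image analysis. How HM_censorship.py actually determines the file variant is not described here, so the `variant` argument and the function name `censorship_gate` are assumptions for this sketch.

```python
def censorship_gate(variant):
    """Return a skipped result for non-CEN files, or None when the check should proceed.

    `variant` is assumed to be "CEN", "GEN", or None for standard market files.
    """
    if variant == "CEN":
        # Only CEN files continue on to the censorship analysis.
        return None
    label = variant if variant else "standard market"
    return {
        "status": "skipped",
        "message": f"Censorship check not required for {label} files",
    }
```

A check module could call this first and return the result immediately whenever it is not `None`.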

Image Dimension Check (`HM_image_dimension_check.py`)

Validates actual pixel dimensions match filename specification
Example: Filename 1200x400 must have image that is exactly 1200×400 pixels
Works with various formats: 1200x400, 1080x1920px, 21.6x27.9cm
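
One way to read the dimension token from a filename and compare it with the real image size is sketched below. The regular expression and function names are assumptions rather than the code in HM_image_dimension_check.py, and the sketch deliberately skips cm values because converting them to pixels needs a resolution that this document does not specify.

```python
import re

# Matches tokens such as 1200x400, 1080x1920px, or 21.6x27.9cm.
DIMENSION_PATTERN = re.compile(r"(\d+(?:\.\d+)?)x(\d+(?:\.\d+)?)(px|cm)?", re.IGNORECASE)


def parse_dimensions(filename):
    """Return (width, height, unit) from the filename, or None if no token is found."""
    match = DIMENSION_PATTERN.search(filename)
    if match is None:
        return None
    width, height, unit = match.groups()
    # A token without a unit is assumed to be in pixels.
    return float(width), float(height), (unit or "px").lower()


def dimensions_match(filename, actual_width, actual_height):
    """Compare the filename specification with the actual pixel size."""
    parsed = parse_dimensions(filename)
    if parsed is None:
        return {"status": "skipped", "message": "No dimensions in filename"}
    width, height, unit = parsed
    if unit != "px":
        # Physical units are out of scope for this sketch.
        return {"status": "skipped", "message": f"Unit {unit} not compared in pixels"}
    if (width, height) == (actual_width, actual_height):
        return {"status": "passed"}
    return {
        "status": "error",
        "message": f"Expected {width:g}x{height:g}, got {actual_width}x{actual_height}",
    }
```

The actual width and height would come from the image itself, for instance from the `size` attribute of an image opened with Pillow, if that is the library in use.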

Filename Parsing

PDF filenames: Uses GPT to parse complex H&M naming conventions
Image filenames: Uses regex pattern matching for various formats (DOOH, OOH, Display, POS, DS)
Both extract: reference, language, dimensions, format, year (if applicable)
