ferrero-opentext/Python-Version/MARKDOWN_DOCS/CLAUDE.md
nickviljoen 03c5ab65a8 Docs: Update README and CLAUDE.md with folder-only template and EOL workflow
Added documentation for template-based folder-only mode (-N flag),
asset type overrides (EOL), environment-specific field mappings,
and updated config file references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 21:33:35 +02:00

15 KiB

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Ferrero DAM Content Scaling - Automated Python Workflow System

This is a production automation system that orchestrates asset workflows between OpenText DAM, Box, and PostgreSQL. The system processes digital assets through multiple lifecycle stages (A1→A2, A5→A6, B1→B2, A2→A3) with automated email notifications and daily reporting.

Key characteristics:

  • Production system with cron automation (every 5 minutes)
  • Dual authentication support: OAuth2 (default) and mTLS certificate
  • Python 3.6+ compatible for server deployment
  • Heavy use of external APIs (OpenText DAM, Box SDK)
  • PostgreSQL database with JSONB metadata storage

Common Commands

Development Setup

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Test connections
python scripts/test_connection.py

Running Workflows

OAuth2 (default):

python scripts/a1_to_a2_download.py
python scripts/a5_to_a6_download.py
python scripts/b1_to_b2_download.py
python scripts/a2_to_a3_upload_polling.py

mTLS authentication (requires --auth-pfx flag and whitelisted IP):

python scripts/a1_to_a2_download.py --auth-pfx
python scripts/a5_to_a6_download.py --auth-pfx
python scripts/b1_to_b2_download.py --auth-pfx

Testing

# Test all connections
python scripts/test_connection.py

# Test mTLS certificate
python scripts/test_mtls_cert.py
python scripts/test_connection.py --auth-pfx

# Generate daily report manually
python scripts/daily_report.py

Database Operations

# Connect to database
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking

# View recent assets
psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "SELECT tracking_id, original_filename, created_at FROM master_assets ORDER BY created_at DESC LIMIT 10;"

# Initialize/reset database
psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -f database/init.sql

Monitoring

# Watch workflow logs
tail -f logs/a1_to_a2.log
tail -f logs/a5_to_a6.log

# Search for errors
grep -i error logs/*.log
grep -i "NOT APPROVED" logs/a5_to_a6.log

# View cron execution logs
tail -f logs/cron_a1_a2.log

Architecture

System Design

Multi-stage workflow orchestration:

  1. A1→A2: Downloads master assets from DAM, uploads to Box with tracking IDs
  2. A5→A6: Filters and downloads ONLY rejected assets (ECOMMERCE STATUS = "NOT APPROVED")
  3. B1→B2: Handles global master campaigns separately
  4. A2→A3: Polls Box for agency uploads, matches tracking IDs, uploads back to DAM

Core Components

scripts/shared/ - Shared library modules used by all workflows:

  • dam_client.py: OpenText DAM API client with dual authentication

    • OAuth2 token management with auto-refresh
    • mTLS certificate authentication (PFX to PEM conversion)
    • Recursive folder search for assets (critical for subfolder support)
    • Asset download and metadata extraction
    • Important: Uses different base URLs for OAuth2 vs mTLS
  • box_client.py: Box SDK wrapper

    • JWT authentication via Box-config.json (located one folder up: ../Box-config.json)
    • Creates campaign folders with tracking ID naming
    • Preserves DAM folder structure during upload
    • Three separate Box folders for different workflows (A1→A2, A2→A3, B1→B2)
  • database.py: PostgreSQL connection pool and operations

    • Generates unique 6-character tracking IDs
    • find_or_create_tracking_id(): Smart lookup by opentext_id + local_campaign_id (reuses IDs for rework workflow)
    • Stores full metadata in JSONB column (no truncation)
    • Checks campaign completion status for A1→A2 updates
  • notifier.py: Email and webhook notifications

    • 12 email templates with color-coded themes
    • Mailgun SMTP integration
    • Detailed asset listings with Box URLs
    • Rejection comment extraction (A5→A6 specific)
  • metadata_extractor_mvp.py: Field mapping and metadata transformation

    • Loads MVP fields from environment-specific config (field_mappings_ppr.yaml or field_mappings_prod.yaml)
    • Two tracking modes: Full inheritance (from master metadata) and folder-only (-N suffix, uses config/asset_representation_template.json as base)
    • Handles filename-based updates, forced values, defaults, and asset type overrides
    • Asset type overrides (e.g., EOL) can set/remove fields with final precedence
    • Force-sets required values (e.g., STATE = "Local")
    • Uses DomainValue format for domained fields when setting values on template fields

Configuration Architecture

Hierarchical config system:

  • .env: Environment variables (credentials, never committed)
  • config/config.yaml: Main configuration (references .env vars)
  • config/field_mappings_ppr.yaml: PPR field definitions (auto-loaded when DAM URL contains 'ppr')
  • config/field_mappings_prod.yaml: PROD field definitions (auto-loaded otherwise)
  • config/asset_type_mappings.yaml: 3-letter code to DAM code mappings (e.g., EOL -> externallegalopinion)
  • config/asset_representation_template.json: Reference template for folder-only mode (-N flag), contains full field metadata structure
  • ../Box-config.json: Box JWT credentials (one directory up from Python-Version)

Important: Box-config.json MUST be located at ../Box-config.json (one folder up). This is hardcoded in config.yaml as rsa_private_key_path: ../Box-config.json.

Database Schema

Four main tables:

  1. master_assets: 35 columns with JSONB full_metadata
    • tracking_id (6-char unique), opentext_id, original_filename
    • Campaign relationships: global_master_campaign_id, local_campaign_id
    • JSONB storage prevents metadata truncation
  2. derivative_assets: Localized versions linked by tracking_id
  3. asset_events: Audit trail of all operations
  4. workflow_state: Tracks campaign processing state

Key indexes:

  • idx_master_assets_tracking_id for fast lookups
  • idx_master_assets_opentext_local for rework workflow matching
  • GIN indexes on JSONB fields for metadata queries

Workflow-Specific Logic

A1→A2 (Master Download):

  • Processes ONE campaign per run (cron handles frequency)
  • Recursive search in "Master Assets" folder
  • ONLY updates status A1→A2 if ALL assets succeed
  • Uses BOX_ROOT_FOLDER_A1_A2 (348304357505)

A5→A6 (Rejection/Rework):

  • Critical filter: Only processes assets where ECOMMERCE STATUS = "NOT APPROVED"
  • Extracts 10 rejection-related fields (approver/legal/IA&CC comments)
  • Reuses tracking IDs via find_or_create_tracking_id()
  • Uses separate Box folder: BOX_ROOT_FOLDER_A5_A6 (349441822875)
  • Search is in "Final Assets" folder, not "Master Assets"

B1→B2 (Global Masters):

  • Filters for campaigns where type = "Global comm" AND status = B1
  • Uses "MASTERS_" prefix for Box folder names
  • Searches "Final Assets" folder (B1 uses Final, not Master)
  • Uses BOX_ROOT_FOLDER_B1_B2 (349261192115)
  • Does NOT send webhook (only email)

A2→A3 (Upload from Box):

  • Polls BOX_ROOT_FOLDER_A2_A3 (348526703108) for new files
  • Parses tracking ID from filename (V2 format)
  • Two tracking modes:
    • Full inheritance: Loads master metadata from database, inherits all fields
    • Folder-only (-N suffix): Uses config/asset_representation_template.json as base, populates from filename
  • Updates Description, Language, State fields from filename
  • Applies asset type overrides (e.g., EOL sets Agency="-", Languages="Global", IPRights="Yes", removes validity dates)
  • Deletes file from Box after successful upload
  • Updates campaign status A2→A3 when ALL assets uploaded

Authentication Modes

Two separate authentication systems:

OAuth2 (Production Default):

mTLS Certificate (Optional):

  • Base URL: DAM_MTLS_BASE_URL (different endpoint: https://dev-auth.app-api.ferrero.com/00003/mm)
  • Requires PFX certificate at DAM_MTLS_CERT_PATH
  • Runtime flag: --auth-pfx
  • Requires IP whitelisting
  • PFX converted to PEM on-the-fly using cryptography library

Development Guidelines

Adding New Workflow Scripts

New workflow scripts should follow this pattern:

  1. Import from scripts.shared modules
  2. Setup rotating log handler (10MB, 28 backups)
  3. Parse command-line args (include --auth-pfx for mTLS support)
  4. Load config via load_config('config/config.yaml')
  5. Initialize clients (DAMClient, BoxClient, Database, Notifier)
  6. Process campaigns individually (one per run for cron safety)
  7. Send email notifications for all outcomes (success/partial/error)
  8. Log extensively with campaign and asset context

Modifying Field Mappings

To add/remove fields, edit the environment-specific file:

  • PPR: config/field_mappings_ppr.yaml
  • PROD: config/field_mappings_prod.yaml
mvp_fields:
  - FERRERO.FIELD.NEW_FIELD_NAME  # Add here
  - FERRERO.FIELD.ANOTHER_FIELD    # Add here

No code changes required - the system dynamically loads fields at runtime.

Asset Type Overrides

To add field overrides for a specific asset type, add an asset_type_overrides section to the field mappings file:

asset_type_overrides:
  EOL:  # Keyed by 3-letter asset type code
    FERRERO.MARKETING.FIELD.AGENCY NAME: "-"
    MAIN_LANGUAGES: "Global"
    FERRERO.FIELD.ASSET VALIDITY START PERIOD: ""  # Empty string removes the field

Overrides run after all other field processing (forced values, defaults) and take final precedence. An empty string value removes the field entirely from the payload.

Database Queries

Common patterns used in the codebase:

# Generate unique tracking ID
tracking_id = db.generate_unique_tracking_id()

# Find or reuse existing tracking ID (for rework)
result = db.find_or_create_tracking_id(opentext_id, local_campaign_id)

# Store master asset with full metadata
db.store_master_asset(
    tracking_id=tracking_id,
    opentext_id=asset_id,
    asset_data=asset,  # Full dict stored in JSONB
    box_file_id=box_result['file_id'],
    box_url=box_result['url'],
    upload_folder_id=final_folder_id,
    global_master_campaign_id=global_ref['global_master_campaign_id'],
    local_campaign_id=campaign_number
)

# Check if all campaign assets uploaded
all_done = db.check_all_campaign_assets_uploaded(campaign_number)

Email Notifications

12 template types in notifier.py:

  • a1_to_a2_complete, a1_to_a2_partial, a1_to_a2_no_assets
  • a5_to_a6_rejections, a5_to_a6_partial, a5_to_a6_no_rejections
  • b1_to_b2_complete, b1_to_b2_partial, b1_to_b2_no_assets
  • a2_to_a3_file_uploaded, a2_to_a3_complete
  • daily_summary_report

Each template includes campaign context, asset lists, Box URLs, and next steps.

Error Handling Patterns

The codebase uses extensive try/except with logging:

try:
    # Process asset
    file_path = dam.download_asset(asset_id, output_dir)
    box_result = box.upload_with_tracking_id(file_path, ...)
    db.store_master_asset(...)
    processed_assets.append(asset_info)
except Exception as e:
    logger.error("Failed: {}".format(str(e)))
    failed_assets.append(asset_info)

Important: Workflows only update campaign status if ALL assets succeed to prevent partial state issues.

Python 3.6 Compatibility

Required for server deployment:

  • Use .format() instead of f-strings
  • Avoid walrus operator (:=)
  • No dict union operator (|)
  • Type hints are optional
  • Use os.path instead of Path where possible

Deployment Notes

Server Directory Structure

/opt/ferrero-automation/
├── Box-config.json          # Box JWT config HERE (one level up!)
└── Python-Version/
    ├── .env                 # Environment variables
    ├── config/
    │   ├── config.yaml
    │   ├── field_mappings_ppr.yaml
    │   ├── field_mappings_prod.yaml
    │   ├── asset_type_mappings.yaml
    │   ├── asset_representation_template.json
    │   └── certificates/
    │       └── dam-mtls-dev.pfx
    ├── database/
    │   └── init.sql
    ├── scripts/
    │   ├── *.py             # Workflow scripts
    │   └── shared/          # Shared modules
    ├── logs/                # Auto-created by scripts
    ├── temp/                # Auto-created for downloads
    └── venv/                # Virtual environment

Cron Configuration

Production cron runs workflows every 5 minutes:

*/5 * * * * cd /opt/ferrero-automation/Python-Version && venv/bin/python scripts/a1_to_a2_download.py >> logs/cron_a1_a2.log 2>&1
*/5 * * * * cd /opt/ferrero-automation/Python-Version && venv/bin/python scripts/a5_to_a6_download.py >> logs/cron_a5_a6.log 2>&1
*/5 * * * * cd /opt/ferrero-automation/Python-Version && venv/bin/python scripts/b1_to_b2_download.py >> logs/cron_b1_b2.log 2>&1
*/5 * * * * cd /opt/ferrero-automation/Python-Version && venv/bin/python scripts/a2_to_a3_upload_polling.py >> logs/cron_a2_a3.log 2>&1
0 19 * * * cd /opt/ferrero-automation/Python-Version && venv/bin/python scripts/daily_report.py >> logs/daily_report.log 2>&1

Security Considerations

Files to protect (chmod 600):

  • .env (all credentials)
  • config/certificates/*.pfx (mTLS certificates)
  • ../Box-config.json (Box JWT private key)

Gitignored secrets:

  • .env
  • config/certificates/
  • *.pfx, *.p12
  • logs/, temp/

Production checklist:

  • Change default database password (ferrero_pass_2025)
  • Update SMTP credentials
  • Get production mTLS certificate (current is dev)
  • Verify all Box folder IDs are production IDs
  • Configure firewall rules for IP whitelisting (if using mTLS)

Troubleshooting

"Box-config.json not found"

Cause: Box JWT file must be one directory up from Python-Version Fix: Verify file exists at ../Box-config.json relative to Python-Version directory

"mTLS connection refused"

Cause: IP not whitelisted or wrong base URL Fix: Check DAM_MTLS_BASE_URL uses the correct mTLS endpoint and verify IP is whitelisted

"No A1 campaigns found"

Expected behavior: Scripts check every 5 minutes via cron. This message is normal when no campaigns are in A1 status.

Database connection fails

Fix: Verify PostgreSQL is running on port 5437 (not default 5432)

docker ps | grep ferrero
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "SELECT 1;"

Emails not sending

Fix: Check SMTP configuration in .env, verify Mailgun credentials

Additional Documentation

  • README.md: Complete user documentation with deployment guide
  • DEPLOYMENT_GUIDE.md: Server deployment step-by-step
  • DATABASE_SCHEMA.md: Full schema reference with field descriptions
  • database/init.sql: Complete database schema with indexes