ferrero-opentext/Python-Version/MARKDOWN_DOCS/CLAUDE.md
nickviljoen 03c5ab65a8 Docs: Update README and CLAUDE.md with folder-only template and EOL workflow
Added documentation for template-based folder-only mode (-N flag),
asset type overrides (EOL), environment-specific field mappings,
and updated config file references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 21:33:35 +02:00

404 lines
15 KiB
Markdown

# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
**Ferrero DAM Content Scaling - Automated Python Workflow System**
This is a production automation system that orchestrates asset workflows between OpenText DAM, Box, and PostgreSQL. The system processes digital assets through multiple lifecycle stages (A1→A2, A5→A6, B1→B2, A2→A3) with automated email notifications and daily reporting.
**Key characteristics:**
- Production system with cron automation (every 5 minutes)
- Dual authentication support: OAuth2 (default) and mTLS certificate
- Python 3.6+ compatible for server deployment
- Heavy use of external APIs (OpenText DAM, Box SDK)
- PostgreSQL database with JSONB metadata storage
## Common Commands
### Development Setup
```bash
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Test connections
python scripts/test_connection.py
```
### Running Workflows
**OAuth2 (default):**
```bash
python scripts/a1_to_a2_download.py
python scripts/a5_to_a6_download.py
python scripts/b1_to_b2_download.py
python scripts/a2_to_a3_upload_polling.py
```
**mTLS authentication (requires --auth-pfx flag and whitelisted IP):**
```bash
python scripts/a1_to_a2_download.py --auth-pfx
python scripts/a5_to_a6_download.py --auth-pfx
python scripts/b1_to_b2_download.py --auth-pfx
```
### Testing
```bash
# Test all connections
python scripts/test_connection.py
# Test mTLS certificate
python scripts/test_mtls_cert.py
python scripts/test_connection.py --auth-pfx
# Generate daily report manually
python scripts/daily_report.py
```
### Database Operations
```bash
# Connect to database
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking
# View recent assets
psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "SELECT tracking_id, original_filename, created_at FROM master_assets ORDER BY created_at DESC LIMIT 10;"
# Initialize/reset database
psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -f database/init.sql
```
### Monitoring
```bash
# Watch workflow logs
tail -f logs/a1_to_a2.log
tail -f logs/a5_to_a6.log
# Search for errors
grep -i error logs/*.log
grep -i "NOT APPROVED" logs/a5_to_a6.log
# View cron execution logs
tail -f logs/cron_a1_a2.log
```
## Architecture
### System Design
**Multi-stage workflow orchestration:**
1. **A1→A2**: Downloads master assets from DAM, uploads to Box with tracking IDs
2. **A5→A6**: Filters and downloads ONLY rejected assets (ECOMMERCE STATUS = "NOT APPROVED")
3. **B1→B2**: Handles global master campaigns separately
4. **A2→A3**: Polls Box for agency uploads, matches tracking IDs, uploads back to DAM
### Core Components
**scripts/shared/** - Shared library modules used by all workflows:
- **dam_client.py**: OpenText DAM API client with dual authentication
- OAuth2 token management with auto-refresh
- mTLS certificate authentication (PFX to PEM conversion)
- Recursive folder search for assets (critical for subfolder support)
- Asset download and metadata extraction
- **Important**: Uses different base URLs for OAuth2 vs mTLS
- **box_client.py**: Box SDK wrapper
- JWT authentication via Box-config.json (located one folder up: `../Box-config.json`)
- Creates campaign folders with tracking ID naming
- Preserves DAM folder structure during upload
- Three separate Box folders for different workflows (A1→A2, A2→A3, B1→B2)
- **database.py**: PostgreSQL connection pool and operations
- Generates unique 6-character tracking IDs
- `find_or_create_tracking_id()`: Smart lookup by opentext_id + local_campaign_id (reuses IDs for rework workflow)
- Stores full metadata in JSONB column (no truncation)
- Checks campaign completion status for A1→A2 updates
- **notifier.py**: Email and webhook notifications
- 12 email templates with color-coded themes
- Mailgun SMTP integration
- Detailed asset listings with Box URLs
- Rejection comment extraction (A5→A6 specific)
- **metadata_extractor_mvp.py**: Field mapping and metadata transformation
- Loads MVP fields from environment-specific config (`field_mappings_ppr.yaml` or `field_mappings_prod.yaml`)
- **Two tracking modes:** Full inheritance (from master metadata) and folder-only (`-N` suffix, uses `config/asset_representation_template.json` as base)
- Handles filename-based updates, forced values, defaults, and asset type overrides
- Asset type overrides (e.g., EOL) can set/remove fields with final precedence
- Force-sets required values (e.g., STATE = "Local")
- Uses DomainValue format for domained fields when setting values on template fields
### Configuration Architecture
**Hierarchical config system:**
- `.env`: Environment variables (credentials, never committed)
- `config/config.yaml`: Main configuration (references .env vars)
- `config/field_mappings_ppr.yaml`: PPR field definitions (auto-loaded when DAM URL contains 'ppr')
- `config/field_mappings_prod.yaml`: PROD field definitions (auto-loaded otherwise)
- `config/asset_type_mappings.yaml`: 3-letter code to DAM code mappings (e.g., EOL -> externallegalopinion)
- `config/asset_representation_template.json`: Reference template for folder-only mode (-N flag), contains full field metadata structure
- `../Box-config.json`: Box JWT credentials (one directory up from Python-Version)
**Important**: Box-config.json MUST be located at `../Box-config.json` (one folder up). This is hardcoded in config.yaml as `rsa_private_key_path: ../Box-config.json`.
### Database Schema
**Four main tables:**
1. **master_assets**: 35 columns with JSONB full_metadata
- tracking_id (6-char unique), opentext_id, original_filename
- Campaign relationships: global_master_campaign_id, local_campaign_id
- JSONB storage prevents metadata truncation
2. **derivative_assets**: Localized versions linked by tracking_id
3. **asset_events**: Audit trail of all operations
4. **workflow_state**: Tracks campaign processing state
**Key indexes:**
- `idx_master_assets_tracking_id` for fast lookups
- `idx_master_assets_opentext_local` for rework workflow matching
- GIN indexes on JSONB fields for metadata queries
### Workflow-Specific Logic
**A1→A2 (Master Download):**
- Processes ONE campaign per run (cron handles frequency)
- Recursive search in "Master Assets" folder
- ONLY updates status A1→A2 if ALL assets succeed
- Uses `BOX_ROOT_FOLDER_A1_A2` (348304357505)
**A5→A6 (Rejection/Rework):**
- **Critical filter**: Only processes assets where `ECOMMERCE STATUS = "NOT APPROVED"`
- Extracts 10 rejection-related fields (approver/legal/IA&CC comments)
- Reuses tracking IDs via `find_or_create_tracking_id()`
- Uses separate Box folder: `BOX_ROOT_FOLDER_A5_A6` (349441822875)
- Search is in "Final Assets" folder, not "Master Assets"
**B1→B2 (Global Masters):**
- Filters for campaigns where type = "Global comm" AND status = B1
- Uses "MASTERS_" prefix for Box folder names
- Searches "Final Assets" folder (B1 uses Final, not Master)
- Uses `BOX_ROOT_FOLDER_B1_B2` (349261192115)
- Does NOT send webhook (only email)
**A2→A3 (Upload from Box):**
- Polls `BOX_ROOT_FOLDER_A2_A3` (348526703108) for new files
- Parses tracking ID from filename (V2 format)
- Two tracking modes:
- **Full inheritance**: Loads master metadata from database, inherits all fields
- **Folder-only** (`-N` suffix): Uses `config/asset_representation_template.json` as base, populates from filename
- Updates Description, Language, State fields from filename
- Applies asset type overrides (e.g., EOL sets Agency="-", Languages="Global", IPRights="Yes", removes validity dates)
- Deletes file from Box after successful upload
- Updates campaign status A2→A3 when ALL assets uploaded
### Authentication Modes
**Two separate authentication systems:**
**OAuth2 (Production Default):**
- Base URL: `DAM_BASE_URL` (https://ppr.dam.ferrero.com/otmmapi)
- Token endpoint: `DAM_AUTH_URL`
- Auto-refreshing access token (1 hour expiry)
- No special flags needed
**mTLS Certificate (Optional):**
- Base URL: `DAM_MTLS_BASE_URL` (different endpoint: https://dev-auth.app-api.ferrero.com/00003/mm)
- Requires PFX certificate at `DAM_MTLS_CERT_PATH`
- Runtime flag: `--auth-pfx`
- Requires IP whitelisting
- PFX converted to PEM on-the-fly using cryptography library
## Development Guidelines
### Adding New Workflow Scripts
New workflow scripts should follow this pattern:
1. Import from `scripts.shared` modules
2. Setup rotating log handler (10MB, 28 backups)
3. Parse command-line args (include `--auth-pfx` for mTLS support)
4. Load config via `load_config('config/config.yaml')`
5. Initialize clients (DAMClient, BoxClient, Database, Notifier)
6. Process campaigns individually (one per run for cron safety)
7. Send email notifications for all outcomes (success/partial/error)
8. Log extensively with campaign and asset context
### Modifying Field Mappings
**To add/remove fields, edit the environment-specific file:**
- PPR: `config/field_mappings_ppr.yaml`
- PROD: `config/field_mappings_prod.yaml`
```yaml
mvp_fields:
- FERRERO.FIELD.NEW_FIELD_NAME # Add here
- FERRERO.FIELD.ANOTHER_FIELD # Add here
```
**No code changes required** - the system dynamically loads fields at runtime.
### Asset Type Overrides
To add field overrides for a specific asset type, add an `asset_type_overrides` section to the field mappings file:
```yaml
asset_type_overrides:
EOL: # Keyed by 3-letter asset type code
FERRERO.MARKETING.FIELD.AGENCY NAME: "-"
MAIN_LANGUAGES: "Global"
FERRERO.FIELD.ASSET VALIDITY START PERIOD: "" # Empty string removes the field
```
Overrides run after all other field processing (forced values, defaults) and take final precedence. An empty string value removes the field entirely from the payload.
### Database Queries
Common patterns used in the codebase:
```python
# Generate unique tracking ID
tracking_id = db.generate_unique_tracking_id()
# Find or reuse existing tracking ID (for rework)
result = db.find_or_create_tracking_id(opentext_id, local_campaign_id)
# Store master asset with full metadata
db.store_master_asset(
tracking_id=tracking_id,
opentext_id=asset_id,
asset_data=asset, # Full dict stored in JSONB
box_file_id=box_result['file_id'],
box_url=box_result['url'],
upload_folder_id=final_folder_id,
global_master_campaign_id=global_ref['global_master_campaign_id'],
local_campaign_id=campaign_number
)
# Check if all campaign assets uploaded
all_done = db.check_all_campaign_assets_uploaded(campaign_number)
```
### Email Notifications
**12 template types in notifier.py:**
- a1_to_a2_complete, a1_to_a2_partial, a1_to_a2_no_assets
- a5_to_a6_rejections, a5_to_a6_partial, a5_to_a6_no_rejections
- b1_to_b2_complete, b1_to_b2_partial, b1_to_b2_no_assets
- a2_to_a3_file_uploaded, a2_to_a3_complete
- daily_summary_report
Each template includes campaign context, asset lists, Box URLs, and next steps.
### Error Handling Patterns
The codebase uses extensive try/except with logging:
```python
try:
# Process asset
file_path = dam.download_asset(asset_id, output_dir)
box_result = box.upload_with_tracking_id(file_path, ...)
db.store_master_asset(...)
processed_assets.append(asset_info)
except Exception as e:
logger.error("Failed: {}".format(str(e)))
failed_assets.append(asset_info)
```
**Important**: Workflows only update campaign status if ALL assets succeed to prevent partial state issues.
### Python 3.6 Compatibility
**Required for server deployment:**
- Use `.format()` instead of f-strings
- Avoid walrus operator (`:=`)
- No dict union operator (`|`)
- Type hints are optional
- Use `os.path` instead of `Path` where possible
## Deployment Notes
### Server Directory Structure
```
/opt/ferrero-automation/
├── Box-config.json # Box JWT config HERE (one level up!)
└── Python-Version/
├── .env # Environment variables
├── config/
│ ├── config.yaml
│ ├── field_mappings_ppr.yaml
│ ├── field_mappings_prod.yaml
│ ├── asset_type_mappings.yaml
│ ├── asset_representation_template.json
│ └── certificates/
│ └── dam-mtls-dev.pfx
├── database/
│ └── init.sql
├── scripts/
│ ├── *.py # Workflow scripts
│ └── shared/ # Shared modules
├── logs/ # Auto-created by scripts
├── temp/ # Auto-created for downloads
└── venv/ # Virtual environment
```
### Cron Configuration
Production cron runs workflows every 5 minutes:
```cron
*/5 * * * * cd /opt/ferrero-automation/Python-Version && venv/bin/python scripts/a1_to_a2_download.py >> logs/cron_a1_a2.log 2>&1
*/5 * * * * cd /opt/ferrero-automation/Python-Version && venv/bin/python scripts/a5_to_a6_download.py >> logs/cron_a5_a6.log 2>&1
*/5 * * * * cd /opt/ferrero-automation/Python-Version && venv/bin/python scripts/b1_to_b2_download.py >> logs/cron_b1_b2.log 2>&1
*/5 * * * * cd /opt/ferrero-automation/Python-Version && venv/bin/python scripts/a2_to_a3_upload_polling.py >> logs/cron_a2_a3.log 2>&1
0 19 * * * cd /opt/ferrero-automation/Python-Version && venv/bin/python scripts/daily_report.py >> logs/daily_report.log 2>&1
```
### Security Considerations
**Files to protect (chmod 600):**
- `.env` (all credentials)
- `config/certificates/*.pfx` (mTLS certificates)
- `../Box-config.json` (Box JWT private key)
**Gitignored secrets:**
- `.env`
- `config/certificates/`
- `*.pfx`, `*.p12`
- `logs/`, `temp/`
**Production checklist:**
- Change default database password (ferrero_pass_2025)
- Update SMTP credentials
- Get production mTLS certificate (current is dev)
- Verify all Box folder IDs are production IDs
- Configure firewall rules for IP whitelisting (if using mTLS)
## Troubleshooting
### "Box-config.json not found"
**Cause**: Box JWT file must be one directory up from Python-Version
**Fix**: Verify file exists at `../Box-config.json` relative to Python-Version directory
### "mTLS connection refused"
**Cause**: IP not whitelisted or wrong base URL
**Fix**: Check `DAM_MTLS_BASE_URL` uses the correct mTLS endpoint and verify IP is whitelisted
### "No A1 campaigns found"
**Expected behavior**: Scripts check every 5 minutes via cron. This message is normal when no campaigns are in A1 status.
### Database connection fails
**Fix**: Verify PostgreSQL is running on port 5437 (not default 5432)
```bash
docker ps | grep ferrero
PGPASSWORD=ferrero_pass_2025 psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking -c "SELECT 1;"
```
### Emails not sending
**Fix**: Check SMTP configuration in .env, verify Mailgun credentials
## Additional Documentation
- **README.md**: Complete user documentation with deployment guide
- **DEPLOYMENT_GUIDE.md**: Server deployment step-by-step
- **DATABASE_SCHEMA.md**: Full schema reference with field descriptions
- **database/init.sql**: Complete database schema with indexes