ferrero-opentext/Python-Version/MARKDOWN_DOCS/DATABASE_SCHEMA.md

798 lines
20 KiB
Markdown

# Ferrero Content Scaling - Database Schema Documentation
**Database:** ferrero_tracking
**PostgreSQL Version:** 15+
**Last Updated:** November 3, 2025
---
## Overview
The database tracks master assets, derivative assets, and campaign relationships for the Ferrero content scaling workflow.
**Key Features:**
- **Prefix-based tracking ID system** (master files start with 'M', regular files never start with 'M')
- Full metadata storage (JSONB, no truncation)
- Global campaign relationship tracking
- All-done checks for workflow automation
---
## Database Connection
**Local Development:**
```
Host: localhost
Port: 5437
Database: ferrero_tracking
User: ferrero_user
Password: ferrero_pass_2025
```
**Connection String:**
```
postgresql://ferrero_user:ferrero_pass_2025@localhost:5437/ferrero_tracking
```
---
## Tables
### master_assets
**Purpose:** Stores master assets downloaded from DAM with tracking IDs and full metadata
**Schema:**
```sql
CREATE TABLE master_assets (
-- Primary Key
id SERIAL PRIMARY KEY,
-- Tracking & Identification
tracking_id VARCHAR(6) UNIQUE NOT NULL,
opentext_id VARCHAR(255) NOT NULL,
-- File Information
original_filename VARCHAR(500) NOT NULL,
file_extension VARCHAR(20),
file_size_bytes BIGINT,
mime_type VARCHAR(100),
-- Metadata Fields (extracted for quick access)
brand_code VARCHAR(5),
brand_name VARCHAR(255),
country_code VARCHAR(2),
country_name VARCHAR(255),
language_code VARCHAR(3),
language_name VARCHAR(100),
subject_title VARCHAR(255),
asset_type VARCHAR(3),
asset_type_name VARCHAR(255),
duration_seconds INTEGER,
aspect_ratio VARCHAR(10),
width_px INTEGER,
height_px INTEGER,
-- Campaign Relationships (NEW!)
global_master_campaign_id VARCHAR(50),
global_master_folder_id VARCHAR(255),
-- Workflow Information
upload_directory VARCHAR(1000),
description TEXT,
-- Full Metadata Storage (NEW - CRITICAL!)
full_metadata JSONB,
-- Tags and Categories
tags TEXT[],
categories TEXT[],
-- Status
status VARCHAR(50) DEFAULT 'active',
is_deleted BOOLEAN DEFAULT FALSE,
deleted_at TIMESTAMP,
-- Tracking
ingested_by VARCHAR(255),
ingested_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```
**Indexes:**
```sql
CREATE INDEX idx_master_assets_tracking_id ON master_assets(tracking_id);
CREATE INDEX idx_master_assets_opentext_id ON master_assets(opentext_id);
CREATE INDEX idx_master_assets_status ON master_assets(status);
CREATE INDEX idx_master_assets_brand_code ON master_assets(brand_code);
CREATE INDEX idx_master_assets_created_at ON master_assets(created_at);
CREATE INDEX idx_master_assets_global_master ON master_assets(global_master_campaign_id);
```
**Triggers:**
```sql
-- Auto-update updated_at timestamp
CREATE TRIGGER update_master_assets_updated_at
BEFORE UPDATE ON master_assets
FOR EACH ROW
EXECUTE FUNCTION update_updated_at_column();
-- Log changes
CREATE TRIGGER log_master_asset_changes
AFTER INSERT OR UPDATE OR DELETE ON master_assets
FOR EACH ROW
EXECUTE FUNCTION log_master_asset_event();
```
**Key Changes Made:**
1.**Added `full_metadata JSONB`** (Session 2) - Stores complete 200KB+ metadata without truncation
2.**Added `global_master_campaign_id`** (Session 3) - Tracks Global master campaign reference
3.**Added `global_master_folder_id`** (Session 3) - Stores Global master folder ID
---
### derivative_assets
**Purpose:** Tracks derivative assets (localized versions) created from master assets
**Schema:**
```sql
CREATE TABLE derivative_assets (
-- Primary Key
id SERIAL PRIMARY KEY,
-- Link to Master Asset
tracking_id VARCHAR(6) NOT NULL REFERENCES master_assets(tracking_id),
master_asset_id INTEGER REFERENCES master_assets(id) ON DELETE CASCADE,
-- Derivative File Information
derivative_filename VARCHAR(500) NOT NULL,
file_extension VARCHAR(20),
-- Metadata from V2 Filename
language_code VARCHAR(3),
language_name VARCHAR(100),
asset_type VARCHAR(3),
asset_type_name VARCHAR(255),
duration_seconds INTEGER,
aspect_ratio VARCHAR(10),
omg_job_number VARCHAR(10),
spot_version VARCHAR(10),
has_master_flag BOOLEAN DEFAULT FALSE,
-- Status Tracking
status VARCHAR(50) DEFAULT 'active',
uploaded_at TIMESTAMP,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
created_by VARCHAR(255)
);
```
**Indexes:**
```sql
CREATE INDEX idx_derivative_assets_tracking_id ON derivative_assets(tracking_id);
CREATE INDEX idx_derivative_assets_master_id ON derivative_assets(master_asset_id);
CREATE INDEX idx_derivative_assets_omg_job ON derivative_assets(omg_job_number);
```
**Foreign Keys:**
```sql
ALTER TABLE derivative_assets
ADD CONSTRAINT derivative_assets_tracking_id_fkey
FOREIGN KEY (tracking_id) REFERENCES master_assets(tracking_id);
ALTER TABLE derivative_assets
ADD CONSTRAINT derivative_assets_master_asset_id_fkey
FOREIGN KEY (master_asset_id) REFERENCES master_assets(id) ON DELETE CASCADE;
```
**Triggers:**
```sql
CREATE TRIGGER log_derivative_asset_changes
AFTER INSERT OR UPDATE ON derivative_assets
FOR EACH ROW
EXECUTE FUNCTION log_derivative_asset_event();
```
---
## Migration Scripts
### Complete Setup (Fresh Install)
**File:** `setup_database.sql`
```sql
-- Create database (run as postgres user)
CREATE DATABASE ferrero_tracking;
CREATE USER ferrero_user WITH PASSWORD 'ferrero_pass_2025';
GRANT ALL PRIVILEGES ON DATABASE ferrero_tracking TO ferrero_user;
-- Connect to database
\c ferrero_tracking
-- Create tables (see full schema above)
-- ... master_assets table ...
-- ... derivative_assets table ...
-- Create indexes
-- ... all indexes ...
-- Create triggers
CREATE OR REPLACE FUNCTION update_updated_at_column()
RETURNS TRIGGER AS $$
BEGIN
NEW.updated_at = CURRENT_TIMESTAMP;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION log_master_asset_event()
RETURNS TRIGGER AS $$
BEGIN
-- Log to audit table if exists
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```
### Migration 1: Add full_metadata JSONB (October 30, 2025)
**Purpose:** Store complete DAM metadata without truncation
**Before:** Metadata truncated to 5,000 characters in description field
**After:** Full metadata stored in JSONB column (200KB+)
```sql
-- Add column
ALTER TABLE master_assets ADD COLUMN IF NOT EXISTS full_metadata JSONB;
-- Update existing records (if migrating from old schema)
-- Note: Old data was truncated, can't recover lost metadata
-- New downloads will populate this column
```
**Impact:**
- ✅ No more metadata truncation
- ✅ Can query metadata using PostgreSQL JSON operators
- ✅ Efficient storage and retrieval
- ✅ Enables full metadata preservation
### Migration 2: Add Global Campaign Reference (November 3, 2025)
**Purpose:** Track relationship between local campaigns and global masters
**Why:** Local Adaptation campaigns reference a Global Master campaign. This relationship needs to be stored for future cross-campaign features.
```sql
-- Add columns
ALTER TABLE master_assets
ADD COLUMN IF NOT EXISTS global_master_campaign_id VARCHAR(50);
ALTER TABLE master_assets
ADD COLUMN IF NOT EXISTS global_master_folder_id VARCHAR(255);
-- Add index for queries
CREATE INDEX IF NOT EXISTS idx_master_assets_global_master
ON master_assets(global_master_campaign_id);
```
**How It's Populated:**
- Extracted from `inherited_metadata_collections[]` in asset JSON
- Field: `FERRERO.FIELD.GLOBAL CAMPAIGN REFERENCE`
- Only populated for Local Adaptation assets (A1→A2 workflow)
- Global Master assets (B1→B2) have NULL values (they don't reference themselves)
**Impact:**
- ✅ Can find all local assets for a Global Master
- ✅ Enables cross-campaign reporting
- ✅ Foundation for future features
**Example Query:**
```sql
-- Find all local assets for Global Master C000000068
SELECT tracking_id, original_filename, created_at
FROM master_assets
WHERE global_master_campaign_id = 'C000000068'
ORDER BY created_at DESC;
```
---
## Column Descriptions
### master_assets Table
| Column | Type | Nullable | Default | Description |
|--------|------|----------|---------|-------------|
| `id` | SERIAL | NOT NULL | auto | Primary key |
| `tracking_id` | VARCHAR(6) | NOT NULL | - | **Unique 6-char tracking ID** (links master to derivatives)<br/>**Master files (B1→B2):** Start with 'M' (e.g., `M4x7Qp`)<br/>**Regular files (A1→A2, A5→A6):** Never start with 'M' (e.g., `a4x7Qp`) |
| `opentext_id` | VARCHAR(255) | NOT NULL | - | DAM asset ID (hex string like 0008a50...) |
| `original_filename` | VARCHAR(500) | NOT NULL | - | Filename without extension |
| `file_extension` | VARCHAR(20) | NULL | - | File extension with dot (.mp4, .jpg) |
| `file_size_bytes` | BIGINT | NULL | - | File size in bytes |
| `mime_type` | VARCHAR(100) | NULL | - | MIME type (video/mp4, image/jpeg) |
| `brand_code` | VARCHAR(5) | NULL | - | Brand code (RAF, KIN, etc.) |
| `country_code` | VARCHAR(2) | NULL | - | Country code (DE, IT, etc.) |
| `language_code` | VARCHAR(3) | NULL | - | Language code (de, it, etc.) |
| `asset_type` | VARCHAR(3) | NULL | - | Asset type code (TVC, OLV, etc.) |
| `width_px` | INTEGER | NULL | - | Video/image width in pixels |
| `height_px` | INTEGER | NULL | - | Video/image height in pixels |
| `upload_directory` | VARCHAR(1000) | NULL | - | **Final Assets folder ID for upload** |
| `description` | TEXT | NULL | - | Box file info (simplified, not metadata) |
| `full_metadata` | JSONB | NULL | - | **Complete DAM metadata** (200KB+, no truncation) ✨ |
| `global_master_campaign_id` | VARCHAR(50) | NULL | - | **Global master campaign ID** (C000000068) ✨ |
| `global_master_folder_id` | VARCHAR(255) | NULL | - | **Global master folder ID** ✨ |
| `status` | VARCHAR(50) | NULL | 'active' | Record status (active, deleted) |
| `created_at` | TIMESTAMP | NULL | NOW() | When record was created |
| `updated_at` | TIMESTAMP | NULL | NOW() | When record was last updated |
**✨ = New columns added during this project**
---
## Queries
### Common Queries
**1. Find all assets with tracking IDs:**
```sql
SELECT tracking_id, original_filename, created_at
FROM master_assets
WHERE status = 'active'
ORDER BY created_at DESC
LIMIT 20;
```
**2. Get full metadata for a tracking ID:**
```sql
SELECT full_metadata
FROM master_assets
WHERE tracking_id = 'ABC123';
```
**3. Find assets for a specific Global Master campaign:**
```sql
SELECT tracking_id, original_filename, global_master_campaign_id
FROM master_assets
WHERE global_master_campaign_id = 'C000000068'
ORDER BY created_at DESC;
```
**4. Check campaign upload completion:**
```sql
-- Count master assets
SELECT COUNT(DISTINCT tracking_id)
FROM master_assets
WHERE campaign_id = 'campaign_hex_id'
AND status = 'active';
-- Count uploaded derivatives
SELECT COUNT(DISTINCT ma.tracking_id)
FROM master_assets ma
INNER JOIN derivative_assets da ON ma.tracking_id = da.tracking_id
WHERE ma.campaign_id = 'campaign_hex_id';
```
**5. Get all local campaigns for a Global Master:**
```sql
-- Assets that reference a specific Global Master
SELECT
tracking_id,
original_filename,
global_master_campaign_id,
global_master_folder_id,
created_at
FROM master_assets
WHERE global_master_campaign_id = 'C000000068'
AND global_master_campaign_id IS NOT NULL
ORDER BY created_at DESC;
```
**6. Statistics:**
```sql
-- Count by status
SELECT status, COUNT(*)
FROM master_assets
GROUP BY status;
-- Count by workflow type
SELECT
CASE
WHEN global_master_campaign_id IS NULL THEN 'Global Master (B1)'
ELSE 'Local Adaptation (A1)'
END as workflow_type,
COUNT(*)
FROM master_assets
WHERE status = 'active'
GROUP BY workflow_type;
```
---
## Backup & Restore
### Backup Database
**Full backup:**
```bash
pg_dump -h localhost -p 5437 -U ferrero_user -d ferrero_tracking > backup_$(date +%Y%m%d).sql
```
**Schema only:**
```bash
pg_dump -h localhost -p 5437 -U ferrero_user -d ferrero_tracking --schema-only > schema.sql
```
**Data only:**
```bash
pg_dump -h localhost -p 5437 -U ferrero_user -d ferrero_tracking --data-only > data.sql
```
### Restore Database
**From backup:**
```bash
psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking < backup_20251103.sql
```
**Create fresh database:**
```bash
# Drop and recreate (CAUTION: Deletes all data!)
dropdb -h localhost -p 5437 -U ferrero_user ferrero_tracking
createdb -h localhost -p 5437 -U ferrero_user ferrero_tracking
# Restore schema and data
psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking < backup.sql
```
---
## Migration History
### Session 1 (October 29, 2025)
**Initial database setup**
- Created master_assets table
- Created derivative_assets table
- Basic tracking ID system
### Session 2 (October 30, 2025)
**Migration: Add full_metadata JSONB**
**Problem:** Metadata was truncated to 5,000 characters
**Solution:** Add JSONB column for complete metadata storage
```sql
ALTER TABLE master_assets ADD COLUMN full_metadata JSONB;
```
**Impact:**
- Before: 5,000 chars max (truncated 195KB of data)
- After: Full 200KB+ metadata preserved
- Enables complete metadata inheritance
### Session 3 (November 3, 2025)
**Migration: Add Global Campaign Reference**
**Problem:** No way to track which Global Master a local asset came from
**Solution:** Extract and store Global Campaign Reference
```sql
ALTER TABLE master_assets
ADD COLUMN global_master_campaign_id VARCHAR(50);
ALTER TABLE master_assets
ADD COLUMN global_master_folder_id VARCHAR(255);
CREATE INDEX idx_master_assets_global_master
ON master_assets(global_master_campaign_id);
```
**Impact:**
- Can query all local assets for a Global Master
- Enables cross-campaign features
- Foundation for reporting
---
## Production Deployment
### Server Setup
**1. Create Database:**
```bash
# On production server
sudo -u postgres psql
CREATE DATABASE ferrero_tracking;
CREATE USER ferrero_user WITH PASSWORD 'YOUR_SECURE_PASSWORD';
GRANT ALL PRIVILEGES ON DATABASE ferrero_tracking TO ferrero_user;
\q
```
**2. Import Schema:**
```bash
# Upload schema.sql to server
scp schema.sql user@server:~/
# Import
psql -h localhost -p 5432 -U ferrero_user -d ferrero_tracking < schema.sql
```
**3. Verify:**
```bash
psql -h localhost -p 5432 -U ferrero_user -d ferrero_tracking
# Check tables
\dt
# Check master_assets schema
\d master_assets
# Verify new columns
\d+ master_assets | grep -E "full_metadata|global_master"
```
### Environment Configuration
**Update .env with production database:**
```bash
DB_HOST=localhost # or production server IP
DB_PORT=5432 # or custom port
DB_USER=ferrero_user
DB_PASSWORD=your_production_password
```
---
## Maintenance
### Regular Maintenance
**Vacuum database (weekly):**
```sql
VACUUM ANALYZE master_assets;
VACUUM ANALYZE derivative_assets;
```
**Check database size:**
```sql
SELECT pg_size_pretty(pg_database_size('ferrero_tracking'));
```
**Check table sizes:**
```sql
SELECT
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
```
### Cleanup Old Records
**Archive old tracking IDs (optional):**
```sql
-- Move to archive table after 6 months
INSERT INTO master_assets_archive
SELECT * FROM master_assets
WHERE created_at < NOW() - INTERVAL '6 months'
AND status = 'active';
-- Update status
UPDATE master_assets
SET status = 'archived', updated_at = CURRENT_TIMESTAMP
WHERE created_at < NOW() - INTERVAL '6 months';
```
---
## Data Examples
### Sample master_assets Record
```sql
tracking_id: ABC123
opentext_id: 0008a50461e6a554b5eeec2effc02222fedf2801
original_filename: 06_RAFFAELLO_MAESTRO_SD
file_extension: .mp4
file_size_bytes: 251723
mime_type: video/mp4
brand_code: RAF
country_code: NULL
language_code: NULL
asset_type: TVC
upload_directory: ea0dbf86e13e3634895746d3a986558ec2eb8be1
description: Box File ID: 2031029170878
Box URL: https://app.box.com/file/2031029170878
DAM Asset ID: 0008a50461e6a554b5eeec2effc02222fedf2801
full_metadata: {
"name": "06_RAFFAELLO_MAESTRO_SD.mp4",
"asset_id": "0008a50461e6a554b5eeec2effc02222fedf2801",
"metadata": { ... complete 200KB metadata ... },
"inherited_metadata_collections": [ ... campaign info ... ]
}
global_master_campaign_id: C000000068
global_master_folder_id: 676f2bcde4c7bcf7ef783e97f7495069bf50b6bc
status: active
created_at: 2025-11-03 13:20:00
```
### Sample derivative_assets Record
```sql
tracking_id: ABC123
derivative_filename: RAF_DE_de_TEST_OLV_001_6S_16x9
file_extension: .mp4
language_code: DE
asset_type: OLV
omg_job_number: 1234567
status: active
uploaded_at: 2025-11-03 14:30:00
```
---
## Monitoring Queries
### Dashboard Queries
**Recent activity (last 24 hours):**
```sql
SELECT
COUNT(*) as total_assets,
COUNT(DISTINCT DATE(created_at)) as days,
MIN(created_at) as earliest,
MAX(created_at) as latest
FROM master_assets
WHERE created_at > NOW() - INTERVAL '24 hours';
```
**Assets by workflow type:**
```sql
SELECT
CASE
WHEN global_master_campaign_id IS NULL THEN 'Global Masters (B1→B2)'
ELSE 'Local Adaptation (A1→A2)'
END as workflow,
COUNT(*) as count,
COUNT(DISTINCT DATE(created_at)) as days_active
FROM master_assets
WHERE created_at > NOW() - INTERVAL '7 days'
AND status = 'active'
GROUP BY workflow;
```
**Upload completion rate:**
```sql
SELECT
COUNT(DISTINCT ma.tracking_id) as total_masters,
COUNT(DISTINCT da.tracking_id) as uploaded_derivatives,
ROUND(100.0 * COUNT(DISTINCT da.tracking_id) / COUNT(DISTINCT ma.tracking_id), 2) as completion_rate
FROM master_assets ma
LEFT JOIN derivative_assets da ON ma.tracking_id = da.tracking_id
WHERE ma.created_at > NOW() - INTERVAL '7 days';
```
---
## Performance Optimization
### Recommended Indexes
Already created:
```sql
idx_master_assets_tracking_id (UNIQUE)
idx_master_assets_opentext_id
idx_master_assets_status
idx_master_assets_created_at
idx_master_assets_global_master (NEW!)
```
### JSONB Indexing (if needed for queries)
**GIN index for full_metadata queries:**
```sql
-- If you need to query inside full_metadata JSONB
CREATE INDEX idx_master_assets_full_metadata_gin
ON master_assets USING GIN (full_metadata);
```
**Specific JSON path indexes:**
```sql
-- Index on specific metadata field
CREATE INDEX idx_full_metadata_asset_type
ON master_assets ((full_metadata->'metadata'->'metadata_element_list'));
```
---
## Security
### User Permissions
**Application user (ferrero_user):**
```sql
-- Should have INSERT, SELECT, UPDATE on master_assets and derivative_assets
GRANT SELECT, INSERT, UPDATE ON master_assets TO ferrero_user;
GRANT SELECT, INSERT, UPDATE ON derivative_assets TO ferrero_user;
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO ferrero_user;
```
**Read-only user (for reporting):**
```sql
CREATE USER ferrero_readonly WITH PASSWORD 'readonly_password';
GRANT CONNECT ON DATABASE ferrero_tracking TO ferrero_readonly;
GRANT SELECT ON master_assets TO ferrero_readonly;
GRANT SELECT ON derivative_assets TO ferrero_readonly;
```
### SSL/TLS
**For production, enable SSL:**
```sql
-- postgresql.conf
ssl = on
ssl_cert_file = '/path/to/server.crt'
ssl_key_file = '/path/to/server.key'
```
---
## Troubleshooting
### Common Issues
**Issue: Connection refused**
```bash
# Check PostgreSQL is running
pg_isready -h localhost -p 5437
# Check if port is open
netstat -an | grep 5437
```
**Issue: Permission denied**
```bash
# Verify user has permissions
psql -h localhost -p 5437 -U ferrero_user -d ferrero_tracking \
-c "SELECT has_table_privilege('ferrero_user', 'master_assets', 'INSERT');"
```
**Issue: Duplicate tracking_id**
```bash
# Check for duplicates
SELECT tracking_id, COUNT(*)
FROM master_assets
GROUP BY tracking_id
HAVING COUNT(*) > 1;
```
**Issue: full_metadata too large**
```sql
-- Check metadata sizes
SELECT
tracking_id,
pg_size_pretty(pg_column_size(full_metadata)) as metadata_size
FROM master_assets
WHERE full_metadata IS NOT NULL
ORDER BY pg_column_size(full_metadata) DESC
LIMIT 10;
```
---
## Version History
| Version | Date | Changes |
|---------|------|---------|
| 1.0 | Oct 29, 2025 | Initial schema |
| 1.1 | Oct 30, 2025 | Added full_metadata JSONB |
| 2.0 | Nov 3, 2025 | Added global_master_campaign_id and global_master_folder_id |
---
**For complete documentation, see [PROJECT_STATUS_2025-11-03.md](../PROJECT_STATUS_2025-11-03.md)**