docs: add backend architecture documentation for campaign reuse

Comprehensive technical reference covering system architecture, Celery queue topology, Cloud Run video offloading, resilience patterns, and configuration. Includes 6 Mermaid diagrams. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 09:40:39 -06:00 · 2026-02-14 09:40:39 -06:00 · 792e1323e9
commit 792e1323e9
parent 371b225d9c
1 changed files with 738 additions and 0 deletions
--- a/documentation/architecture.md
+++ b/documentation/architecture.md
@ -0,0 +1,738 @@
+# Backend Architecture: Pet Love Song Generator
+
+**Stack:** FastAPI | Celery | PostgreSQL 15 | Redis 7 | Google Cloud Storage | Cloud Run
+
+---
+
+## Executive Summary
+
+The Pet Love Song Generator is a campaign microsite that lets users submit their pet's name, photo, and a music vibe preference to receive an AI-generated love song rendered as a vinyl-record music video. The backend receives form submissions, dispatches them to an external AI music generation API (Sonauto), processes webhook callbacks, downloads the generated audio, and creates a composited video — all asynchronously.
+
+The key architectural decision is **offloading CPU-intensive video rendering to Google Cloud Run**. The main backend handles API serving, database operations, and task orchestration, while each video render gets its own ephemeral Cloud Run container instance. This separation means that a burst of 100 simultaneous video renders doesn't compete with API request handling or webhook processing on the main server. Cloud Run's auto-scaling matches the bursty traffic pattern of a marketing campaign (peaks during social shares, lulls overnight), and its scale-to-zero capability keeps costs proportional to actual usage.
+
+The system was designed to handle campaign-scale bursts — hundreds of concurrent submissions arriving within minutes of a social media push — while remaining cost-efficient during quiet periods. Resilience patterns (retry strategies, two-layer timeouts, fail-fast startup checks) ensure that transient failures in external APIs or infrastructure don't silently lose user submissions.
+
+---
+
+## System Architecture Overview
+
+The application runs as six Docker Compose services on a single host, plus a Cloud Run satellite service for video rendering:
+
+| Service | Role | Image |
+|---------|------|-------|
+| **web** | FastAPI application server | Custom Dockerfile |
+| **celery_worker_default** | Default task worker (API calls, audio downloads, DB ops) | Custom Dockerfile |
+| **celery_worker_video** | Video dispatch worker (HTTP calls to Cloud Run) | Custom Dockerfile |
+| **celery_beat** | Periodic task scheduler | Custom Dockerfile |
+| **db** | PostgreSQL 15 database | `postgres:15` |
+| **redis** | Message broker + result backend + cache | `redis:7-alpine` |
+
+The Cloud Run video service runs independently, sharing only the GCS bucket as a data plane.
+
+```mermaid
+graph LR
+    subgraph "User"
+        Browser[Browser / PHP Frontend]
+    end
+
+    subgraph "Docker Compose Host"
+        FastAPI[FastAPI<br/>web service]
+        WorkerDefault[Celery Worker<br/>default queue<br/>concurrency=10]
+        WorkerVideo[Celery Worker<br/>video queue<br/>concurrency=100]
+        Beat[Celery Beat<br/>scheduler]
+        Redis[(Redis 7)]
+        Postgres[(PostgreSQL 15)]
+    end
+
+    subgraph "Google Cloud"
+        CloudRun[Cloud Run<br/>Video Service]
+        GCS[(GCS Bucket<br/>vday2026)]
+    end
+
+    subgraph "External"
+        Sonauto[Sonauto API<br/>Music Generation]
+    end
+
+    Browser -->|POST /api/submissions| FastAPI
+    Browser -->|GET /api/submissions/.../status| FastAPI
+    FastAPI --> Postgres
+    FastAPI --> Redis
+    FastAPI -->|upload photo| GCS
+
+    Redis --> WorkerDefault
+    Redis --> WorkerVideo
+    Beat --> Redis
+
+    WorkerDefault -->|send prompt| Sonauto
+    WorkerDefault -->|download audio| GCS
+    Sonauto -->|webhook callback| FastAPI
+
+    WorkerVideo -->|POST /generate| CloudRun
+    CloudRun -->|read audio, photo| GCS
+    CloudRun -->|write video| GCS
+
+    Browser -->|stream video| GCS
+```
+
+---
+
+## Request Lifecycle
+
+A submission moves through three distinct phases: **Submission**, **Generation**, and **Delivery**.
+
+### Phase 1: Submission
+
+1. User fills out the form (pet name, owner name, pet type, music vibe, optional photo)
+2. Browser sends `POST /api/submissions` with form data and base64-encoded cropped image
+3. FastAPI validates input, runs server-side profanity check on pet/owner names
+4. If a photo is provided, it's decoded from base64 and uploaded to GCS (`uploads/{session_id}.jpg`)
+5. A `Submission` record is created in PostgreSQL with status `pending`
+6. A `send_to_sonauto` Celery task is dispatched immediately
+7. Response returns `session_id` to the browser for polling
+
+### Phase 2: Generation
+
+8. The `send_to_sonauto` task constructs a prompt from pet-specific templates and sends it to the Sonauto API
+9. The submission status moves to `processing` and a per-submission timeout task is scheduled
+10. Sonauto generates the song asynchronously and sends webhook callbacks:
+    - `GENERATING_STREAMING_READY` — the audio stream is available for early playback (the frontend can start playing before the full song is done)
+    - `SUCCESS` — full generation complete, triggers the post-webhook processing chain
+    - `FAILURE` — marks the submission as failed
+11. On `SUCCESS`, a Celery chain is triggered: `fetch_generation_details → download_audio → create_video`
+12. The chain fetches song details from the Sonauto API, downloads the MP3 to GCS, then dispatches video creation
+13. Video creation sends a request to Cloud Run, which renders the vinyl record animation using FFmpeg and uploads the result to GCS
+
+### Phase 3: Delivery
+
+14. The frontend polls `GET /api/submissions/{session_id}/status` on an interval
+15. Once status is `success`, the frontend redirects to the results page
+16. The results page fetches `GET /api/results/{session_id}` which returns the video URL, lyrics, record image, and metadata
+17. The video is served directly from GCS public URLs; downloads use signed URLs with `Content-Disposition: attachment`
+
+**Streaming-ready optimization:** When the `GENERATING_STREAMING_READY` webhook arrives, the status endpoint includes `streaming_ready: true` and the Sonauto `task_id`. The frontend can begin audio playback via Sonauto's streaming endpoint while the full song and video are still being generated. This reduces perceived wait time significantly.
+
+```mermaid
+sequenceDiagram
+    participant User as Browser
+    participant API as FastAPI
+    participant GCS as GCS Bucket
+    participant DB as PostgreSQL
+    participant Redis as Redis
+    participant Worker as Celery Worker<br/>(default)
+    participant Sonauto as Sonauto API
+    participant VideoW as Celery Worker<br/>(video)
+    participant CR as Cloud Run
+
+    User->>API: POST /api/submissions
+    API->>GCS: Upload pet photo
+    API->>DB: INSERT submission (pending)
+    API->>Redis: Enqueue send_to_sonauto
+    API-->>User: { session_id }
+
+    Worker->>Redis: Dequeue task
+    Worker->>Sonauto: POST /generations/v3
+    Worker->>DB: UPDATE (processing)
+    Worker->>Redis: Schedule timeout check
+
+    Note over Sonauto: AI generates song...
+
+    Sonauto-->>API: Webhook (STREAMING_READY)
+    API->>DB: SET streaming_ready_at
+
+    User->>API: GET /status (polling)
+    API-->>User: { streaming_ready: true }
+    Note over User: Begin audio playback
+
+    Sonauto-->>API: Webhook (SUCCESS)
+    API->>DB: SET received_from_LLM
+    API->>Redis: Enqueue chain
+
+    Worker->>Sonauto: GET /generations/{task_id}
+    Worker->>DB: Store lyrics, song URL
+    Worker->>GCS: Download & upload audio
+    Worker->>DB: SET generated_song_path
+
+    Worker->>Redis: Pass to video queue
+    VideoW->>CR: POST /generate
+    CR->>GCS: Download audio + photo
+    Note over CR: FFmpeg renders video (~10s)
+    CR->>GCS: Upload video + composite
+    CR-->>VideoW: { video_blob_path }
+    VideoW->>DB: SET status=success
+
+    User->>API: GET /status (polling)
+    API-->>User: { status: success }
+    User->>API: GET /results/{session_id}
+    API-->>User: { video_url, lyrics, ... }
+    User->>GCS: Stream video
+```
+
+---
+
+## Celery Task Chain Architecture
+
+After the Sonauto webhook signals `SUCCESS`, a three-task Celery chain processes the result. Each task receives a dict from its predecessor and returns a dict consumed by the next.
+
+### Chain: `fetch_generation_details → download_audio → create_video`
+
+**Task 1 — `fetch_generation_details`**
+- Calls `GET /generations/{task_id}` on the Sonauto API
+- Extracts song URL, lyrics, and status from the response
+- Stores the full API response, lyrics, and song URL in the database
+- Returns: `{ session_id, song_url, lyrics }`
+
+**Task 2 — `download_audio`**
+- Downloads the MP3 from the Sonauto CDN URL
+- Uploads it to GCS at `audio/{session_id}.mp3`
+- Updates the database with the blob path
+- Returns: `{ session_id, audio_blob_path, lyrics }`
+
+**Task 3 — `create_video`**
+- Routes to the `video` queue (separate worker pool)
+- If `CLOUD_RUN_VIDEO_URL` is configured: dispatches to Cloud Run via HTTP POST
+- Otherwise: falls back to local FFmpeg rendering (development mode)
+- Marks the submission as `success` or `fail` based on the outcome
+- Returns: `{ session_id, video_blob_path }`
+
+### Retry Configuration
+
+All three chain tasks use the same retry pattern:
+
+```python
+@shared_task(
+    autoretry_for=(requests.RequestException,),
+    retry_backoff=True,          # Exponential backoff
+    retry_backoff_max=60,        # Cap at 60 seconds
+    retry_kwargs={"max_retries": 3},
+)
+```
+
+Additionally, `send_to_sonauto` tracks retries at the application level via a `retry_count` column. If the count reaches `MAX_RETRIES` (default: 3), the submission is marked as `fail` even before Celery exhausts its own retry budget — preventing indefinite requeuing.
+
+```mermaid
+graph LR
+    Webhook[Webhook<br/>SUCCESS] --> Fetch[fetch_generation<br/>_details]
+    Fetch --> Download[download_audio]
+    Download --> Video[create_video]
+    Video --> Success[status = success]
+
+    Fetch -->|RequestException<br/>3 retries| Fetch
+    Download -->|RequestException<br/>3 retries| Download
+    Video -->|RequestException<br/>2 retries| Video
+
+    Fetch -->|max retries exceeded| Fail[status = fail]
+    Download -->|max retries exceeded| Fail
+    Video -->|max retries exceeded| Fail
+```
+
+---
+
+## Queue Topology & Concurrency Model
+
+This is the most critical load-handling pattern in the system. Two separate Celery queues with independent worker pools prevent video dispatch tasks from starving API-calling tasks during traffic bursts.
+
+### Why Queue Separation Matters
+
+Video generation takes approximately 10 seconds per submission. During a traffic burst — say, 200 submissions arriving within a few minutes after a social media push — the system needs to:
+
+1. Continue accepting and acknowledging Sonauto API calls (`send_to_sonauto`)
+2. Continue processing webhook callbacks (the chain's first two tasks)
+3. Dispatch up to 100 video renders in parallel to Cloud Run
+
+Without queue separation, all these tasks would compete for the same worker slots. If the default worker had 10 concurrency slots, only 10 video dispatches could run at once — and while those slots are occupied waiting for Cloud Run HTTP responses (~10s each), no API calls or audio downloads could proceed. The entire pipeline would bottleneck.
+
+### Queue Configuration
+
+**"celery" queue** (default) — concurrency: 10
+- `send_to_sonauto` — outbound API calls to Sonauto
+- `fetch_generation_details` — fetch song metadata from Sonauto
+- `download_audio` — download MP3 from CDN, upload to GCS
+- `check_submission_timeout` — per-submission timeout checks
+- `check_timeouts` — periodic safety-net sweep
+- `cleanup_old_files` — daily file retention cleanup
+
+These tasks involve API calls, file downloads, and database operations. Concurrency of 10 is sufficient because each task completes in seconds, and the external APIs have their own rate limits.
+
+**"video" queue** — concurrency: 100
+- `create_video` — HTTP dispatch to Cloud Run (or local FFmpeg fallback)
+
+This worker does almost no local work. It sends an HTTP POST to Cloud Run and waits for the response. The actual CPU work happens on Cloud Run. With 100 concurrency, the system can have 100 Cloud Run instances rendering videos simultaneously. The high concurrency is safe because the worker is purely I/O-bound — it's just holding open HTTP connections.
+
+### Worker-Level Settings
+
+```python
+# celery_app.py
+"task_acks_late": True,
+"worker_prefetch_multiplier": 1,
+"task_routes": {
+    "tasks.workers.create_video": {"queue": "video"},
+},
+"task_default_queue": "celery",
+```
+
+- **`worker_prefetch_multiplier=1`**: Each worker only grabs one task at a time from the broker. This ensures fair distribution across workers — without it, a single worker could prefetch a batch of tasks and create an uneven load.
+- **`task_acks_late=True`**: Tasks are acknowledged only after completion, not when received. If a worker crashes mid-task, the message returns to the queue and another worker picks it up. This is the crash-recovery mechanism.
+
+```mermaid
+graph TB
+    Redis[(Redis Broker)]
+
+    Redis --> CeleryQ["celery" queue]
+    Redis --> VideoQ["video" queue]
+
+    CeleryQ --> DefaultPool[Default Worker Pool<br/>concurrency = 10<br/>API calls, downloads, DB ops]
+    VideoQ --> VideoPool[Video Worker Pool<br/>concurrency = 100<br/>HTTP dispatch only]
+
+    DefaultPool -->|API calls| Sonauto[Sonauto API]
+    DefaultPool -->|upload/download| GCS[(GCS)]
+    DefaultPool -->|read/write| DB[(PostgreSQL)]
+
+    VideoPool -->|POST /generate| CloudRun[Cloud Run<br/>auto-scaling<br/>video instances]
+    CloudRun -->|read/write| GCS
+```
+
+---
+
+## Cloud Run Video Offloading
+
+### Why Cloud Run?
+
+FFmpeg video rendering is CPU-intensive. Each video takes approximately 10 seconds to render: generating ~45 frames for one rotation cycle, compositing four image layers per frame (background, pet photo, vinyl record, needle), then encoding to H.264. On the main server, running even 5 concurrent FFmpeg processes would consume significant CPU and memory, competing with FastAPI request handling and Celery task execution.
+
+Cloud Run solves this in three ways:
+
+1. **Horizontal auto-scaling**: Each video render request gets its own container instance. Cloud Run provisions instances on demand, so 100 simultaneous renders don't impact each other. There's no shared CPU to contend for.
+
+2. **Isolation from the main backend**: The Celery video worker's job is reduced to a single HTTP POST — it doesn't consume CPU for rendering. The main server remains responsive for API requests, webhook processing, and database operations regardless of video rendering load.
+
+3. **Scale-to-zero economics**: Campaign traffic follows a burst pattern (social shares trigger waves of submissions, then activity drops). Cloud Run scales to zero when idle, so there's no cost during lulls. This is far more cost-efficient than provisioning dedicated compute for peak load.
+
+### Data Flow
+
+GCS serves as the shared data plane between the main backend and Cloud Run. There is no direct file transfer between services:
+
+1. The Celery worker sends `{ session_id, audio_blob_path, photo_blob_path }` to Cloud Run
+2. Cloud Run downloads the audio and photo from GCS
+3. Cloud Run runs FFmpeg to generate the video (720x1280px MP4, 15fps, vinyl record animation)
+4. Cloud Run uploads the video and composite record image to GCS
+5. Cloud Run returns `{ video_blob_path }` to the Celery worker
+6. The Celery worker updates the database with the video path and marks the submission as `success`
+
+### Configuration
+
+```
+CLOUD_RUN_VIDEO_URL=https://video-generator-xxx.a.run.app
+VIDEO_SERVICE_API_KEY=shared-secret-for-auth
+CLOUD_RUN_VIDEO_TIMEOUT=300  # 5 minutes safety ceiling
+```
+
+The 300-second timeout is a safety ceiling — videos normally render in ~10 seconds, but the timeout accounts for cold starts, GCS download time, and edge cases with very long audio tracks.
+
+### Local Development Fallback
+
+When `CLOUD_RUN_VIDEO_URL` is not set, the `create_video` task falls back to local FFmpeg rendering. This allows development and testing without Cloud Run infrastructure. The local path downloads files from GCS to temp files, runs the video generator, uploads results, and cleans up.
+
+---
+
+## Resilience & Failure Handling
+
+### 8.1 Retry Strategy
+
+All API-calling tasks use Celery's built-in exponential backoff retry:
+
+- **Trigger**: `autoretry_for=(requests.RequestException,)` — any HTTP error triggers a retry
+- **Backoff**: Exponential with `retry_backoff=True`, capped at `retry_backoff_max=60` seconds
+- **Max retries**: 3 for API/download tasks, 2 for video creation
+- **Application-level tracking**: The `send_to_sonauto` task also increments a `retry_count` column on the submission. If this count reaches `MAX_RETRIES` (configurable, default 3), the submission is marked as `fail` independently of Celery's retry mechanism. This prevents scenarios where Celery's retry counter resets (e.g., after a worker restart) from causing infinite retries.
+
+### 8.2 Two-Layer Timeout Strategy
+
+Submissions that are sent to Sonauto but never receive a webhook callback need to be detected and marked as failed. The system uses two complementary timeout mechanisms:
+
+**Layer 1 — Event-based per-submission countdown (primary)**
+
+When `send_to_sonauto` successfully calls the Sonauto API, it immediately schedules a `check_submission_timeout` task with a countdown equal to `WEBHOOK_TIMEOUT_MINUTES` (default: 10 minutes):
+
+```python
+check_submission_timeout.apply_async(
+    args=[session_id],
+    countdown=settings.WEBHOOK_TIMEOUT_MINUTES * 60,
+)
+```
+
+This task fires at the exact timeout moment for each submission. If the webhook has already arrived, it's a no-op. If the submission is still in `processing` state, it's marked as `fail` with `LLM_status = "timeout"`.
+
+**Layer 2 — Periodic safety-net sweep (fallback)**
+
+Celery Beat runs `check_timeouts` every 30 minutes. This queries for any submissions where `sent_to_LLM` is older than the timeout threshold and `received_from_LLM` is still NULL. This catches edge cases where the per-submission timeout task itself failed to execute (e.g., the worker restarted, Redis lost the message).
+
+```mermaid
+graph TD
+    Submit[send_to_sonauto<br/>sends to API] --> Schedule[Schedule<br/>check_submission_timeout<br/>countdown = 10 min]
+
+    Schedule --> Timer{10 minutes<br/>elapsed}
+    Timer -->|webhook arrived| NoOp[No-op<br/>already processed]
+    Timer -->|still processing| MarkFail1[Mark as fail<br/>LLM_status = timeout]
+
+    Beat[Celery Beat<br/>every 30 min] --> Sweep[check_timeouts<br/>query stale submissions]
+    Sweep -->|found stale| MarkFail2[Mark as fail<br/>LLM_status = timeout]
+    Sweep -->|none found| Clean[No action needed]
+
+    MarkFail1 --> Failed[Submission<br/>status = fail]
+    MarkFail2 --> Failed
+```
+
+### 8.3 Fail-Fast Startup
+
+The FastAPI application verifies GCS connectivity on startup:
+
+```python
+@app.on_event("startup")
+async def startup_event():
+    from app.services import storage
+    storage.check_connectivity()  # Raises if bucket inaccessible
+```
+
+If the GCS bucket is inaccessible (wrong credentials, bucket doesn't exist, network issue), the application refuses to start. This prevents a scenario where the app accepts submissions but silently fails to store images, leading to broken submissions that can never produce videos.
+
+### 8.4 Health Monitoring
+
+The `/api/health` endpoint performs three checks:
+
+| Check | Method | Unhealthy if |
+|-------|--------|-------------|
+| Redis | `redis_client.ping()` | Connection refused or timeout |
+| PostgreSQL | `SELECT 1` via SQLAlchemy | Connection refused or query fails |
+| Celery workers | `celery_app.control.inspect().active()` | No active workers detected |
+
+If any check fails, the endpoint returns HTTP 503. Load balancers use this to route traffic away from unhealthy instances.
+
+### 8.5 Content Safety
+
+Two layers of content filtering protect against inappropriate submissions:
+
+**Text filtering (client + server):**
+- Client-side: JavaScript profanity check provides instant feedback
+- Server-side: The `contains_profanity()` function uses the `better-profanity` library plus a custom banned-words list. This is the authoritative check — client-side filtering can be bypassed.
+
+**Image classification (Gemini AI):**
+- Configurable via `IMAGE_SAFETY_METHOD` (default: `gemini`, alternative: `nudenet`)
+- Gemini receives the image with a strict classification prompt that only accepts photos with a pet/animal as the primary subject
+- **Fail-closed on content filter**: If Gemini's own safety filter blocks the response, the image is rejected
+- **Fail-open on network error**: If Gemini is unreachable (network error, auth failure), the image is allowed through — availability is prioritized over a rare edge case
+
+### 8.6 Rate Limiting
+
+Each user is identified by a `cookie_id` (generated on first submission, stored in localStorage). The server enforces a maximum of `FORM_SUBMIT_RETRY` (default: 10) submissions per `cookie_id` by counting existing submissions in the database. The client also enforces this limit for immediate feedback, but the server-side check is authoritative.
+
+### 8.7 Database Connection Pooling
+
+```python
+engine = create_engine(
+    settings.DATABASE_URL,
+    pool_pre_ping=True,    # Verify connections before use
+    pool_size=10,          # Base pool size
+    max_overflow=20,       # Additional connections under load
+)
+```
+
+- **`pool_pre_ping=True`**: Before using a pooled connection, SQLAlchemy sends a lightweight ping. If the connection is stale (database restarted, network blip), it's discarded and a fresh one is created. This prevents "connection reset" errors after idle periods.
+- **`pool_size=10`**: Maintains 10 persistent connections. Sufficient for the FastAPI server and Celery workers under normal load.
+- **`max_overflow=20`**: Up to 20 additional connections can be created during traffic bursts, for a total of 30 concurrent connections. These overflow connections are closed after use.
+
+### 8.8 Data Retention
+
+A Celery Beat task runs daily at 3:00 AM UTC:
+
+1. Queries submissions older than `FILE_RETENTION_DAYS` (default: 30 days)
+2. Deletes associated files from GCS: uploaded photo, generated audio, generated video, and composite record image
+3. Logs the count of deleted files and processed submissions
+
+The database records themselves are not deleted — only the associated files. This preserves the audit trail while reclaiming storage.
+
+---
+
+## Data Model
+
+The system uses a single `submissions` table. Fields are grouped by purpose:
+
+### Identity & Tracking
+
+| Column | Type | Purpose |
+|--------|------|---------|
+| `session_id` | `String(255)` PK | CUID2-generated unique identifier |
+| `cookie_id` | `String(255)` indexed | Client-side user identifier for rate limiting |
+| `created_at` | `DateTime` | Submission timestamp |
+
+### Form Data
+
+| Column | Type | Purpose |
+|--------|------|---------|
+| `owner_name` | `String(100)` | Pet owner's name |
+| `pet_name` | `String(100)` | Pet's name |
+| `pet_type` | `String(50)` | Pet species (dog, cat, etc.) |
+| `music_vibe` | `String(50)` | Selected music style |
+| `photo_path` | `String(500)` | GCS blob path to uploaded pet photo |
+
+### Processing State
+
+| Column | Type | Purpose |
+|--------|------|---------|
+| `entry_status` | `String(50)` | Current status: `pending` → `processing` → `success` / `fail` |
+| `retry_count` | `Integer` | Application-level retry counter for API calls |
+
+### External API Tracking
+
+| Column | Type | Purpose |
+|--------|------|---------|
+| `sent_to_LLM` | `DateTime` | When the prompt was sent to Sonauto |
+| `LLM_task_id` | `String(255)` indexed | Sonauto's task identifier for webhook correlation |
+| `received_from_LLM` | `DateTime` | When the webhook callback was received |
+| `LLM_response` | `Text` | Primary song URL from Sonauto |
+| `LLM_full_response` | `Text` | Full JSON response for debugging |
+| `LLM_status` | `String(50)` | Sonauto-specific status (success, fail, timeout) |
+
+### Generated Content
+
+| Column | Type | Purpose |
+|--------|------|---------|
+| `lyrics` | `Text` | AI-generated song lyrics |
+| `generated_song_path` | `Text` | GCS blob path to MP3 audio |
+| `generated_video_path` | `String(500)` | GCS blob path to MP4 video |
+
+### Timing & Streaming
+
+| Column | Type | Purpose |
+|--------|------|---------|
+| `streaming_ready_at` | `DateTime` | When Sonauto's streaming endpoint became available |
+| `video_creation_start` | `DateTime` | When video rendering began |
+| `video_creation_end` | `DateTime` | When video rendering completed |
+
+### Indexes
+
+- `ix_submissions_cookie_id` on `cookie_id` — fast rate-limit lookups
+- `ix_submissions_llm_task_id` on `LLM_task_id` — fast webhook correlation by Sonauto task ID
+
+---
+
+## Infrastructure (Docker Compose)
+
+Six services are defined in `docker-compose.yml`. The four application services (web, two workers, beat) share the same custom Dockerfile. Data stores use official images.
+
+### Service Configuration
+
+**web** — FastAPI application server
+- Ports: `8000:8000`
+- Depends on: `db` (healthy), `redis` (started)
+- Volumes: GCS credentials (read-only)
+
+**celery_worker_default** — Default queue worker
+- Command: `celery -A tasks.celery_app worker --concurrency=10 --queues=celery`
+- Handles: API calls, audio downloads, timeout checks, cleanup
+- Depends on: `db` (healthy), `redis` (started)
+
+**celery_worker_video** — Video queue worker
+- Command: `celery -A tasks.celery_app worker --concurrency=100 --queues=video`
+- Handles: HTTP dispatch to Cloud Run only
+- Additional env: `CLOUD_RUN_VIDEO_URL`, `VIDEO_SERVICE_API_KEY`
+- Depends on: `db` (healthy), `redis` (started)
+
+**celery_beat** — Periodic task scheduler
+- Command: `celery -A tasks.celery_app beat`
+- Schedules: `check_timeouts` (every 30 min), `cleanup_old_files` (daily 3 AM)
+- Depends on: `db` (healthy), `redis` (started)
+
+**db** — PostgreSQL 15
+- Health check: `pg_isready -U pah -d pah` (interval 5s, 5 retries)
+- Persistent volume: `postgres_data`
+
+**redis** — Redis 7 Alpine
+- Persistent volume: `redis_data`
+
+### Dependency Graph
+
+```mermaid
+graph TB
+    subgraph "Application Services"
+        Web[web<br/>FastAPI :8000]
+        WorkerDefault[celery_worker_default<br/>concurrency=10]
+        WorkerVideo[celery_worker_video<br/>concurrency=100]
+        Beat[celery_beat<br/>scheduler]
+    end
+
+    subgraph "Data Stores"
+        DB[(db<br/>PostgreSQL 15<br/>:5432)]
+        Redis[(redis<br/>Redis 7<br/>:6379)]
+    end
+
+    subgraph "External"
+        CloudRun[Cloud Run<br/>Video Service]
+        GCS[(GCS Bucket)]
+    end
+
+    Web -->|depends_on: healthy| DB
+    Web -->|depends_on: started| Redis
+    WorkerDefault -->|depends_on: healthy| DB
+    WorkerDefault -->|depends_on: started| Redis
+    WorkerVideo -->|depends_on: healthy| DB
+    WorkerVideo -->|depends_on: started| Redis
+    Beat -->|depends_on: healthy| DB
+    Beat -->|depends_on: started| Redis
+
+    WorkerVideo -.->|HTTP POST| CloudRun
+    CloudRun -.->|read/write| GCS
+    Web -.->|upload| GCS
+    WorkerDefault -.->|upload/download| GCS
+```
+
+All application services receive the same base environment variables (DATABASE_URL, REDIS_URL, CELERY_BROKER_URL, SONAUTO_API_KEY, WEBHOOK_BASE_URL) plus GCS credentials mounted as a read-only volume. The video worker additionally receives `CLOUD_RUN_VIDEO_URL` and `VIDEO_SERVICE_API_KEY`.
+
+---
+
+## Monitoring & Observability
+
+### Health Endpoint
+
+`GET /api/health` returns a JSON payload with the status of Redis, PostgreSQL, and Celery workers. Returns 200 when all checks pass, 503 otherwise. Designed for load balancer health checks.
+
+### Admin Queue Status
+
+`GET /api/admin/queue-status` returns:
+- Count of `pending` submissions
+- Count of `processing` submissions
+- Sonauto API credit balance (cached in Redis for 15 minutes; fetched live on cache miss)
+
+### Admin Data Endpoint
+
+`GET /api/admin/data` implements DataTables server-side processing (SSP) with search, sorting, and pagination. Supports filtering across `owner_name`, `pet_name`, and `session_id`.
+
+### Structured Logging
+
+All components use Python's `logging` module with a consistent format:
+```
+%(asctime)s - %(name)s - %(levelname)s - %(message)s
+```
+
+Key events logged at INFO level:
+- Submission creation and dispatch
+- Sonauto API calls (send, fetch, response)
+- Webhook receipt and chain triggering
+- GCS uploads/downloads with blob paths and byte counts
+- Video generation start/completion with timing
+- Timeout and failure events (WARNING level)
+
+### Database Timestamp Audit Trail
+
+Each submission records timestamps at critical pipeline stages:
+- `created_at` — when the user submitted the form
+- `sent_to_LLM` — when the prompt was sent to Sonauto
+- `streaming_ready_at` — when early audio playback became available
+- `received_from_LLM` — when the webhook arrived
+- `video_creation_start` / `video_creation_end` — video render duration
+
+This allows post-hoc analysis of pipeline latency at each stage.
+
+---
+
+## Storage Architecture (GCS)
+
+### Bucket Structure
+
+All files reside in a single GCS bucket (`vday2026`) organized by type:
+
+```
+vday2026/
+├── uploads/        # Pet photos (JPEG, ~50-200 KB each)
+│   └── {session_id}.jpg
+├── audio/          # Generated songs (MP3, ~3-5 MB each)
+│   └── {session_id}.mp3
+├── video/          # Generated videos (MP4, ~5-15 MB each)
+│   └── {session_id}.mp4
+└── images/         # Composite record images (PNG)
+    └── {session_id}.png
+```
+
+### Naming Convention
+
+All files use `{session_id}` as the filename — a CUID2 identifier that is unique, URL-safe, and collision-resistant. This eliminates the need for name deduplication logic.
+
+### URL Patterns
+
+- **Public URLs** for streaming/display: `https://storage.googleapis.com/vday2026/{folder}/{session_id}.{ext}`
+- **Signed download URLs** for the download button: Time-limited (60 min), include `Content-Disposition: attachment` header with a human-readable filename (`{pet-name}-love-song.mp4`)
+
+### Client Architecture
+
+The GCS client uses a lazy-loaded singleton pattern:
+
+```python
+_client: Client | None = None
+_bucket: Bucket | None = None
+
+def get_client() -> Client:
+    global _client
+    if _client is None:
+        _client = storage.Client.from_service_account_json(...)
+    return _client
+```
+
+This avoids repeated authentication handshakes while still supporting both service account credentials (Docker/production) and application default credentials (Cloud Run).
+
+---
+
+## Reusability Guide
+
+This architecture can be adapted for future campaigns with different external APIs and post-processing pipelines. Key adaptation points:
+
+- **Swap the external API**: Replace `send_to_sonauto` and the webhook handler with calls to any async API that supports webhook callbacks. The chain pattern (fetch details → download artifact → post-process) is generic.
+
+- **Swap post-processing**: Replace the video generation step with any CPU-intensive output creation (image collage, PDF generation, etc.). The Cloud Run offloading pattern works for any workload that benefits from horizontal scaling.
+
+- **Reuse queue topology**: The two-queue model (default + heavy-processing) is reusable whenever you have a mix of fast tasks (API calls, DB ops) and slow I/O-bound dispatch tasks. Adjust concurrency numbers based on the external service's capacity.
+
+- **Reuse timeout strategy**: The two-layer timeout (event-based countdown + periodic sweep) applies to any webhook-dependent workflow. Adjust `WEBHOOK_TIMEOUT_MINUTES` and the Beat sweep interval to match the external API's expected response time.
+
+- **Adapt rate limiting**: The cookie-based rate limiting is simple but effective for campaign microsites where user accounts don't exist. For authenticated flows, replace `cookie_id` with user IDs.
+
+- **Adapt content safety**: The pluggable image safety backend (Gemini vs NudeNet) pattern can be extended with additional methods. The profanity filter's custom banned-words list can be swapped per campaign.
+
+- **Config-driven via env vars**: All external URLs, API keys, timeouts, concurrency limits, and retention periods are configurable via environment variables. No code changes needed for infrastructure differences between campaigns.
+
+- **GCS as the data plane**: Using object storage as the shared data plane between services (main backend ↔ Cloud Run) avoids the complexity of direct file transfers, shared volumes, or service-to-service streaming. Any service that can read/write to the bucket can participate in the pipeline.
+
+---
+
+## Appendix: Configuration Reference
+
+All settings are defined in `backend/app/config.py` using Pydantic Settings. Values are loaded from environment variables, falling back to defaults.
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `DATABASE_URL` | `postgresql://pah:pah_password@localhost:5432/pah` | PostgreSQL connection string |
+| `REDIS_URL` | `redis://localhost:6379/0` | Redis connection URL |
+| `CELERY_BROKER_URL` | `redis://localhost:6379/0` | Celery message broker URL |
+| `CELERY_RESULT_BACKEND` | `redis://localhost:6379/0` | Celery result storage URL |
+| `SONAUTO_API_URL` | `https://api.sonauto.ai/v1` | Sonauto API base URL |
+| `SONAUTO_API_KEY` | *(empty)* | Sonauto API authentication key |
+| `WEBHOOK_BASE_URL` | `http://localhost:8000` | Base URL for webhook callbacks |
+| `MAX_CONCURRENT_REQUESTS` | `10` | Maximum concurrent API requests |
+| `MAX_RETRIES` | `3` | Application-level retry limit |
+| `FORM_SUBMIT_RETRY` | `10` | Maximum submissions per cookie_id |
+| `MIN_AVAILABLE_CREDITS` | `5000` | Minimum Sonauto credits before alert |
+| `WEBHOOK_TIMEOUT_MINUTES` | `10` | Minutes before a submission is considered timed out |
+| `API_REQUEST_TIMEOUT_SECONDS` | `30` | HTTP timeout for outbound API calls |
+| `IMAGE_SAFETY_METHOD` | `gemini` | Image classification backend (`gemini` or `nudenet`) |
+| `GEMINI_API_KEY` | *(empty)* | Google Gemini API key for image safety |
+| `CLOUD_RUN_VIDEO_URL` | *(empty)* | Cloud Run video service URL (empty = local FFmpeg) |
+| `VIDEO_SERVICE_API_KEY` | *(empty)* | Shared secret for Cloud Run authentication |
+| `CLOUD_RUN_VIDEO_TIMEOUT` | `300` | Cloud Run request timeout in seconds |
+| `GCS_PROJECT_ID` | `holiday-project-india` | Google Cloud project ID |
+| `GCS_BUCKET_NAME` | `vday2026` | GCS bucket name |
+| `GCS_CREDENTIALS_PATH` | `google_cloud_storage.json` | Path to GCS service account JSON |
+| `GCS_UPLOADS_FOLDER` | `uploads` | GCS folder for pet photos |
+| `GCS_AUDIO_FOLDER` | `audio` | GCS folder for generated audio |
+| `GCS_VIDEO_FOLDER` | `video` | GCS folder for generated video |
+| `GCS_IMAGES_FOLDER` | `images` | GCS folder for composite images |
+| `STORAGE_BASE` | `../storage` | Local storage path (dev fallback) |
+| `FILE_RETENTION_DAYS` | `30` | Days before files are cleaned up |
+| `CORS_ORIGINS` | `localhost:8000, :8080` | Allowed CORS origins |