Runbook — Accessible Video Processing Platform

Generated: 2026-05-01


Quick Navigation

Agent Entry

| Signal | Value |
|---|---|
| Purpose | Step-by-step procedures for running, deploying, and troubleshooting the platform |
| Read When | Local setup, deployment, restart, or incident diagnosis |
| Skip When | You need architecture understanding → architecture.md; inventory → infrastructure.md |
| Canonical | Yes |
| Next Docs | Infrastructure, Architecture |
| Primary Sources | scripts/run-local.sh, docker-compose.yml, .env.example |

1. Local Development Setup

Prerequisites

  • Docker Desktop (with docker compose v2)
  • Node.js 20+ and npm
  • GCP credentials JSON at secrets/gcp-credentials.json
  • .env.local file (copy from .env.example, fill secrets)

Backend (Docker)

# Start all backend services (API, workers, MongoDB, Redis)
./scripts/run-local.sh

# Force image rebuild after code changes
./scripts/run-local.sh --rebuild

# Stop all services
./scripts/run-local.sh --stop

# Restart
./scripts/run-local.sh --restart

The script uses docker-compose.yml + docker-compose.local.yml with .env.local.

After startup:

  • API: http://localhost:8012
  • Swagger UI: http://localhost:8012/docs

Frontend (Vite dev server)

cd frontend
npm install
npm run dev

Frontend runs on http://localhost:5173 by default.

Run Migrations

docker exec -it accessible-video-api python migrate.py

Create Test Users

docker exec -it accessible-video-api python create_test_users.py

2. Deployment (optical-web-1)

RULE: Never SSH into optical-web-1 or run commands on it without explicit user instruction.

Deploy Script

./scripts/deploy-dev.sh

Frontend Build

./scripts/build-frontend.sh

Builds the React SPA and copies dist/ to the nginx serving directory.

Production Environment File

Production uses the .env file on optical-web-1. Key differences from .env.example:

| Variable | Production value |
|---|---|
| APP_ENV | production |
| COOKIE_SECURE | true |
| COOKIE_DOMAIN | optical-dev.oliver.solutions |
| All API keys | Real secret values |

3. Service Operations

View Logs

docker logs accessible-video-api -f --tail=100
docker logs accessible-video-worker -f --tail=100
docker logs accessible-video-tts-worker -f --tail=100
docker logs accessible-video-ffmpeg-worker -f --tail=100
docker logs accessible-video-whisper-worker -f --tail=100

Restart a Single Service

docker compose restart api
docker compose restart worker
docker compose restart tts-worker
docker compose restart ffmpeg-worker
docker compose restart whisper-worker

Restart All Services

docker compose down && docker compose up -d

Rebuild a Single Service

docker compose build api && docker compose up -d api
docker compose build worker && docker compose up -d worker

Check Running Services

docker compose ps

Check Queue Depths

# Via API (requires admin token)
GET /api/v1/production/queue-stats

# Via Redis CLI
docker exec -it accessible-video-redis redis-cli llen celery

4. Troubleshooting

TTS Worker Crash Loop (Memory)

Symptom: tts-worker container restarts; OOM errors in logs.

Cause: TTS_WORKER_CONCURRENCY × per-process memory exceeds available RAM.

Fix: Lower TTS_WORKER_CONCURRENCY in .env (recommended: 2 for 512 MB containers), then:

docker compose stop tts-worker
# edit .env: TTS_WORKER_CONCURRENCY=2
docker compose up -d tts-worker
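
The cause above is plain multiplication, so the safe concurrency can be estimated before editing .env. The sketch below is illustrative only: the function name and the 250 MB per-process figure are assumptions, not measured values from this platform. Measure the real per-process peak (for example with `docker stats`) before relying on the result.

```python
def max_safe_concurrency(container_limit_mb: int,
                         per_process_mb: int,
                         overhead_mb: int = 0) -> int:
    """Largest worker count whose combined memory fits inside the container limit."""
    if per_process_mb <= 0:
        raise ValueError("per_process_mb must be positive")
    usable_mb = container_limit_mb - overhead_mb
    # Integer division: a partially fitting process still triggers the OOM killer.
    return max(usable_mb // per_process_mb, 0)


# Hypothetical figure: assume each TTS process peaks at about 250 MB.
print(max_safe_concurrency(512, 250))   # 2
print(max_safe_concurrency(1024, 250))  # 4
```

Under that assumed per-process figure, a 512 MB container fits two processes, which lines up with the recommended TTS_WORKER_CONCURRENCY=2.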

Whisper Worker OOM

Symptom: whisper-worker killed with exit code 137.

Cause: Whisper large-v3 requires ~46 GB RAM; container limit is 8 GB.

Fix: Ensure host has sufficient free RAM, or switch to Cloud Run mode via WHISPER_SERVICE_URL.
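
Exit code 137 follows the shell convention of 128 plus the signal number, so it means the process received signal 9 (SIGKILL), which is what the kernel OOM killer sends. A small sketch for decoding exit codes, assuming a POSIX host; the helper name is hypothetical:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a human-readable cause."""
    if code > 128:
        try:
            # Codes above 128 conventionally encode 128 + signal number.
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            return f"exited with status {code} (unknown signal)"
    return f"exited with status {code}"


print(describe_exit_code(137))  # killed by SIGKILL
print(describe_exit_code(1))    # exited with status 1
```

SIGKILL is not unique to OOM (a forced `docker kill` produces the same code), so it is worth confirming with `docker inspect --format '{{.State.OOMKilled}}' accessible-video-whisper-worker` before changing memory settings.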

Stuck Jobs

Symptom: Job stays in ingesting or ai_processing indefinitely.

Steps:

  1. Check worker logs for errors
  2. Admin API: POST /api/v1/admin/maintenance/reprocess-job/{job_id}
  3. Or: POST /api/v1/jobs/{job_id}/retry
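
Steps 2 and 3 can be scripted. The sketch below only builds the two POST requests and does not send them; the `Authorization: Bearer` header is an assumption about how the admin token is passed, and the function name is hypothetical. Check the Swagger UI for the actual auth scheme.

```python
from urllib.request import Request


def recovery_requests(api_base_url: str, job_id: str, token: str) -> list[Request]:
    """Build the reprocess and retry requests for a stuck job, in the order to try them."""
    base = api_base_url.rstrip("/")
    # Assumption: the API accepts a bearer token in the Authorization header.
    headers = {"Authorization": f"Bearer {token}"}
    return [
        Request(f"{base}/api/v1/admin/maintenance/reprocess-job/{job_id}",
                method="POST", headers=headers),
        Request(f"{base}/api/v1/jobs/{job_id}/retry",
                method="POST", headers=headers),
    ]


for req in recovery_requests("http://localhost:8012", "JOB_ID", "TOKEN"):
    print(req.get_method(), req.full_url)
```

To actually send one, pass it to `urllib.request.urlopen(req)` and inspect the response status.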

MongoDB Connection Failure

Symptom: API returns 500; logs show ServerSelectionTimeoutError.

Steps:

  1. docker compose ps — check mongodb container status
  2. docker logs accessible-video-mongodb --tail=50
  3. Confirm MONGODB_URI in .env matches the running container
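
For step 3, it helps to pull the hostname out of MONGODB_URI and compare it with the service names shown by `docker compose ps`. A minimal sketch; the example URIs and the `mongodb` hostname are assumptions for illustration, and IPv6 literals are not handled:

```python
from urllib.parse import urlsplit


def mongo_hosts(uri: str) -> list[str]:
    """Return the hostnames named in a MongoDB connection string."""
    netloc = urlsplit(uri).netloc
    # Drop optional "user:password@" credentials.
    hostinfo = netloc.rpartition("@")[2]
    # A URI may list several "host:port" members separated by commas.
    return [member.rsplit(":", 1)[0] for member in hostinfo.split(",") if member]


# Hypothetical URI; inside the compose network the host should be a service name.
print(mongo_hosts("mongodb://mongodb:27017/accessible_video"))  # ['mongodb']
```

A hostname of `localhost` inside a container is a common mismatch: it points at the API container itself, not at the MongoDB container.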

Redis Connection Failure

Symptom: Celery tasks not executing; redis.exceptions.ConnectionError in logs.

Steps:

  1. docker exec -it accessible-video-redis redis-cli ping — should return PONG
  2. docker compose restart redis
  3. docker compose restart worker tts-worker ffmpeg-worker whisper-worker

GCS Access Denied

Symptom: 403 Forbidden from GCS; files not uploading.

Steps:

  1. Verify secrets/gcp-credentials.json exists and is bind-mounted
  2. Confirm service account has Storage Object Admin on GCS_BUCKET
  3. Check GCP_PROJECT_ID and GCS_BUCKET in .env

Celery Worker Not Processing Queue

Symptom: Jobs queued but workers idle.

Steps:

  1. docker compose ps — check worker containers running
  2. Check worker logs for import errors at startup
  3. Verify CELERY_BROKER_URL resolves to Redis within the compose network

WebSocket Disconnects / Reconnect Storms (optical-web-1)

Symptom: Users experience frequent WebSocket disconnections followed by rapid reconnect attempts visible in browser DevTools Network tab.

Root cause: Apache mod_proxy_wstunnel on optical-web-1 has a ProxyTimeout that drops idle WebSocket connections. The client ping interval (20 s) and server keepalive frame (20 s) are designed to prevent this, but only if Apache's timeout sits well above 20 s; a timeout only slightly above the keepalive cadence still drops connections.

Recommended Apache config (verify with DevOps before applying):

# In the VirtualHost block for the API
ProxyTimeout 60

Do not set ProxyTimeout below 30 s. The Mod Comms 2026-03-18 incident showed that 25 s was insufficient through mod_proxy_wstunnel — the idle timer fires on the proxy side before the client ping arrives. 60 s provides a comfortable margin above the 20 s bidirectional keepalive cadence.

Verification after change:

  1. Open DevTools → Network → WS tab
  2. Connect to any job and let it sit idle for 2 minutes
  3. Confirm no close frames and no reconnect attempts appear
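
The timing rule in this section can be expressed as a quick check to run against a proposed ProxyTimeout value. This is a sketch of the reasoning only: the function name and verdict strings are hypothetical, while the 20 s keepalive and 30 s floor come from the text above.

```python
def proxy_timeout_verdict(proxy_timeout_s: float,
                          keepalive_s: float = 20.0,
                          floor_s: float = 30.0) -> str:
    """Classify a ProxyTimeout value against the keepalive cadence."""
    if proxy_timeout_s <= keepalive_s:
        return "too low: idle timer fires before any keepalive arrives"
    if proxy_timeout_s < floor_s:
        return "risky: above the keepalive interval but below the 30 s floor"
    return "ok"


print(proxy_timeout_verdict(60))  # ok
print(proxy_timeout_verdict(25))  # risky: above the keepalive interval but below the 30 s floor
```

At 60 s, three keepalive intervals fit inside one timeout window, so a single delayed ping does not close the connection.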

5. Environment Variables

Copy from .env.example. Variables marked Yes in the Required column must be set; the rest are optional.

| Variable | Default | Required | Description |
|---|---|---|---|
| APP_ENV | dev | Yes | dev or production |
| API_BASE_URL | | Yes | Public API base URL |
| JWT_SECRET | | Yes | Random secret; rotation invalidates all sessions |
| JWT_ALG | HS256 | No | JWT signing algorithm |
| JWT_ACCESS_TTL_MIN | 240 | No | Access token TTL (minutes) |
| JWT_REFRESH_TTL_DAYS | 7 | No | Refresh token TTL (days) |
| COOKIE_DOMAIN | optical-dev.oliver.solutions | Yes | Refresh cookie domain |
| COOKIE_SECURE | true | No | Set false for local HTTP |
| COOKIE_SAMESITE | Lax | No | |
| MONGODB_URI | | Yes | MongoDB connection string |
| MONGODB_DB | accessible_video | No | Database name |
| REDIS_URL | redis://redis:6379/0 | Yes | |
| CELERY_BROKER_URL | redis://redis:6379/0 | Yes | Same as REDIS_URL |
| CELERY_RESULT_BACKEND | redis://redis:6379/0 | Yes | |
| GCP_PROJECT_ID | | Yes | GCP project ID |
| GCS_BUCKET | accessible-video | Yes | GCS bucket name |
| GOOGLE_APPLICATION_CREDENTIALS | /secrets/gcp-credentials.json | Yes | Path to service account JSON |
| GEMINI_API_KEY | | Yes | Gemini 2.5 Pro API key |
| TRANSLATE_API_KEY | | No | Google Translate API key |
| ELEVENLABS_API_KEY | | No | ElevenLabs API key |
| GOOGLE_TTS_CREDENTIALS | /secrets/gcp-credentials.json | No | Separate TTS credentials if needed |
| SENDGRID_API_KEY | | No | SendGrid API key |
| EMAIL_FROM | noreply@optical-dev.oliver.solutions | No | Sender address |
| CLIENT_BASE_URL | | No | Frontend URL for email links |
| AZURE_CLIENT_ID | | No | Microsoft SSO client ID |
| AZURE_AUTHORITY | | No | Microsoft tenant authority URL |
| AZURE_REDIRECT_URI | | No | Microsoft OIDC redirect URI |
| CORS_ORIGINS | localhost variants | Yes | Comma-separated allowed origins |
| SENTRY_DSN | | No | Sentry DSN |
| OTEL_EXPORTER_OTLP_ENDPOINT | | No | OpenTelemetry collector endpoint |
| COST_TRACKER_BASE_URL | | No | AI cost tracker API URL |
| COST_TRACKER_API_KEY | | No | AI cost tracker API key |
| COST_TRACKER_SOURCE_APP | video-accessibility | No | App identifier |
| COST_TRACKER_ENABLED | true | No | Enable/disable cost tracking |
| WORKER_CONCURRENCY | 8 | No | General worker concurrency |
| TTS_WORKER_CONCURRENCY | 2 | No | TTS worker concurrency |
| FFMPEG_WORKER_CONCURRENCY | 1 | No | FFmpeg worker concurrency |
| WHISPER_WORKER_CONCURRENCY | 1 | No | Whisper worker concurrency |
| FFMPEG_SERVICE_URL | | No | Cloud Run FFmpeg service URL |
| WHISPER_SERVICE_URL | | No | Cloud Run Whisper service URL |
| WHISPER_MODEL | medium | No | Whisper model size |
| USE_CELERY_FALLBACK | false | No | Force local Celery instead of Cloud Run |
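
Before starting services, it can be useful to check that every variable listed as required above has a value. The sketch below is a simplified illustration, not part of the repository: the function names are hypothetical, and the parser handles only plain `KEY=value` lines, not the full dotenv syntax.

```python
# Variables marked "Yes" in the Required column of the table above.
REQUIRED = [
    "APP_ENV", "API_BASE_URL", "JWT_SECRET", "COOKIE_DOMAIN", "MONGODB_URI",
    "REDIS_URL", "CELERY_BROKER_URL", "CELERY_RESULT_BACKEND", "GCP_PROJECT_ID",
    "GCS_BUCKET", "GOOGLE_APPLICATION_CREDENTIALS", "GEMINI_API_KEY", "CORS_ORIGINS",
]


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_required(values: dict[str, str]) -> list[str]:
    """Return required variable names that are absent or empty."""
    return [name for name in REQUIRED if not values.get(name)]


sample = "APP_ENV=dev\nJWT_SECRET=\nMONGODB_URI=mongodb://mongodb:27017\n"
print(missing_required(parse_env(sample))[:2])  # ['API_BASE_URL', 'JWT_SECRET']
```

To check a real file, read it with `pathlib.Path(".env.local").read_text()` and pass the text to `parse_env`.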

6. Rollback

Code Rollback

Check out the previous commit and rebuild:

git log --oneline -10
git checkout <previous-commit>
docker compose build && docker compose up -d

JWT Secret Rotation

  1. Generate: openssl rand -hex 32
  2. Update JWT_SECRET in .env
  3. docker compose restart api
  4. All existing sessions are invalidated — users must re-login
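
Where openssl is not available, step 1 has an equivalent in the Python standard library. A minimal sketch; the function name is hypothetical:

```python
import secrets


def new_jwt_secret(num_bytes: int = 32) -> str:
    """Return a random hex string, equivalent in form to `openssl rand -hex 32`."""
    # token_hex draws from the OS cryptographic random source.
    return secrets.token_hex(num_bytes)


secret = new_jwt_secret()
print(len(secret))  # 64
```

Thirty-two random bytes encode to 64 hex characters. Paste the value into JWT_SECRET and continue with step 3.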

Maintenance

Last Updated: 2026-05-01

Update Triggers:

  • New script added to scripts/
  • Deployment target changes
  • New environment variable required
  • New Docker service added

Verification:

  • ./scripts/run-local.sh flags match actual script
  • Environment variable table complete vs .env.example
  • Worker env var names match docker-compose.yml
  • Troubleshooting container names match compose service names