# Runbook — Accessible Video Processing Platform

**Generated:** 2026-05-01

---

## Quick Navigation

- [Docs Hub](../README.md)
- [Infrastructure](infrastructure.md)
- [Architecture](architecture.md)
- [Local Dev Setup](#1-local-development-setup)
- [Deployment](#2-deployment-optical-web-1)
- [Service Operations](#3-service-operations)
- [Troubleshooting](#4-troubleshooting)
- [Environment Variables](#5-environment-variables)

## Agent Entry

| Signal | Value |
|--------|-------|
| Purpose | Step-by-step procedures for running, deploying, and troubleshooting the platform |
| Read When | Local setup, deployment, restart, or incident diagnosis |
| Skip When | You need architecture understanding → architecture.md; inventory → infrastructure.md |
| Canonical | Yes |
| Next Docs | [Infrastructure](infrastructure.md), [Architecture](architecture.md) |
| Primary Sources | `scripts/run-local.sh`, `docker-compose.yml`, `.env.example` |

---

## 1. Local Development Setup

### Prerequisites

- Docker Desktop (with `docker compose` v2)
- Node.js 20+ and npm
- GCP credentials JSON at `secrets/gcp-credentials.json`
- `.env.local` file (copy from `.env.example`, then fill in the secrets)

### Backend (Docker)

```bash
# Start all backend services (API, workers, MongoDB, Redis)
./scripts/run-local.sh

# Force image rebuild after code changes
./scripts/run-local.sh --rebuild

# Stop all services
./scripts/run-local.sh --stop

# Restart
./scripts/run-local.sh --restart
```

The script uses `docker-compose.yml` + `docker-compose.local.yml` with `.env.local`.

After startup:

- API: `http://localhost:8012`
- Swagger UI: `http://localhost:8012/docs`

### Frontend (Vite dev server)

```bash
cd frontend
npm install
npm run dev
```

The frontend runs on `http://localhost:5173` by default.

### Run Migrations

```bash
docker exec -it accessible-video-api python migrate.py
```

### Create Test Users

```bash
docker exec -it accessible-video-api python create_test_users.py
```

---

## 2. Deployment (optical-web-1)

> **RULE:** Never SSH into optical-web-1 or run commands on it without explicit user instruction.

### Deploy Script

```bash
./scripts/deploy-dev.sh
```

### Frontend Build

```bash
./scripts/build-frontend.sh
```

Builds the React SPA and copies `dist/` to the nginx serving directory.

### Production Environment File

Production uses the `.env` file on optical-web-1. Key differences from `.env.example`:

| Variable | Production value |
|----------|-----------------|
| `APP_ENV` | `production` |
| `COOKIE_SECURE` | `true` |
| `COOKIE_DOMAIN` | `optical-dev.oliver.solutions` |
| All API keys | Real secret values |

---

## 3. Service Operations

### View Logs

```bash
docker logs accessible-video-api -f --tail=100
docker logs accessible-video-worker -f --tail=100
docker logs accessible-video-tts-worker -f --tail=100
docker logs accessible-video-ffmpeg-worker -f --tail=100
docker logs accessible-video-whisper-worker -f --tail=100
```

### Restart a Single Service

```bash
docker compose restart api
docker compose restart worker
docker compose restart tts-worker
docker compose restart ffmpeg-worker
docker compose restart whisper-worker
```

### Restart All Services

```bash
docker compose down && docker compose up -d
```

### Rebuild a Single Service

```bash
docker compose build api && docker compose up -d api
docker compose build worker && docker compose up -d worker
```

### Check Running Services

```bash
docker compose ps
```

### Check Queue Depths

```bash
# Via API (requires admin token), send this HTTP request:
#   GET /api/v1/production/queue-stats

# Via Redis CLI
docker exec -it accessible-video-redis redis-cli llen celery
```

---

## 4. Troubleshooting

### TTS Worker Crash Loop (Memory)

**Symptom:** `tts-worker` container restarts; OOM errors in logs.

**Cause:** `TTS_WORKER_CONCURRENCY` × per-process memory exceeds available RAM.

**Fix:** Lower `TTS_WORKER_CONCURRENCY` in `.env` (recommended: 2 for 512 MB containers), then:

```bash
docker compose stop tts-worker
# edit .env: TTS_WORKER_CONCURRENCY=2
docker compose up -d tts-worker
```

### Whisper Worker OOM

**Symptom:** `whisper-worker` killed with exit code 137.

**Cause:** Whisper `large-v3` requires ~4–6 GB RAM; container limit is 8 GB.

**Fix:** Ensure the host has sufficient free RAM, or switch to Cloud Run mode via `WHISPER_SERVICE_URL`.

### Stuck Jobs

**Symptom:** Job stays in `ingesting` or `ai_processing` indefinitely.

**Steps:**

1. Check worker logs for errors
2. Admin API: `POST /api/v1/admin/maintenance/reprocess-job/{job_id}`
3. Or: `POST /api/v1/jobs/{job_id}/retry`

### MongoDB Connection Failure

**Symptom:** API returns 500; logs show `ServerSelectionTimeoutError`.

**Steps:**

1. `docker compose ps`: check the mongodb container status
2. `docker logs accessible-video-mongodb --tail=50`
3. Confirm `MONGODB_URI` in `.env` matches the running container

### Redis Connection Failure

**Symptom:** Celery tasks not executing; `redis.exceptions.ConnectionError` in logs.

**Steps:**

1. `docker exec -it accessible-video-redis redis-cli ping`: should return `PONG`
2. `docker compose restart redis`
3. `docker compose restart worker tts-worker ffmpeg-worker whisper-worker`

### GCS Access Denied

**Symptom:** `403 Forbidden` from GCS; files not uploading.

**Steps:**

1. Verify `secrets/gcp-credentials.json` exists and is bind-mounted
2. Confirm the service account has `Storage Object Admin` on `GCS_BUCKET`
3. Check `GCP_PROJECT_ID` and `GCS_BUCKET` in `.env`

### Celery Worker Not Processing Queue

**Symptom:** Jobs queued but workers idle.

**Steps:**

1. `docker compose ps`: check that the worker containers are running
2. Check worker logs for import errors at startup
3. Verify `CELERY_BROKER_URL` resolves to Redis within the compose network

---

### WebSocket Disconnects / Reconnect Storms (optical-web-1)

**Symptom:** Users experience frequent WebSocket disconnections followed by rapid reconnect attempts, visible in the browser DevTools Network tab.

**Root cause:** Apache `mod_proxy_wstunnel` on optical-web-1 has a `ProxyTimeout` that drops idle WebSocket connections. The client ping interval (20 s) and server keepalive frame (20 s) are designed to prevent this, but only if Apache's timeout is above 20 s.

**Recommended Apache config** (verify with DevOps before applying):

```apache
# In the VirtualHost block for the API
ProxyTimeout 60
```

> **Do not set ProxyTimeout below 30 s.** The Mod Comms 2026-03-18 incident showed that 25 s was insufficient through mod_proxy_wstunnel: the idle timer fires on the _proxy_ side before the client ping arrives. 60 s provides a comfortable margin above the 20 s bidirectional keepalive cadence.

**Verification after change:**

1. Open DevTools → Network → WS tab
2. Connect to any job and let it sit idle for 2 minutes
3. Confirm no `close` frames and no reconnect attempts appear

---

## 5. Environment Variables

Copy from `.env.example`. Variables marked `Yes` in the Required column must be set; the rest are optional.

| Variable | Default | Required | Description |
|----------|---------|----------|-------------|
| `APP_ENV` | `dev` | Yes | `dev` or `production` |
| `API_BASE_URL` | — | Yes | Public API base URL |
| `JWT_SECRET` | — | **Yes** | Random secret; rotation invalidates all sessions |
| `JWT_ALG` | `HS256` | No | JWT signing algorithm |
| `JWT_ACCESS_TTL_MIN` | `240` | No | Access token TTL (minutes) |
| `JWT_REFRESH_TTL_DAYS` | `7` | No | Refresh token TTL (days) |
| `COOKIE_DOMAIN` | `optical-dev.oliver.solutions` | Yes | Refresh cookie domain |
| `COOKIE_SECURE` | `true` | No | Set `false` for local HTTP |
| `COOKIE_SAMESITE` | `Lax` | No | |
| `MONGODB_URI` | — | Yes | MongoDB connection string |
| `MONGODB_DB` | `accessible_video` | No | Database name |
| `REDIS_URL` | `redis://redis:6379/0` | Yes | |
| `CELERY_BROKER_URL` | `redis://redis:6379/0` | Yes | Same as REDIS_URL |
| `CELERY_RESULT_BACKEND` | `redis://redis:6379/0` | Yes | |
| `GCP_PROJECT_ID` | — | Yes | GCP project ID |
| `GCS_BUCKET` | `accessible-video` | Yes | GCS bucket name |
| `GOOGLE_APPLICATION_CREDENTIALS` | `/secrets/gcp-credentials.json` | Yes | Path to service account JSON |
| `GEMINI_API_KEY` | — | Yes | Gemini 2.5 Pro API key |
| `TRANSLATE_API_KEY` | — | No | Google Translate API key |
| `ELEVENLABS_API_KEY` | — | No | ElevenLabs API key |
| `GOOGLE_TTS_CREDENTIALS` | `/secrets/gcp-credentials.json` | No | Separate TTS credentials if needed |
| `SENDGRID_API_KEY` | — | No | SendGrid API key |
| `EMAIL_FROM` | `noreply@optical-dev.oliver.solutions` | No | Sender address |
| `CLIENT_BASE_URL` | — | No | Frontend URL for email links |
| `AZURE_CLIENT_ID` | — | No | Microsoft SSO client ID |
| `AZURE_AUTHORITY` | — | No | Microsoft tenant authority URL |
| `AZURE_REDIRECT_URI` | — | No | Microsoft OIDC redirect URI |
| `CORS_ORIGINS` | localhost variants | Yes | Comma-separated allowed origins |
| `SENTRY_DSN` | — | No | Sentry DSN |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | — | No | OpenTelemetry collector endpoint |
| `COST_TRACKER_BASE_URL` | — | No | AI cost tracker API URL |
| `COST_TRACKER_API_KEY` | — | No | AI cost tracker API key |
| `COST_TRACKER_SOURCE_APP` | `video-accessibility` | No | App identifier |
| `COST_TRACKER_ENABLED` | `true` | No | Enable/disable cost tracking |
| `WORKER_CONCURRENCY` | `8` | No | General worker concurrency |
| `TTS_WORKER_CONCURRENCY` | `2` | No | TTS worker concurrency |
| `FFMPEG_WORKER_CONCURRENCY` | `1` | No | FFmpeg worker concurrency |
| `WHISPER_WORKER_CONCURRENCY` | `1` | No | Whisper worker concurrency |
| `FFMPEG_SERVICE_URL` | — | No | Cloud Run FFmpeg service URL |
| `WHISPER_SERVICE_URL` | — | No | Cloud Run Whisper service URL |
| `WHISPER_MODEL` | `medium` | No | Whisper model size |
| `USE_CELERY_FALLBACK` | `false` | No | Force local Celery instead of Cloud Run |

---

## 6. Rollback

### Code Rollback

Check out the previous commit and rebuild:

```bash
git log --oneline -10
# Replace <commit> with the hash of the commit to roll back to, taken from the log above
git checkout <commit>
docker compose build && docker compose up -d
```

### JWT Secret Rotation

1. Generate: `openssl rand -hex 32`
2. Update `JWT_SECRET` in `.env`
3. `docker compose restart api`. If the secret reaches the container through Compose (`env_file` or `environment`), recreate the container with `docker compose up -d --force-recreate api` instead, because `restart` does not reload environment variables.
4. All existing sessions are invalidated, so users must log in again

---

## Maintenance

**Last Updated:** 2026-05-01

**Update Triggers:**

- New script added to `scripts/`
- Deployment target changes
- New environment variable required
- New Docker service added

**Verification:**

- [ ] `./scripts/run-local.sh` flags match actual script
- [ ] Environment variable table complete vs `.env.example`
- [ ] Worker env var names match `docker-compose.yml`
- [ ] Troubleshooting container names match compose service names
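
The environment-variable item in the verification checklist can be checked mechanically. The following is a minimal sketch, not part of the project's tooling: the function name `missing_env_docs` is hypothetical, and it assumes that `.env.example` declares variables as uncommented `KEY=value` lines with uppercase names and that the runbook table writes each name as `` | `NAME` | ``.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every variable declared in an env file
# that does not appear as a backticked first-column entry in the runbook table.
# Usage: missing_env_docs <env-file> <runbook-file>
missing_env_docs() {
  local env_file="$1" runbook="$2" name
  # Extract names from uncommented KEY=value lines (assumed format).
  grep -oE '^[A-Z][A-Z0-9_]*=' "$env_file" | tr -d '=' | sort -u |
  while read -r name; do
    # Table rows are assumed to look like: | `NAME` | default | ...
    grep -qF "| \`${name}\` |" "$runbook" || printf '%s\n' "$name"
  done
}
```

Called as `missing_env_docs .env.example <path-to-this-runbook>`, it prints one undocumented variable name per line and prints nothing when the table covers every variable in the file.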