video-accessibility/docs/project/runbook.md
Vadym Samoilenko 2f4925353a feat(pause-insert): adaptive buffer, forward-snap, timeline drag + share link fix
2026-05-01 16:09:09 +01:00

# Runbook — Accessible Video Processing Platform
<!-- SCOPE: Operational procedures — local dev setup, deployment, service restart, troubleshooting, rollback. No architecture rationale (see architecture.md). -->
<!-- DOC_KIND: how-to -->
<!-- DOC_ROLE: canonical -->
<!-- READ_WHEN: Read when setting up locally, deploying to optical-web-1, restarting services, or diagnosing an incident. -->
<!-- SKIP_WHEN: Skip when you need architecture understanding → architecture.md; infrastructure inventory → infrastructure.md. -->
<!-- PRIMARY_SOURCES: scripts/run-local.sh, docker-compose.yml, .env.example, scripts/deploy-dev.sh -->
**Generated:** 2026-05-01

---
## Quick Navigation
- [Docs Hub](../README.md)
- [Infrastructure](infrastructure.md)
- [Architecture](architecture.md)
- [Local Dev Setup](#1-local-development-setup)
- [Deployment](#2-deployment-optical-web-1)
- [Service Operations](#3-service-operations)
- [Troubleshooting](#4-troubleshooting)
- [Environment Variables](#5-environment-variables)
- [Rollback](#6-rollback)
## Agent Entry
| Signal | Value |
|--------|-------|
| Purpose | Step-by-step procedures for running, deploying, and troubleshooting the platform |
| Read When | Local setup, deployment, restart, or incident diagnosis |
| Skip When | You need architecture understanding → architecture.md; inventory → infrastructure.md |
| Canonical | Yes |
| Next Docs | [Infrastructure](infrastructure.md), [Architecture](architecture.md) |
| Primary Sources | `scripts/run-local.sh`, `docker-compose.yml`, `.env.example`, `scripts/deploy-dev.sh` |
---
## 1. Local Development Setup
### Prerequisites
- Docker Desktop (with `docker compose` v2)
- Node.js 20+ and npm
- GCP credentials JSON at `secrets/gcp-credentials.json`
- `.env.local` file (copy from `.env.example`, fill secrets)
### Backend (Docker)
```bash
# Start all backend services (API, workers, MongoDB, Redis)
./scripts/run-local.sh
# Force image rebuild after code changes
./scripts/run-local.sh --rebuild
# Stop all services
./scripts/run-local.sh --stop
# Restart
./scripts/run-local.sh --restart
```
The script uses `docker-compose.yml` + `docker-compose.local.yml` with `.env.local`.
After startup:
- API: `http://localhost:8012`
- Swagger UI: `http://localhost:8012/docs`
### Frontend (Vite dev server)
```bash
cd frontend
npm install
npm run dev
```
Frontend runs on `http://localhost:5173` by default.
### Run Migrations
```bash
docker exec -it accessible-video-api python migrate.py
```
### Create Test Users
```bash
docker exec -it accessible-video-api python create_test_users.py
```
---
## 2. Deployment (optical-web-1)
> **RULE:** Never SSH into optical-web-1 or run commands on it without explicit user instruction.
### Deploy Script
```bash
./scripts/deploy-dev.sh
```
### Frontend Build
```bash
./scripts/build-frontend.sh
```
Builds the React SPA and copies `dist/` to the nginx serving directory.
### Production Environment File
Production uses the `.env` file on optical-web-1. Key differences from `.env.example`:
| Variable | Production value |
|----------|-----------------|
| `APP_ENV` | `production` |
| `COOKIE_SECURE` | `true` |
| `COOKIE_DOMAIN` | `optical-dev.oliver.solutions` |
| All API keys | Real secret values |
---
## 3. Service Operations
### View Logs
```bash
docker logs accessible-video-api -f --tail=100
docker logs accessible-video-worker -f --tail=100
docker logs accessible-video-tts-worker -f --tail=100
docker logs accessible-video-ffmpeg-worker -f --tail=100
docker logs accessible-video-whisper-worker -f --tail=100
```
### Restart a Single Service
```bash
docker compose restart api
docker compose restart worker
docker compose restart tts-worker
docker compose restart ffmpeg-worker
docker compose restart whisper-worker
```
### Restart All Services
```bash
docker compose down && docker compose up -d
```
### Rebuild a Single Service
```bash
docker compose build api && docker compose up -d api
docker compose build worker && docker compose up -d worker
```
### Check Running Services
```bash
docker compose ps
```
### Check Queue Depths
```bash
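# Via API (requires admin token; the Bearer header and ADMIN_TOKEN variable are assumptions)
curl -H "Authorization: Bearer $ADMIN_TOKEN" http://localhost:8012/api/v1/production/queue-stats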
# Via Redis CLI
docker exec -it accessible-video-redis redis-cli llen celery
```
---
## 4. Troubleshooting
### TTS Worker Crash Loop (Memory)
**Symptom:** `tts-worker` container restarts; OOM errors in logs.
**Cause:** `TTS_WORKER_CONCURRENCY` × per-process memory exceeds available RAM.
**Fix:** Lower `TTS_WORKER_CONCURRENCY` in `.env` (recommended: 2 for 512 MB containers), then:
```bash
docker compose stop tts-worker
# edit .env: TTS_WORKER_CONCURRENCY=2
docker compose up -d tts-worker
```
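The cause above is plain arithmetic, so a candidate concurrency can be sanity-checked before editing `.env`. The helper below is a hypothetical sketch, not a script from this repo, and the 256 MB per-process figure is an assumption inferred from the "2 for 512 MB" recommendation; measure the real footprint (for example with `docker stats`) before relying on it.

```bash
# Hypothetical helper: largest concurrency that fits in a container memory limit.
# Usage: max_tts_concurrency <container_limit_mb> <per_process_mb>
max_tts_concurrency() {
  limit_mb=$1
  per_process_mb=$2
  # Integer division rounds down, so the result never exceeds the limit.
  n=$(( limit_mb / per_process_mb ))
  # A worker needs at least one process, even if that alone exceeds the limit.
  [ "$n" -lt 1 ] && n=1
  echo "$n"
}

# Assumed figures: 512 MB container, about 256 MB per TTS process.
max_tts_concurrency 512 256   # prints 2
```

If the printed value is lower than the current `TTS_WORKER_CONCURRENCY`, either lower the setting or raise the container memory limit.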
### Whisper Worker OOM
**Symptom:** `whisper-worker` killed with exit code 137.
**Cause:** Whisper `large-v3` requires ~4-6 GB RAM; container limit is 8 GB.
**Fix:** Ensure host has sufficient free RAM, or switch to Cloud Run mode via `WHISPER_SERVICE_URL`.
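
Exit code 137 in the symptom is 128 + 9, meaning the process was terminated by signal 9 (SIGKILL), which is what the kernel OOM killer sends. The sketch below only illustrates that arithmetic; the function name is hypothetical, and the commented `docker inspect` line shows one way to confirm the kill was memory-related.

```bash
# Exit codes above 128 mean "terminated by signal (code - 128)".
signal_from_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo $(( code - 128 ))
  else
    echo 0   # normal exit, no signal involved
  fi
}

signal_from_exit_code 137   # prints 9 (SIGKILL)

# To confirm the kill was memory-related (not run here):
#   docker inspect --format '{{.State.OOMKilled}}' accessible-video-whisper-worker
```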
### Stuck Jobs
**Symptom:** Job stays in `ingesting` or `ai_processing` indefinitely.
**Steps:**
1. Check worker logs for errors
2. Admin API: `POST /api/v1/admin/maintenance/reprocess-job/{job_id}`
3. Or: `POST /api/v1/jobs/{job_id}/retry`
### MongoDB Connection Failure
**Symptom:** API returns 500; logs show `ServerSelectionTimeoutError`.
**Steps:**
1. `docker compose ps` — check mongodb container status
2. `docker logs accessible-video-mongodb --tail=50`
3. Confirm `MONGODB_URI` in `.env` matches the running container
### Redis Connection Failure
**Symptom:** Celery tasks not executing; `redis.exceptions.ConnectionError` in logs.
**Steps:**
1. `docker exec -it accessible-video-redis redis-cli ping` — should return `PONG`
2. `docker compose restart redis`
3. `docker compose restart worker tts-worker ffmpeg-worker whisper-worker`
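
When scripting the steps above, it may help to wait until Redis answers before restarting the workers, so they do not start against a broker that is still coming up. The retry helper below is a hypothetical sketch, not part of the repo; the attempt count and delay in the commented usage are assumptions.

```bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <max_attempts> <delay_seconds> <command...>
wait_until() {
  max_attempts=$1
  delay_s=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Success ends the wait immediately.
    "$@" >/dev/null 2>&1 && return 0
    attempt=$(( attempt + 1 ))
    sleep "$delay_s"
  done
  return 1
}

# Intended sequence for the steps above (not run here; -it is dropped
# because a script has no TTY):
#   docker compose restart redis
#   wait_until 10 2 docker exec accessible-video-redis redis-cli ping &&
#     docker compose restart worker tts-worker ffmpeg-worker whisper-worker
```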
### GCS Access Denied
**Symptom:** `403 Forbidden` from GCS; files not uploading.
**Steps:**
1. Verify `secrets/gcp-credentials.json` exists and is bind-mounted
2. Confirm service account has `Storage Object Admin` on `GCS_BUCKET`
3. Check `GCP_PROJECT_ID` and `GCS_BUCKET` in `.env`
### Celery Worker Not Processing Queue
**Symptom:** Jobs queued but workers idle.
**Steps:**
1. `docker compose ps` — check worker containers running
2. Check worker logs for import errors at startup
3. Verify `CELERY_BROKER_URL` resolves to Redis within the compose network
### WebSocket Disconnects / Reconnect Storms (optical-web-1)
**Symptom:** Users experience frequent WebSocket disconnections followed by rapid reconnect attempts visible in browser DevTools Network tab.
**Root cause:** Apache `mod_proxy_wstunnel` on optical-web-1 applies a `ProxyTimeout` that drops idle WebSocket connections. The client ping (every 20 s) and the server keepalive frame (every 20 s) are designed to prevent this, but they only work if Apache's timeout leaves a clear margin above that 20 s cadence.
**Recommended Apache config** (verify with DevOps before applying):
```apache
# In the VirtualHost block for the API
ProxyTimeout 60
```
> **Do not set ProxyTimeout below 30 s.** The Mod Comms 2026-03-18 incident showed that 25 s was insufficient through mod_proxy_wstunnel — the idle timer fires on the _proxy_ side before the client ping arrives. 60 s provides a comfortable margin above the 20 s bidirectional keepalive cadence.
**Verification after change:**
1. Open DevTools → Network → WS tab
2. Connect to any job and let it sit idle for 2 minutes
3. Confirm no `close` frames and no reconnect attempts appear
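
Before the manual check above, a copy of the Apache config can be audited against the 30 s floor stated in this section. The helper below is a hypothetical sketch: the config path is not given in this runbook, so it is passed as an argument, and the script simply takes the last uncommented `ProxyTimeout` in the file without distinguishing between VirtualHost blocks.

```bash
# Hypothetical audit helper: read ProxyTimeout from an Apache config file and
# compare it with the 30 s floor stated in this runbook.
check_proxy_timeout() {
  conf_file=$1
  # Directive names are case-insensitive; commented lines start with "#"
  # and therefore never match. The last match in the file wins.
  value=$(awk 'tolower($1) == "proxytimeout" { v = $2 } END { print v }' "$conf_file")
  if [ -z "$value" ]; then
    echo "ProxyTimeout not set in $conf_file"
    return 1
  fi
  if [ "$value" -lt 30 ]; then
    echo "ProxyTimeout $value s is below the 30 s floor"
    return 1
  fi
  echo "ProxyTimeout $value s is OK"
}

# Usage (path is an assumption; run against a local copy of the config):
#   check_proxy_timeout ./apache-vhost.conf
```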
---
## 5. Environment Variables
Copy `.env.example` and fill in the values. The Required column shows which variables must be set; variables marked No can be left at their default or unset.
| Variable | Default | Required | Description |
|----------|---------|----------|-------------|
| `APP_ENV` | `dev` | Yes | `dev` or `production` |
| `API_BASE_URL` | — | Yes | Public API base URL |
| `JWT_SECRET` | — | **Yes** | Random secret; rotation invalidates all sessions |
| `JWT_ALG` | `HS256` | No | JWT signing algorithm |
| `JWT_ACCESS_TTL_MIN` | `240` | No | Access token TTL (minutes) |
| `JWT_REFRESH_TTL_DAYS` | `7` | No | Refresh token TTL (days) |
| `COOKIE_DOMAIN` | `optical-dev.oliver.solutions` | Yes | Refresh cookie domain |
| `COOKIE_SECURE` | `true` | No | Set `false` for local HTTP |
| `COOKIE_SAMESITE` | `Lax` | No | |
| `MONGODB_URI` | — | Yes | MongoDB connection string |
| `MONGODB_DB` | `accessible_video` | No | Database name |
| `REDIS_URL` | `redis://redis:6379/0` | Yes | |
| `CELERY_BROKER_URL` | `redis://redis:6379/0` | Yes | Same as REDIS_URL |
| `CELERY_RESULT_BACKEND` | `redis://redis:6379/0` | Yes | |
| `GCP_PROJECT_ID` | — | Yes | GCP project ID |
| `GCS_BUCKET` | `accessible-video` | Yes | GCS bucket name |
| `GOOGLE_APPLICATION_CREDENTIALS` | `/secrets/gcp-credentials.json` | Yes | Path to service account JSON |
| `GEMINI_API_KEY` | — | Yes | Gemini 2.5 Pro API key |
| `TRANSLATE_API_KEY` | — | No | Google Translate API key |
| `ELEVENLABS_API_KEY` | — | No | ElevenLabs API key |
| `GOOGLE_TTS_CREDENTIALS` | `/secrets/gcp-credentials.json` | No | Separate TTS credentials if needed |
| `SENDGRID_API_KEY` | — | No | SendGrid API key |
| `EMAIL_FROM` | `noreply@optical-dev.oliver.solutions` | No | Sender address |
| `CLIENT_BASE_URL` | — | No | Frontend URL for email links |
| `AZURE_CLIENT_ID` | — | No | Microsoft SSO client ID |
| `AZURE_AUTHORITY` | — | No | Microsoft tenant authority URL |
| `AZURE_REDIRECT_URI` | — | No | Microsoft OIDC redirect URI |
| `CORS_ORIGINS` | localhost variants | Yes | Comma-separated allowed origins |
| `SENTRY_DSN` | — | No | Sentry DSN |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | — | No | OpenTelemetry collector endpoint |
| `COST_TRACKER_BASE_URL` | — | No | AI cost tracker API URL |
| `COST_TRACKER_API_KEY` | — | No | AI cost tracker API key |
| `COST_TRACKER_SOURCE_APP` | `video-accessibility` | No | App identifier |
| `COST_TRACKER_ENABLED` | `true` | No | Enable/disable cost tracking |
| `WORKER_CONCURRENCY` | `8` | No | General worker concurrency |
| `TTS_WORKER_CONCURRENCY` | `2` | No | TTS worker concurrency |
| `FFMPEG_WORKER_CONCURRENCY` | `1` | No | FFmpeg worker concurrency |
| `WHISPER_WORKER_CONCURRENCY` | `1` | No | Whisper worker concurrency |
| `FFMPEG_SERVICE_URL` | — | No | Cloud Run FFmpeg service URL |
| `WHISPER_SERVICE_URL` | — | No | Cloud Run Whisper service URL |
| `WHISPER_MODEL` | `medium` | No | Whisper model size |
| `USE_CELERY_FALLBACK` | `false` | No | Force local Celery instead of Cloud Run |
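
A quick preflight check can catch an incomplete env file before the containers start. The helper below is a hypothetical sketch, not a script from this repo; its list covers only the rows above that are marked Required and have no default, and it treats any non-empty value (including a quoted empty string) as set.

```bash
# Hypothetical preflight helper: print required variables that are missing or
# empty in an env file. The list covers rows marked Required with no default.
REQUIRED_VARS="API_BASE_URL JWT_SECRET MONGODB_URI GCP_PROJECT_ID GEMINI_API_KEY"

missing_env_vars() {
  env_file=$1
  for name in $REQUIRED_VARS; do
    # Match "NAME=<at least one character>" at the start of a line.
    if ! grep -Eq "^${name}=.+" "$env_file"; then
      echo "$name"
    fi
  done
}

# Usage: prints one name per missing variable; no output means all are set.
#   missing_env_vars .env
```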
---
## 6. Rollback
### Code Rollback
Check out the previous commit and rebuild:
```bash
git log --oneline -10
git checkout <previous-commit>
docker compose build && docker compose up -d
```
### JWT Secret Rotation
1. Generate: `openssl rand -hex 32`
2. Update `JWT_SECRET` in `.env`
3. `docker compose restart api`
4. All existing sessions are invalidated — users must re-login
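
Steps 1 to 3 above can be scripted so the secret never has to be pasted by hand. The helper below is a hypothetical sketch, not a script from this repo: it rewrites the file in place so its permissions are preserved, and it moves `JWT_SECRET` to the end of the file, which assumes the order of entries in `.env` does not matter.

```bash
# Hypothetical helper: replace (or add) the JWT_SECRET line in an env file.
# Usage: set_jwt_secret <env_file> <new_secret>
set_jwt_secret() {
  env_file=$1
  new_secret=$2
  [ -f "$env_file" ] || { echo "no such file: $env_file" >&2; return 1; }
  tmp_file=$(mktemp)   # created with owner-only permissions
  # Copy every line except the old secret, then append the new value.
  { grep -v '^JWT_SECRET=' "$env_file" || true; } > "$tmp_file"
  echo "JWT_SECRET=${new_secret}" >> "$tmp_file"
  # Overwrite in place so the original file keeps its mode and owner.
  cat "$tmp_file" > "$env_file"
  rm -f "$tmp_file"
}

# Intended sequence (not run here):
#   set_jwt_secret .env "$(openssl rand -hex 32)"
#   docker compose restart api
```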
---
## Maintenance
**Last Updated:** 2026-05-01
**Update Triggers:**
- New script added to `scripts/`
- Deployment target changes
- New environment variable required
- New Docker service added
**Verification:**
- [ ] `./scripts/run-local.sh` flags match actual script
- [ ] Environment variable table complete vs `.env.example`
- [ ] Worker env var names match `docker-compose.yml`
- [ ] Troubleshooting container names match compose service names