Commit graph

5 commits

Author SHA1 Message Date
Vadym Samoilenko
24d93277de fix(deploy): restore original memory limits on ffmpeg/whisper workers
faster_whisper loads its model into RAM at startup regardless of whether
tasks are routed to Cloud Run, so reducing the limit to 512M caused an OOM
kill on container start. Restored original limits (ffmpeg: 1G, whisper: 2G).

Cloud Run URLs (FFMPEG_SERVICE_URL / WHISPER_SERVICE_URL) remain set so CPU
offload is still active.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-30 14:32:24 +01:00
Vadym Samoilenko
ec1ce5c13a feat(deploy): offload ffmpeg+whisper to Cloud Run HTTP services on optical-dev
Sets FFMPEG_SERVICE_URL and WHISPER_SERVICE_URL so video_renderer.py and
whisper_transcribe.py route CPU-heavy work to Cloud Run instead of running
ffmpeg/Whisper locally. Both Cloud Run services and IAM (roles/run.invoker
for accessible-video-worker@ and video-accessibility@ SAs) are already
provisioned — only the env vars were missing.
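
A minimal sketch of the env-var routing described above, assuming the workers treat an unset or empty URL as "run locally". The helper name resolve_backend and the example URL are hypothetical; the actual logic in video_renderer.py and whisper_transcribe.py is not shown in this commit.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_backend(
    env_var: str, environ: Optional[Mapping[str, str]] = None
) -> Tuple[str, Optional[str]]:
    """Return ("remote", url) when the service URL is set, else ("local", None).

    Hypothetical helper: an empty or whitespace-only value counts as unset.
    """
    env = os.environ if environ is None else environ
    url = env.get(env_var, "").strip()
    if url:
        return ("remote", url)
    return ("local", None)


if __name__ == "__main__":
    # Placeholder URL for illustration only, not a real service address.
    demo_env = {"FFMPEG_SERVICE_URL": "https://ffmpeg.example.invalid"}
    print(resolve_backend("FFMPEG_SERVICE_URL", demo_env))
    print(resolve_backend("WHISPER_SERVICE_URL", demo_env))
```

Passing the environment in as a mapping keeps the check testable without mutating os.environ.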

ffmpeg-worker container: 1G/0.5CPU → 256M/0.25CPU (HTTP dispatcher only)
whisper-worker container: 2G/0.5CPU → 512M/0.25CPU (HTTP dispatcher only)

Expected outcome: ffmpeg-worker drops from 51% CPU / 97% RAM to < 5% CPU.
Server load avg should fall from ~2.2 to ~1.0-1.3.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-30 14:28:58 +01:00
Vadym Samoilenko
5e55d9f27a fix(deploy): add reservations to workers in optical-dev — prevent limit < reservation OOM error
The base whisper-worker config reserves 4G of memory, but the optical-dev
override sets the limit to 2G, so Docker rejects the service because the
limit is below the reservation. Added explicit reservations to all three
pipeline workers.
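
The limit/reservation mismatch can be caught before deploy with a check along these lines. This is a sketch only: parse_mem and limit_covers_reservation are hypothetical helpers, and the parser handles only simple sizes such as "512M" or "2G", not every form Compose accepts.

```python
# Binary multipliers for the single-letter size suffixes used in this log.
_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_mem(value: str) -> int:
    """Convert a size string such as "512M", "2G" or "2gb" to bytes."""
    text = value.strip().lower()
    # Accept the two-letter forms "kb", "mb", "gb" as aliases.
    if len(text) > 1 and text.endswith("b") and text[-2] in "kmg":
        text = text[:-1]
    unit = text[-1]
    if unit in _UNITS:
        return int(float(text[:-1]) * _UNITS[unit])
    return int(text)  # a bare number is taken as bytes


def limit_covers_reservation(limit: str, reservation: str) -> bool:
    """True when the memory limit is at least as large as the reservation."""
    return parse_mem(limit) >= parse_mem(reservation)


if __name__ == "__main__":
    # The whisper-worker case from this commit: 2G limit vs 4G reservation.
    print(limit_covers_reservation("2G", "4G"))
```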

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 12:07:52 +01:00
Vadym Samoilenko
c1948ea198 feat(ux): T-2/PR-7/PR-8 — status color helper, queue stats widget, upload-final-VTT override
T-2: Extract getJobStatusColor() into utils/jobStatusMessages.ts; StatusBadge now uses the
     shared helper (single source of truth for badge colors).

PR-7: GET /admin/production/queue-stats — returns Celery queue depths via Redis LLEN.
      Production dashboard shows a live panel (10s refresh) with per-queue task counts.
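
A rough sketch of the LLEN-based depth lookup, assuming each Celery queue is stored as a Redis list keyed by the queue name (the default for the Redis broker without priorities). The function name queue_depths and the queue names are assumptions; FakeRedis is a stand-in so the example runs without a server. A redis-py client exposes the same llen(name) method.

```python
from typing import Dict, Iterable, List, Mapping, Protocol


class SupportsLlen(Protocol):
    """Anything with a Redis-style llen(name) method."""

    def llen(self, name: str) -> int: ...


def queue_depths(client: SupportsLlen, queues: Iterable[str]) -> Dict[str, int]:
    """Return the number of pending messages per queue via LLEN.

    Only counts messages still waiting in the list; tasks already
    reserved by a worker are not included.
    """
    return {name: int(client.llen(name)) for name in queues}


class FakeRedis:
    """In-memory stand-in for a Redis client, for illustration only."""

    def __init__(self, lists: Mapping[str, List[str]]) -> None:
        self._lists = dict(lists)

    def llen(self, name: str) -> int:
        # Like Redis, a missing key has length 0.
        return len(self._lists.get(name, []))


if __name__ == "__main__":
    fake = FakeRedis({"ffmpeg": ["t1", "t2"], "whisper": ["t3"]})
    print(queue_depths(fake, ["ffmpeg", "tts", "whisper"]))
```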

PR-8: POST /admin/production/jobs/{id}/upload-final-vtt — Production/Admin can upload a
      hand-crafted VTT to bypass AI, writing to GCS and advancing the job to PENDING_QC.
      Upload modal added to FailuresList with language + type (captions/ad) selectors.

docker-compose.optical-dev.yml: enable USE_CELERY_FALLBACK=true, set worker replicas=1
      for all pipeline workers (ffmpeg/tts/whisper) with WORKER_CONCURRENCY=2 so the full
      pipeline runs on the 2-CPU optical-dev server until Cloud Run VPC Connector is ready.

Fix: remove unused effectiveMs variable in TimelinePreview (TS6133).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 11:12:36 +01:00
Vadym Samoilenko
b3ace22009 feat(infra): move heavy workers to Cloud Run Jobs
Heavy pipeline tasks (ingest, translate, render, rerender) now dispatch
to a Cloud Run Job (va-worker) instead of local Celery workers. optical-dev
runs only api + lightweight worker (notify/embed) within its 2-CPU budget.

- backend/app/tasks/runner.py — Cloud Run Job entrypoint
- backend/app/services/cloud_run_dispatch.py — replaces .delay() for heavy tasks
- backend/Dockerfile.cloudrun — Cloud Run worker image (ffmpeg included)
- docker-compose.optical-dev.yml — 2-CPU safe overrides, disables heavy workers
- cloudbuild.yaml — builds va-worker image and updates Cloud Run Job
- deploy-dev.sh — uses 3-file compose, builds only api+worker locally
- routes_jobs, routes_admin_production, ingest_and_ai, translate_and_synthesize
  — all dispatch sites updated to use cloud_run_dispatch.dispatch()

USE_CELERY_FALLBACK=true in .env.local to use Celery locally during dev.
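
The fallback switch might look roughly like the sketch below. The real signature of cloud_run_dispatch.dispatch() is not shown in this commit, so the parameters here (celery_send, cloud_run_send, environ) are hypothetical and the two senders are injected as plain callables.

```python
import os
from typing import Any, Callable, Mapping, Optional, Sequence

Sender = Callable[[str, Sequence[Any]], Any]


def use_celery_fallback(environ: Optional[Mapping[str, str]] = None) -> bool:
    """True when USE_CELERY_FALLBACK is set to a truthy value."""
    env = os.environ if environ is None else environ
    return env.get("USE_CELERY_FALLBACK", "").strip().lower() in {"1", "true", "yes"}


def dispatch(
    task_name: str,
    args: Sequence[Any],
    *,
    celery_send: Sender,
    cloud_run_send: Sender,
    environ: Optional[Mapping[str, str]] = None,
) -> Any:
    """Send a heavy task to Celery when the fallback flag is on, else to Cloud Run."""
    if use_celery_fallback(environ):
        return celery_send(task_name, args)
    return cloud_run_send(task_name, args)


if __name__ == "__main__":
    # Stub senders that only report which path was taken.
    result = dispatch(
        "render",
        ["job-123"],
        celery_send=lambda name, a: ("celery", name),
        cloud_run_send=lambda name, a: ("cloud_run", name),
        environ={"USE_CELERY_FALLBACK": "true"},
    )
    print(result)
```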

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 21:47:10 +01:00