vault backup: 2026-04-30 21:42:22

Vadym Samoilenko 2026-04-30 21:42:22 +01:00
parent 522c794f14
commit 3c2d661732
10 changed files with 420 additions and 37 deletions

@ -19,19 +19,19 @@ tags:
AI SaaS platform for generating accessibility materials (CC, AD, SDH) from video.
- **Directory:** `/Users/ai_leed/Documents/Projects/Oliver/video-accessibility`
- **Branch:** `main`
- **Last commit (local + pushed):** `3bed598` - fix(glossary+jobs): debug logging + AllJobs filter fix
- **Server:** `optical-dev` (Docker Compose)
- **Last commit:** `3bed598` - fix(glossary+jobs): debug logging + AllJobs filter fix
---
## What was done TODAY (2026-04-30)
### Fixed bugs (all in `main`)
### Fixed bugs: in code (in `main`, pushed)
| # | File | Problem | Fix |
|---|------|---------|-----|
| 1 | `tasks/translate_and_synthesize.py` | `UnboundLocalError: job_doc` at line ~976 | Moved `find_one(job_id)` before `gcs_path()` |
| 2 | `migrations/scripts/migration_2026-04-30-000002_fix_status_enum.py` | MongoDB `$jsonSchema` rejected the `cancelled` status | New migration using the `firstBatch` pattern + full list of statuses |
| 3 | `migrations/run.py` | File did not exist | Created a runner with `connect_to_mongo()` |
| 4 | `services/gemini.py` | Old model `gemini-2.5-pro` | Updated to `gemini-3.1-pro-preview` |
| 5 | `core/config.py` + `tasks/tts_synthesis.py` | TTS flash model was outdated | Flash → `gemini-3.1-flash-tts-preview`; Pro stays on `gemini-2.5-pro-preview-tts` |
@ -39,58 +39,74 @@ AI SaaS платформа для генерації accessibility-матері
| 7 | `routes/jobs/JobsList.tsx` | All Jobs showed "no jobs" with the default filters | `useEffect` now always syncs `statusFilter` with the URL parameter (clears it when the param is absent) |
| 8 | `services/glossary_service.py` | Glossary was not applied and the error was swallowed | Detailed debug logging + guard for `source_term_lower=None` + guard for `target_locale=None` |
### Done manually on the server
### Fixed bugs: manually on the server (NOT in code)
```bash
# optical-dev
docker compose up -d --build tts-worker # TTS deadlock: 0/15 → fixed
python -m app.migrations.run # Applied migration_2026-04-30-000002
```
| # | What | Why |
|---|------|-----|
| 9 | `FFMPEG_WORKER_CONCURRENCY=20` → `4` in `.env.production` on optical-dev | OOM crash-loop: 20 prefork × ~120MB = 2.4GB > 1GB container limit. The OS OOM killer killed the process with ExitCode=0, OOMKilled=False |
| 10 | `docker compose up -d --build ffmpeg-worker` on the server | Restart with the new concurrency=4; the worker stabilised and picked up 3 tasks from the queue |
| 11 | `docker compose up -d --build tts-worker` | TTS deadlock: 0/15 → fixed |
| 12 | `python -m app.migrations.run` | Applied migration_2026-04-30-000002 |
### Status of test job `Test5` (`69f3b6d2cde5f3709e55301e`)
### Final state of test job `Test5` (`69f3b6d2cde5f3709e55301e`)
- TTS: EN ✓ (15/15), DE-DE ✓ (13/13), FR-CA ✓ (15/15)
- Video render: **in progress** at the end of the session
- State: `tts_generating` → expected to move to `rendering_qc` or `pending_final_review`
- Video render: **completed**; all 3 `accessible_video.mp4` files are in GCS
- State: **`pending_qc`** ✓
---
## What is NOT resolved: TODO for the next session
### 1. Glossary: root cause not found (PRIORITY)
### 1. API crash-loop: Prometheus port conflict (CRITICAL)
**Symptom:** The glossary is not applied even when the job was created in the correct project with an active glossary.
**Symptom:** The API container keeps restarting. The logs repeat:
```
Failed to start Prometheus server: [Errno 98] Address already in use
```
The TTS initialization messages also repeat, a sign that the API starts over and over.
**What we did:** Added detailed logging to `get_glossary_block_for_job`. The worker logs will now show EXACTLY where `""` is returned.
**Cause:** Several API processes try to bind the same Prometheus port. Possibly:
- Several Uvicorn workers (multi-process mode) each try to start Prometheus
- The previous process did not release the port before the restart
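The `[Errno 98]` in the log is the standard Linux `EADDRINUSE` error. A minimal sketch of how a second bind on an already bound port produces it, using only the standard library; the two sockets stand in for two API processes and this is not the project's Prometheus startup code:

```python
import errno
import socket


def second_bind_error() -> int:
    """Bind one port twice and return the errno raised by the second bind."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))  # same port, second "process"
        except OSError as exc:
            return exc.errno
        return 0  # no conflict (not expected)
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    print(second_bind_error() == errno.EADDRINUSE)
```

If several Uvicorn workers each run the same startup hook, only the first bind can succeed, which would be consistent with the first hypothesis above.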
**Next step:**
**Next steps:**
```bash
# On optical-dev:
docker compose logs api --tail=100 | grep -E "Prometheus|ERROR|Errno"
docker inspect accessible-video_api_1 | grep -A5 RestartPolicy
# Check whether Prometheus is started only in the main process:
grep -r "prometheus" backend/app/ | grep -v ".pyc"
```
### 2. Glossary: root cause not found
**Symptom:** The glossary is not applied even when the job is in the correct project with an active glossary.
**What we did:** Added detailed logging to `get_glossary_block_for_job`. After the deploy, the logs will show EXACTLY where `""` is returned.
**Next steps:**
1. Deploy the new changes (`git pull` + `docker compose up -d --build worker api`)
2. Run a new test job in a project that has a glossary
3. Check the worker logs: `docker compose logs -f worker | grep -i glossary`
4. One of these variants will appear:
3. Check: `docker compose logs -f worker | grep -i glossary`
4. Expected variants:
- `Glossary skip job=X: no project_id` → the job is not linked to a project
- `Glossary skip job=X: project ABC not found` → `project_id` does not match any project (type?)
- `Glossary skip job=X: no active glossary for client Y` → the glossary for this client is not active
- `Glossary skip job=X: no source text` → `_glossary_source_text` is missing (VTT empty?)
- `Glossary skip job=X: project ABC not found` → `project_id` does not match (type?)
- `Glossary skip job=X: no active glossary for client Y` → the glossary is not active
- `Glossary skip job=X: no source text` → `_glossary_source_text` is empty
- `Glossary lookup failed ... traceback` → an exception with the full stack trace
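Once the new logging is deployed, the variants above can be tallied rather than read by eye. A rough sketch, assuming the log lines keep the `Glossary skip job=<id>: <reason>` shape quoted above; `tally_glossary_skips` and the sample lines are hypothetical:

```python
import re
from collections import Counter

# Assumed message shape, taken from the variants listed above.
SKIP_RE = re.compile(r"Glossary skip job=(?P<job>\S+): (?P<reason>.+)$")


def tally_glossary_skips(lines):
    """Count skip reasons, collapsing variable parts such as ids."""
    counts = Counter()
    for line in lines:
        match = SKIP_RE.search(line.strip())
        if not match:
            continue  # not a glossary skip message
        reason = match.group("reason")
        if reason.startswith("project ") and reason.endswith("not found"):
            reason = "project not found"
        elif reason.startswith("no active glossary"):
            reason = "no active glossary"
        counts[reason] += 1
    return counts


sample = [
    "worker | Glossary skip job=1: no project_id",
    "worker | Glossary skip job=2: project ABC not found",
    "worker | Glossary skip job=3: no active glossary for client Y",
    "worker | unrelated line",
]
print(tally_glossary_skips(sample))
```

A single dominant reason across several test jobs would point at one of the branches above without reading every line.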
### 2. Frontend deploy (PRIORITY)
### 3. Frontend deploy (PRIORITY)
Changes in `Dashboard.tsx` (Processing counter) and `JobsList.tsx` (AllJobs filter) are not yet deployed to the server.
Changes in `Dashboard.tsx` and `JobsList.tsx` are **not yet deployed** to the server.
```bash
# On optical-dev:
cd /opt/projects/video-accessibility
git pull
docker compose up -d --build api
# or, if there is a separate frontend build:
./scripts/build-frontend.sh
```
### 3. Check the final state of Test5
Check whether the job finished rendering the video and moved to `pending_final_review` or `completed`.
---
## Architecture notes (important to remember)
@ -98,17 +114,24 @@ docker compose up -d --build api
### 3 separate Celery workers on optical-dev
```yaml
worker: queues default, ingest, notify, render (concurrency=8)
tts-worker: queue tts ONLY (concurrency=10, Cloud Run mode)
ffmpeg-worker: queue ffmpeg (concurrency=20, Cloud Run mode)
worker: queues default, ingest, notify, render (concurrency=8)
tts-worker: queue tts ONLY (concurrency=10, Cloud Run mode)
ffmpeg-worker: queue ffmpeg (concurrency=4 after the OOM fix)
```
> [!warning] Key rule
> If `tts-worker` is not rebuilt, `synthesize_cue_task` hangs in the queue forever (0/N cues). Always rebuild `tts-worker` after changes to `tts_synthesis.py`.
> [!warning] Key rules
> - If `tts-worker` is not rebuilt, `synthesize_cue_task` hangs in the queue (0/N cues)
> - `ffmpeg-worker` concurrency=4: do not raise it without increasing the container memory limit
### Why ffmpeg tasks do NOT go to Cloud Run
`_dispatch_ffmpeg` and `_dispatch_ffprobe` always route through the local Celery `ffmpeg` queue, because freeze segments are local files in `/shared-tmp` that Cloud Run cannot see. Cloud Run is used only for: source video duration, video properties, frame extraction, segment re-encoding.
To verify: `video_renderer.py` lines ~233-300 (`_dispatch_ffmpeg`) and ~711-719 (freeze segment duration is always local).
### `USE_CELERY_FALLBACK=true` on optical-dev
When this is set, tasks go to the local Celery instead of Cloud Run. This is needed for debugging.
When set, all tasks go to the local Celery. `FFMPEG_SERVICE_URL` controls only **individual** ffmpeg calls, not the routing of tasks as a whole.
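The routing rules in this section can be summarised as a small decision function. This is only a sketch of the behaviour described above, not the code in `video_renderer.py`; `choose_backend` and the task-kind strings are hypothetical names:

```python
# Assumption: these two kinds always need local files from /shared-tmp.
LOCAL_ONLY = {"ffmpeg", "ffprobe"}


def choose_backend(task_kind: str, use_celery_fallback: bool) -> str:
    """Return "celery" or "cloud_run" for a task, following the notes above."""
    if task_kind in LOCAL_ONLY:
        return "celery"  # Cloud Run cannot see the local freeze segments
    if use_celery_fallback:
        return "celery"  # USE_CELERY_FALLBACK=true keeps everything local
    return "cloud_run"


print(choose_backend("ffmpeg", use_celery_fallback=False))
print(choose_backend("frame_extraction", use_celery_fallback=False))
```

Putting the local-only check first reflects the note that the ffmpeg dispatch stays local regardless of the fallback flag.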
### MongoDB `$jsonSchema` validator
@ -173,3 +196,9 @@ backend/app/migrations/run.py (new file)
frontend/src/routes/Dashboard.tsx (Processing counter fix)
frontend/src/routes/jobs/JobsList.tsx (AllJobs filter fix)
```
**On the server (only `.env.production`, not in git):**
```
/opt/projects/video-accessibility/.env.production
FFMPEG_WORKER_CONCURRENCY=20 → 4
```

@ -842,3 +842,6 @@ tags: [daily]
- 21:33 | `video-accessibility`
- **Asked:** Debugged production queue API 403 error and memory issues | Reduced FFMPEG_WORKER_CONCURRENCY from 20 to 4 to prevent OOM-kill, allowing worker to process all 3 queues and concatenate segments | Worker configuration
- **Done:** FFMPEG worker memory optimization | FFMPEG_WORKER_CONCURRENCY parameter reduced, worker queue processing enabled | Environment configuration, Worker service
- 21:40 (2min) | `video-accessibility`
- **Asked:** Debug 403 Forbidden error on production queue stats API endpoint.
- **Done:** Identified authentication issue with GET request to /video-accessibility/api/v1/admin/production/queue-stats endpoint.

@ -23,8 +23,8 @@ This 3-hop pattern works for hundreds of articles without vector search.
| [[wiki/tech-patterns/_index\|tech-patterns/]] | Recurring tech stacks: FastAPI, React/Vite, Next.js, Azure AD, AI, Box, One2Edit, Redis/Celery, cost-tracker | 17 |
| [[wiki/architecture/_index\|architecture/]] | Cross-cutting architectural patterns: Docker Compose, multi-agent AI, GCP timeout, RAG, hotfolder, optical-dev deploy, cost-tracker, new-project checklist, troubleshooting playbooks, ADR log, Cloud Run Jobs | 11 |
| [[wiki/client-knowledge/_index\|client-knowledge/]] | Per-client notes for Ford, H&M, L'Oréal, Barclays, Ferrero, 3M | 6 |
| [[wiki/concepts/_index\|concepts/]] | Atomic knowledge extracted from Claude Code sessions | 86 |
| [[wiki/connections/_index\|connections/]] | Cross-cutting insights linking 2+ concepts: FastAPI+Azure AD+Docker trinity, AI→cost-tracker, Apache+Vite basePath, GCP→REST polling, Box+hotfolder, Docker DNS+AdGuard | 9 |
| [[wiki/concepts/_index\|concepts/]] | Atomic knowledge extracted from Claude Code sessions | 89 |
| [[wiki/connections/_index\|connections/]] | Cross-cutting insights linking 2+ concepts: FastAPI+Azure AD+Docker trinity, AI→cost-tracker, Apache+Vite basePath, GCP→REST polling, Box+hotfolder, Docker DNS+AdGuard, Celery prefork×faster_whisper memory stacking | 10 |
| [[wiki/qa/_index\|qa/]] | Filed answers to queries (saved with `--file-back`) | 0 |
| [[wiki/homelab/_index\|homelab/]] | Self-hosted infra: Proxmox install, IOMMU/PCI passthrough, hypervisor setup, budget builds, HP Elitedesk G3, Homarr API + Apps + Boards + Certificates + Integrations + Settings + Tasks + AdGuard + Clock + Docker Stats + Docker Integration + Download Client + Firewall + Proxmox Integration + Radarr + Readarr + Sonarr + Bookmarks + Calendar + Icons + App Widget + Weather + GitHub + Nextcloud + qBittorrent + RSS Feed + Speedtest Tracker + System Health Monitoring + System Resources + Services Map + Media Stack | 42 |
| [[wiki/web-agency/_index\|web-agency/]] | AI-assisted website building & selling: Claude Code, Nanobanana 2, Kling, LaunchPath MCP | 9 |

@ -100,5 +100,9 @@
| [[wiki/concepts/celery-queue-worker-specialization]] | Named Celery queues: only the container consuming that queue processes tasks — fix bugs in specialised workers by rebuilding THAT container | daily/2026-04-30.md | 2026-04-30 |
| [[wiki/concepts/gcs-resumable-upload-pattern]] | Browser → backend creates GCS Resumable Session URI → browser uploads chunks directly to GCS, bypassing LB/Apache; 8 MB chunks, 308=continue, resume via Range header | daily/2026-04-30.md | 2026-04-30 |
| [[wiki/concepts/celery-prefork-pool-startup-memory]] | Celery prefork forks ALL CONCURRENCY workers at startup — CONCURRENCY=20 × 120 MB = 2.4 GB before first task; OOM before any work | daily/2026-04-30.md | 2026-04-30 |
| [[wiki/concepts/sudo-git-clone-root-ownership]] | `sudo git clone` makes all files root-owned — subsequent user `git pull` fails with Permission denied on .git/FETCH_HEAD; fix: chown -R | daily/2026-04-30.md | 2026-04-30 |
| [[wiki/concepts/python-fastapi-module-level-singletons]] | `settings = Settings()` at module import level crashes pytest when env vars aren't set — guard with `@lru_cache` function or lazy `@property` | daily/2026-04-30.md | 2026-04-30 |
<!-- Articles added automatically by compile.py -->
<!-- Format: | [[concepts/slug]] | One-line summary | daily/YYYY-MM-DD.md | date | -->

@ -0,0 +1,82 @@
---
title: "Celery Prefork Pool — All Workers Fork at Startup"
aliases:
- celery-prefork-startup-memory
- celery-concurrency-oom
tags:
- celery
- python
- docker
- memory
- worker
sources:
- "daily/2026-04-30.md"
created: 2026-04-30
updated: 2026-04-30
---
# Celery Prefork Pool — All Workers Fork at Startup
Celery's default `prefork` pool forks **all** `CONCURRENCY` worker processes immediately at startup, not lazily on first task. Each forked process loads the full Python interpreter plus all imports. With `CONCURRENCY=20` and 120 MB per process, that is 2.4 GB of RAM consumed before a single task is processed — enough to OOM-kill a container and stall a pipeline for 15+ minutes while the cause is invisible in application logs.
## Key Points
- `prefork` (the default pool type) forks N processes at `celery worker` start time
- Each process is a full Python interpreter with all imports loaded
- Memory = `CONCURRENCY × per_worker_MB` consumed before any task runs
- OOM manifests as the container being killed, not a Python exception
- `CONCURRENCY=1` is safe but eliminates parallelism — tune with the formula below
## Details
### Safe Concurrency Formula
```
CONCURRENCY = floor(container_memory_MB / per_worker_MB)
```
Measure `per_worker_MB` with:
```bash
# Start one worker, check RSS
celery -A app worker --concurrency=1 &
ps aux | grep celery
```
Common baselines (no heavy ML models):
- Pure Python FastAPI worker: ~60–80 MB
- Worker that imports `faster_whisper`: ~400–800 MB per worker (model loaded per process)
- Worker that imports `torch`: 300–500 MB baseline
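The safe concurrency formula above is easy to wrap in a helper. A minimal sketch; `safe_concurrency` is an illustrative name, and the `headroom_mb` parameter is an added assumption (memory reserved for the parent process and the tasks' own working set), not part of the original formula:

```python
import math


def safe_concurrency(container_memory_mb: float, per_worker_mb: float,
                     headroom_mb: float = 0.0) -> int:
    """floor((container_memory_MB - headroom_MB) / per_worker_MB), at least 1."""
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    usable = container_memory_mb - headroom_mb
    return max(1, math.floor(usable / per_worker_mb))


print(safe_concurrency(1024, 120))
print(safe_concurrency(1024, 120, headroom_mb=512))
```

With a 1 GB limit and about 120 MB per worker, the bare formula gives 8; reserving half the container for task memory lowers it to 4.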
### Alternative Pool Types
| Pool | Startup behaviour | Use case |
|------|------------------|----------|
| `prefork` (default) | All N processes fork immediately | CPU-bound tasks |
| `solo` | Single-process, no fork | Dev / low-memory containers |
| `gevent` / `eventlet` | Green threads, shared process | I/O-bound tasks |
| `threads` | OS threads, shared process | I/O-bound, simpler than gevent |
Switch via `CELERY_POOL=solo` or `--pool=gevent`.
### Stacking with ML Libraries
If a worker imports a model library at module level (e.g. `faster_whisper`, `torch`), that model is loaded into **every** forked process. With `CONCURRENCY=4` and a 400 MB model, startup RAM = 1.6 GB before any inference runs. See the connection article.
### Symptoms
- Container killed within 10–30 seconds of `docker compose up`
- No Python traceback — OOM killer logs in `dmesg` / `docker events`
- `docker stats` shows memory spike to container limit then drop (restart)
- Tasks never start processing; queue builds up
## Related Concepts
- [[wiki/concepts/faster-whisper-startup-memory]] — model loads at startup in each worker process
- [[wiki/connections/celery-prefork-faster-whisper-memory-stacking]] — the combined effect when both apply
- [[wiki/concepts/docker-compose-cpu-limits-env]] — memory limits in Compose override files
- [[wiki/concepts/celery-queue-worker-specialization]] — specialised workers, smaller CONCURRENCY per service
## Sources
- [[daily/2026-04-30.md]] — Session 21:37, ffmpeg-worker OOM diagnosis; CONCURRENCY=20, 2.4 GB pre-task RAM, 15-minute pipeline stall

@ -0,0 +1,97 @@
---
title: "Module-Level Singletons Break pytest — Use Lazy Initialisation"
aliases:
- module-level-settings-pytest
- lazy-singleton-fastapi
- settings-import-time-instantiation
tags:
- python
- fastapi
- pytest
- testing
- pydantic
sources:
- "daily/2026-04-30.md"
created: 2026-04-30
updated: 2026-04-30
---
# Module-Level Singletons Break pytest — Use Lazy Initialisation
Instantiating `Settings()`, `SomeService()`, or any object that reads environment variables at **module import time** causes pytest to fail when those env vars are not set in the test environment — even for tests that never call that module's functions. Python imports all referenced modules on `import`, so `settings = Settings()` at the top of `config.py` runs as soon as any test file imports anything from that package.
## Key Points
- `Settings()` at module level runs at `import` time, not at call time
- pytest imports modules eagerly: a test for `routes/health.py` may trigger `config.py` → `Settings()` → `ValidationError`
- The failure looks like a config error, not a test design problem
- Fix: wrap in `@lru_cache` function or `@property` so instantiation is deferred to first use
- Pydantic `BaseSettings` validation runs in `__init__` — there is no "lazy" mode
## Details
### Anti-Pattern
```python
# config.py ← runs at import time
settings = Settings() # crashes if MONGO_URL not set in test env
# service.py
db_service = DatabaseService() # same problem
```
### Fix 1 — `@lru_cache` function (recommended for FastAPI)
```python
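from functools import lru_cache

from fastapi import APIRouter, Depends

router = APIRouter()

@lru_cache
def get_settings() -> Settings:
    return Settings()  # deferred until the first call, then cached

# Use as FastAPI dependency
@router.get("/")
async def handler(settings: Settings = Depends(get_settings)):
    ...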
```
Tests can override with `app.dependency_overrides[get_settings] = lambda: FakeSettings()`.
### Fix 2 — `@property` on a config holder
```python
class _Config:
    _settings: Settings | None = None

    @property
    def settings(self) -> Settings:
        if self._settings is None:
            self._settings = Settings()
        return self._settings

config = _Config()  # safe: no Settings() call yet
```
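The difference between eager and deferred construction can be seen without FastAPI or pydantic. A stdlib-only sketch; `FakeSettings` is a hypothetical stand-in for the real `Settings` class and `DEMO_MONGO_URL` is an invented variable name:

```python
import os
from functools import lru_cache


class FakeSettings:
    """Stand-in for a BaseSettings subclass: fails if the env var is missing."""

    def __init__(self) -> None:
        self.mongo_url = os.environ["DEMO_MONGO_URL"]  # KeyError if unset


@lru_cache
def get_settings() -> FakeSettings:
    return FakeSettings()  # runs on the first call only, then cached


# Defining get_settings() did not read the environment, so importing this
# module is safe even when DEMO_MONGO_URL is unset.
os.environ["DEMO_MONGO_URL"] = "mongodb://localhost:27017/test"
first = get_settings()
second = get_settings()
print(first is second)
```

In tests, `get_settings.cache_clear()` resets the cached instance between cases that need different environment values.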
### Fix 3 — pytest `monkeypatch` / `.env` file
For tests that genuinely need the real Settings, provide env vars via a `conftest.py`:
```python
import pytest

@pytest.fixture(autouse=True)
def env_vars(monkeypatch):
    monkeypatch.setenv("MONGO_URL", "mongodb://localhost:27017/test")
    monkeypatch.setenv("SECRET_KEY", "test-secret")
```
### Why Python 3.14 Makes This Worse
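At the time of writing, Python 3.14 may lack pre-built wheels for Rust-extension packages (`pydantic-core`, `cryptography`). Neither package has a pure-Python fallback, so the install either builds from source (which needs a Rust toolchain) or resolves to a different version that may behave differently. Constrain the interpreter in `pyproject.toml` to the production version (note that `python = "^3.11"` still allows 3.14, while a range such as `>=3.11,<3.12` does not) and run tests in Docker matching the production Python version.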
## Related Concepts
- [[wiki/concepts/poetry-docker-version-mismatch]] — Poetry / Python version mismatch causing silent failures
- [[wiki/concepts/time-sleep-blocks-asyncio]] — another class of import-time footgun in async FastAPI
## Sources
- [[daily/2026-04-30.md]] — Session 13:36, test suite fixes; module-level Settings() crashes, aiohttp mock pattern

@ -0,0 +1,74 @@
---
title: "sudo git clone Makes Files Root-Owned — User git Pull Fails"
aliases:
- sudo-git-clone-root-files
- git-permission-denied-fetch-head
tags:
- git
- linux
- server
- permissions
sources:
- "daily/2026-04-30.md"
created: 2026-04-30
updated: 2026-04-30
---
# sudo git clone Makes Files Root-Owned — User git Pull Fails
Running `sudo git clone` on a server creates every file and directory — including the entire `.git/` folder — owned by `root`. Any subsequent `git pull` or `git fetch` run as a regular user fails with `Permission denied` on `.git/FETCH_HEAD` (or similar index files), even though the user can read the working tree.
## Key Points
- `sudo git clone` → all files owned by `root:root`
- `git pull` as a non-root user hits a write permission error on `.git/FETCH_HEAD`
- The error message looks like a network or credential issue but is purely a filesystem ownership problem
- Fix: `sudo chown -R $USER:$USER /opt/project`
- Prevention: never use `sudo` for `git clone` unless the repo must be root-owned
## Details
### The Error
```
error: cannot open .git/FETCH_HEAD: Permission denied
```
or
```
fatal: Unable to create '/opt/project/.git/index.lock': Permission denied
```
### Fix
```bash
sudo chown -R $USER:$USER /opt/project
# Verify
ls -la /opt/project/.git/
```
### Prevention
If deploying to `/opt/` or `/srv/` (root-owned dirs), create the directory first, then clone as the service user:
```bash
sudo mkdir -p /opt/project
sudo chown deploy:deploy /opt/project
git clone git@github.com:org/project.git /opt/project
```
Or use `sudo -u deploy git clone ...` to clone as the deploy user directly.
### Why This Happens
`sudo` switches the effective UID to root. `git clone` creates all files with the current effective UID as owner. There is no `--chown` flag on `git clone`, unlike the Dockerfile `COPY` instruction.
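Before running `chown`, it can help to confirm that ownership is really the problem. A small sketch that lists paths not owned by the current user (POSIX only); `foreign_owned` is a hypothetical helper name:

```python
import os


def foreign_owned(root: str, limit: int = 20) -> list[str]:
    """Return up to `limit` paths under root not owned by the current user."""
    uid = os.getuid()
    found: list[str] = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Check the directory itself, then the files inside it.
        candidates = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in candidates:
            if os.lstat(path).st_uid != uid:  # lstat: do not follow symlinks
                found.append(path)
                if len(found) >= limit:
                    return found
    return found
```

For example, `foreign_owned("/opt/project/.git")` returning an empty list would suggest the `Permission denied` error has a different cause.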
## Related Concepts
- [[wiki/concepts/monorepo-deploy-script-pitfall]] — another class of silent git failure during deploys
- [[wiki/concepts/python-service-deployment-dotenv]] — deploy checklist for Python services
## Sources
- [[daily/2026-04-30.md]] — Session 12:11, re-deploy after project folder deletion; sudo git clone footgun discovered

@ -15,5 +15,7 @@
| [[wiki/connections/box-api-hotfolder-pattern]] | Box API ↔ hotfolder daemon — always paired; archive pattern prevents double-processing | 2026-04-27 | 2026-04-27 |
| [[wiki/connections/docker-dns-adguard-split-horizon]] | Docker DNS ↔ AdGuard split-horizon — Docker containers inherit router DNS, not AdGuard; explicit dns: config required | daily/2026-04-28.md | 2026-04-28 |
| [[wiki/connections/celery-prefork-faster-whisper-memory-stacking]] | Celery prefork fork-all ↔ faster_whisper model-at-startup — CONCURRENCY × model_size GB consumed before first task | daily/2026-04-30.md | 2026-04-30 |
<!-- Articles added automatically by compile.py -->
<!-- Format: | [[connections/slug]] | ConceptA ↔ ConceptB | daily/YYYY-MM-DD.md | date | -->

@ -0,0 +1,86 @@
---
title: "Connection: Celery Prefork × faster_whisper — Memory Stacking"
connects:
- "concepts/celery-prefork-pool-startup-memory"
- "concepts/faster-whisper-startup-memory"
sources:
- "daily/2026-04-30.md"
created: 2026-04-30
updated: 2026-04-30
---
# Connection: Celery Prefork × faster_whisper — Memory Stacking
## The Connection
Two independent startup-memory behaviours combine multiplicatively when `faster_whisper` is imported inside a Celery worker module:
1. **Celery prefork** forks ALL `CONCURRENCY` worker processes at `celery worker` start — each is a full Python interpreter with all imports loaded.
2. **faster_whisper** loads the full transcription model into RAM at import time (when `WhisperModel(...)` is called at module level or in a module-level `@worker_init` signal handler).
Result: `CONCURRENCY=4` with a 400 MB Whisper model = **1.6 GB** consumed before the first transcription task is dequeued.
## Key Insight
> Neither behaviour is a bug in isolation — the danger is invisible until they are combined in the same container.
The `faster-whisper-startup-memory` article documents the per-container model loading cost. The `celery-prefork-pool-startup-memory` article documents the per-worker process forking cost. When they stack, the formula becomes:
```
total_startup_RAM = CONCURRENCY × (base_worker_MB + model_size_MB)
```
Example with `large-v3` model (~1.5 GB) and `CONCURRENCY=4`:
```
4 × (80 MB interpreter + 1500 MB model) = 6.3 GB before first task
```
A container with a 4 GB memory limit is OOM-killed before it processes anything.
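The stacking formula and the `large-v3` example can be checked with a few lines. A sketch only; the function names are illustrative and the sizes are the approximate figures quoted above:

```python
def total_startup_ram_mb(concurrency: int, base_worker_mb: float,
                         model_size_mb: float) -> float:
    """CONCURRENCY x (base_worker_MB + model_size_MB)."""
    return concurrency * (base_worker_mb + model_size_mb)


def fits(limit_mb: float, concurrency: int, base_worker_mb: float,
         model_size_mb: float) -> bool:
    """True when the startup footprint stays within the container limit."""
    total = total_startup_ram_mb(concurrency, base_worker_mb, model_size_mb)
    return total <= limit_mb


print(total_startup_ram_mb(4, 80, 1500))  # 6320 MB, about 6.3 GB
print(fits(4096, 4, 80, 1500))            # False: exceeds a 4 GB limit
```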
## Evidence
- Session 21:37 (2026-04-30): ffmpeg-worker with `CONCURRENCY=20`, ~120 MB/process → 2.4 GB, container OOM-killed, 15-minute pipeline stall
- The stall was compounded because Celery silently retries tasks that were in-flight when the worker died, creating a second wave of OOM on restart
## Solutions
### Option A — Reduce concurrency to match model size
```
CONCURRENCY = floor(container_memory_MB / (base_MB + model_MB))
```
### Option B — Separate transcription into its own single-worker container
Keep `CONCURRENCY=1` for the whisper worker, scale by adding containers, not by increasing CONCURRENCY. Each container has exactly one model copy.
### Option C — Load model lazily (inside the task, not at import)
```python
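from faster_whisper import WhisperModel

_model = None  # one copy per worker process, created on first task

@app.task  # `app` is the project's Celery application
def transcribe(audio_path: str) -> str:
    global _model
    if _model is None:
        _model = WhisperModel("large-v3")  # loaded lazily, not at import
    # transcribe() returns (segments generator, info); join into plain text
    segments, _info = _model.transcribe(audio_path)
    return "".join(segment.text for segment in segments)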
```
Downside: the first task in each process pays the load latency (~5–15 s). Subsequent tasks in the same process reuse the loaded model.
### Option D — Use `solo` or `threads` pool
`CELERY_POOL=solo` runs tasks in the main process with no forking — only one model copy regardless of logical concurrency. Appropriate for GPU workers where parallelism is handled at the GPU level.
## Related Concepts
- [[wiki/concepts/celery-prefork-pool-startup-memory]] — Celery fork-all-at-startup behaviour
- [[wiki/concepts/faster-whisper-startup-memory]] — model loaded at container start
- [[wiki/concepts/celery-queue-worker-specialization]] — isolating whisper work to dedicated containers
- [[wiki/concepts/docker-compose-cpu-limits-env]] — setting memory limits in Compose
## Sources
- [[daily/2026-04-30.md]] — Session 21:37, Celery ffmpeg-worker OOM; identified as combined prefork + model-loading issue

@ -1,6 +1,12 @@
# Build Log
## [2026-04-30T23:30:00+01:00] compile | 2026-04-30.md (pass 2)
- Source: daily/2026-04-30.md
- Articles created: [[wiki/concepts/celery-prefork-pool-startup-memory]], [[wiki/concepts/sudo-git-clone-root-ownership]], [[wiki/concepts/python-fastapi-module-level-singletons]], [[wiki/connections/celery-prefork-faster-whisper-memory-stacking]]
- Articles updated: (none)
- Index updates: [[wiki/concepts/_index]] (86→89); [[wiki/connections/_index]] (9→10); [[wiki/_master-index]] (concepts 86→89, connections 9→10)
## [2026-04-30T21:00:00+01:00] compile | 2026-04-30.md
- Source: daily/2026-04-30.md
- Articles created: [[wiki/concepts/pydub-ffmpeg-silent-dependency]], [[wiki/concepts/lameenc-bytearray-gcs-upload]], [[wiki/concepts/apache-mod-alias-proxy-priority]], [[wiki/concepts/faster-whisper-startup-memory]], [[wiki/concepts/celery-redis-queue-flush-on-deterministic-error]], [[wiki/concepts/cline-lm-studio-openai-compatible]], [[wiki/concepts/celery-queue-worker-specialization]], [[wiki/concepts/gcs-resumable-upload-pattern]]