From 3c2d6617324aa52898e997ea7c3a860cb890dae9 Mon Sep 17 00:00:00 2001 From: Vadym Samoilenko Date: Thu, 30 Apr 2026 21:42:22 +0100 Subject: [PATCH] vault backup: 2026-04-30 21:42:22 --- .../Next-Session-Prompt.md | 99 ++++++++++++------- 99 Daily/2026-04-30.md | 3 + wiki/_master-index.md | 4 +- wiki/concepts/_index.md | 4 + .../celery-prefork-pool-startup-memory.md | 82 +++++++++++++++ .../python-fastapi-module-level-singletons.md | 97 ++++++++++++++++++ .../concepts/sudo-git-clone-root-ownership.md | 74 ++++++++++++++ wiki/connections/_index.md | 2 + ...-prefork-faster-whisper-memory-stacking.md | 86 ++++++++++++++++ wiki/log.md | 6 ++ 10 files changed, 420 insertions(+), 37 deletions(-) create mode 100644 wiki/concepts/celery-prefork-pool-startup-memory.md create mode 100644 wiki/concepts/python-fastapi-module-level-singletons.md create mode 100644 wiki/concepts/sudo-git-clone-root-ownership.md create mode 100644 wiki/connections/celery-prefork-faster-whisper-memory-stacking.md diff --git a/01 Projects/video-accessibility/Next-Session-Prompt.md b/01 Projects/video-accessibility/Next-Session-Prompt.md index 479390e..9090522 100644 --- a/01 Projects/video-accessibility/Next-Session-Prompt.md +++ b/01 Projects/video-accessibility/Next-Session-Prompt.md @@ -19,19 +19,19 @@ tags: AI SaaS platform for generating accessibility materials (CC, AD, SDH) from video. 
- **Directory:** `/Users/ai_leed/Documents/Projects/Oliver/video-accessibility` - **Branch:** `main` +- **Latest commit (local + pushed):** `3bed598`: fix(glossary+jobs): debug logging + AllJobs filter fix - **Server:** `optical-dev` (Docker Compose) -- **Latest commit:** `3bed598`: fix(glossary+jobs): debug logging + AllJobs filter fix --- ## What was done TODAY (2026-04-30) -### Fixed bugs (all in `main`) +### Fixed bugs: in code (in `main`, pushed) | # | File | Problem | Solution | |---|------|----------|---------| | 1 | `tasks/translate_and_synthesize.py` | `UnboundLocalError: job_doc` at line ~976 | Moved `find_one(job_id)` before `gcs_path()` | -| 2 | `migrations/scripts/migration_2026-04-30-000002_fix_status_enum.py` | MongoDB `$jsonSchema` rejected the `cancelled` status | New migration using the `firstBatch` pattern + full list of statuses | +| 2 | `migrations/scripts/migration_2026-04-30-000002_fix_status_enum.py` | MongoDB `$jsonSchema` rejected the `cancelled` status | New migration using the `firstBatch` pattern + full list of statuses | | 3 | `migrations/run.py` | File did not exist | Created a runner with `connect_to_mongo()` | | 4 | `services/gemini.py` | Old model `gemini-2.5-pro` | Updated to `gemini-3.1-pro-preview` | | 5 | `core/config.py` + `tasks/tts_synthesis.py` | TTS flash model was outdated | Flash → `gemini-3.1-flash-tts-preview`, Pro stayed on `gemini-2.5-pro-preview-tts` | @@ -39,58 +39,74 @@ AI SaaS platform for generating accessibility mater | 7 | `routes/jobs/JobsList.tsx` | All Jobs showed "no jobs" with the default filters | `useEffect` now always syncs `statusFilter` with the URL parameter (clears it if the param is absent) | | 8 | `services/glossary_service.py` | Glossary was not applied and the error was hidden | Detailed debug logging + guard for `source_term_lower=None` + guard for `target_locale=None` | -### Done manually on the server +### Fixed bugs: manually on the server (NOT in code) -```bash -# optical-dev -docker compose up -d 
--build tts-worker # TTS deadlock: 0/15 → fixed -python -m app.migrations.run # Applied migration_2026-04-30-000002 -``` +| # | What | Why | +|---|-----|------| +| 9 | `FFMPEG_WORKER_CONCURRENCY=20` → `4` in `.env.production` on optical-dev | OOM crash-loop: 20 prefork × ~120MB = 2.4GB > 1GB container limit. The OS OOM killer was killing the process with ExitCode=0, OOMKilled=False | +| 10 | `docker compose up -d --build ffmpeg-worker` on the server | Restart with the new concurrency=4; the worker stabilised and picked up 3 tasks from the queue | +| 11 | `docker compose up -d --build tts-worker` | TTS deadlock: 0/15 → fixed | +| 12 | `python -m app.migrations.run` | Applied migration_2026-04-30-000002 | -### Status of test job `Test5` (`69f3b6d2cde5f3709e55301e`) +### Final state of test job `Test5` (`69f3b6d2cde5f3709e55301e`) - TTS: EN ✓ (15/15), DE-DE ✓ (13/13), FR-CA ✓ (15/15) -- Video render: **in progress** at the end of the session -- State: `tts_generating` → expected to move to `rendering_qc` or `pending_final_review` +- Video render: **completed**, all 3 `accessible_video.mp4` files are in GCS +- State: **`pending_qc`** ✓ --- ## What is NOT resolved: TODO for the next session -### 1. Glossary: cause not found (PRIORITY) +### 1. API crash-loop: Prometheus port conflict (CRITICAL) -**Symptom:** Glossary is not applied even when the job is created in the correct project with an active glossary. +**Symptom:** The API container keeps restarting. The logs repeat: +``` +Failed to start Prometheus server: [Errno 98] Address already in use +``` +TTS initialization messages also repeat, a sign that the API starts over and over. -**What we did:** Added detailed logging to `get_glossary_block_for_job`. The worker logs will now show EXACTLY where `""` is returned. +**Cause:** Several API processes try to bind the same Prometheus port. 
Possibly: +- Several Uvicorn workers (multi-process mode) each try to start Prometheus +- The previous process did not release the port before the restart -**Next step:** +**Next steps:** +```bash +# On optical-dev: +docker compose logs api --tail=100 | grep -E "Prometheus|ERROR|Errno" +docker inspect accessible-video_api_1 | grep -A5 RestartPolicy +# Check whether Prometheus is started only in the main process: +grep -r "prometheus" backend/app/ | grep -v ".pyc" +``` + +### 2. Glossary: cause not found + +**Symptom:** Glossary is not applied even when the job is in the correct project with an active glossary. + +**What we did:** Added detailed logging to `get_glossary_block_for_job`. After deploy, the logs will show EXACTLY where `""` is returned. + +**Next steps:** 1. Deploy the new changes (`git pull` + `docker compose up -d --build worker api`) 2. Run a new test job in a project with a glossary -3. Check the worker logs: `docker compose logs -f worker | grep -i glossary` -4. One of these variants will appear: +3. Check: `docker compose logs -f worker | grep -i glossary` +4. Expected variants: - `Glossary skip job=X: no project_id` → the job is not linked to a project - - `Glossary skip job=X: project ABC not found` → `project_id` does not match any project (type?) - - `Glossary skip job=X: no active glossary for client Y` → the glossary for this client is not active - - `Glossary skip job=X: no source text` → `_glossary_source_text` is missing (empty VTT?) + - `Glossary skip job=X: project ABC not found` → `project_id` does not match (type?) + - `Glossary skip job=X: no active glossary for client Y` → the glossary is not active + - `Glossary skip job=X: no source text` → `_glossary_source_text` is empty - `Glossary lookup failed ... traceback` → exception with the full stack trace -### 2. Frontend deploy (PRIORITY) +### 3. Frontend deploy (PRIORITY) -Changes in `Dashboard.tsx` (Processing counter) and `JobsList.tsx` (AllJobs filter) are not yet deployed to the server. 
+Changes in `Dashboard.tsx` and `JobsList.tsx` are **not yet deployed** to the server. ```bash # On optical-dev: cd /opt/projects/video-accessibility git pull docker compose up -d --build api -# or if there is a separate frontend build: -./scripts/build-frontend.sh ``` -### 3. Check the final state of Test5 - -Check whether the job finished rendering the video and moved to `pending_final_review` or `completed`. - --- ## Architecture notes (important to remember) @@ -98,17 +114,24 @@ docker compose up -d --build api ### 3 separate Celery workers on optical-dev ```yaml -worker: queues default, ingest, notify, render (concurrency=8) -tts-worker: queue tts ONLY (concurrency=10, Cloud Run mode) -ffmpeg-worker: queue ffmpeg (concurrency=20, Cloud Run mode) +worker: queues default, ingest, notify, render (concurrency=8) +tts-worker: queue tts ONLY (concurrency=10, Cloud Run mode) +ffmpeg-worker: queue ffmpeg (concurrency=4 after the OOM fix) ``` -> [!warning] Key rule -> If `tts-worker` is not rebuilt, `synthesize_cue_task` hangs in the queue forever (0/N cues). Always rebuild `tts-worker` after changes to `tts_synthesis.py`. +> [!warning] Key rules +> - If `tts-worker` is not rebuilt, `synthesize_cue_task` hangs in the queue (0/N cues) +> - `ffmpeg-worker` concurrency=4: do not raise it again without increasing the container memory limit + +### Why ffmpeg tasks do NOT go to Cloud Run + +`_dispatch_ffmpeg` and `_dispatch_ffprobe` always route through the local Celery `ffmpeg` queue, because freeze segments are local files in `/shared-tmp` that Cloud Run cannot see. Cloud Run is used only for: source video duration, video properties, frame extraction, segment re-encoding. + +Check: `video_renderer.py` lines ~233-300 (`_dispatch_ffmpeg`) and ~711-719 (freeze segment duration is always local). ### `USE_CELERY_FALLBACK=true` on optical-dev -When this is set, tasks go to the local Celery instead of Cloud Run. This is needed for debugging. 
+When set, all tasks go to the local Celery. `FFMPEG_SERVICE_URL` controls only **individual** ffmpeg calls, not the whole task routing. ### MongoDB `$jsonSchema` validator @@ -173,3 +196,9 @@ backend/app/migrations/run.py (new file) frontend/src/routes/Dashboard.tsx (Processing counter fix) frontend/src/routes/jobs/JobsList.tsx (AllJobs filter fix) ``` + +**On the server (only `.env.production`, not in git):** +``` +/opt/projects/video-accessibility/.env.production + FFMPEG_WORKER_CONCURRENCY=20 → 4 +``` diff --git a/99 Daily/2026-04-30.md b/99 Daily/2026-04-30.md index 96854d4..6f61c73 100644 --- a/99 Daily/2026-04-30.md +++ b/99 Daily/2026-04-30.md @@ -842,3 +842,6 @@ tags: [daily] - 21:33 | `video-accessibility` - **Asked:** Debugged production queue API 403 error and memory issues | Reduced FFMPEG_WORKER_CONCURRENCY from 20 to 4 to prevent OOM-kill, allowing worker to process all 3 queues and concatenate segments | Worker configuration - **Done:** FFMPEG worker memory optimization | FFMPEG_WORKER_CONCURRENCY parameter reduced, worker queue processing enabled | Environment configuration, Worker service +- 21:40 (2min) | `video-accessibility` + - **Asked:** Debug 403 Forbidden error on production queue stats API endpoint. + - **Done:** Identified authentication issue with GET request to /video-accessibility/api/v1/admin/production/queue-stats endpoint. diff --git a/wiki/_master-index.md b/wiki/_master-index.md index a5f05b9..de337bb 100644 --- a/wiki/_master-index.md +++ b/wiki/_master-index.md @@ -23,8 +23,8 @@ This 3-hop pattern works for hundreds of articles without vector search. 
| [[wiki/tech-patterns/_index\|tech-patterns/]] | Recurring tech stacks: FastAPI, React/Vite, Next.js, Azure AD, AI, Box, One2Edit, Redis/Celery, cost-tracker | 17 | | [[wiki/architecture/_index\|architecture/]] | Cross-cutting architectural patterns: Docker Compose, multi-agent AI, GCP timeout, RAG, hotfolder, optical-dev deploy, cost-tracker, new-project checklist, troubleshooting playbooks, ADR log, Cloud Run Jobs | 11 | | [[wiki/client-knowledge/_index\|client-knowledge/]] | Per-client notes for Ford, H&M, L'Oréal, Barclays, Ferrero, 3M | 6 | -| [[wiki/concepts/_index\|concepts/]] | Atomic knowledge extracted from Claude Code sessions | 86 | -| [[wiki/connections/_index\|connections/]] | Cross-cutting insights linking 2+ concepts: FastAPI+Azure AD+Docker trinity, AI→cost-tracker, Apache+Vite basePath, GCP→REST polling, Box+hotfolder, Docker DNS+AdGuard | 9 | +| [[wiki/concepts/_index\|concepts/]] | Atomic knowledge extracted from Claude Code sessions | 89 | +| [[wiki/connections/_index\|connections/]] | Cross-cutting insights linking 2+ concepts: FastAPI+Azure AD+Docker trinity, AI→cost-tracker, Apache+Vite basePath, GCP→REST polling, Box+hotfolder, Docker DNS+AdGuard, Celery prefork×faster_whisper memory stacking | 10 | | [[wiki/qa/_index\|qa/]] | Filed answers to queries (saved with `--file-back`) | 0 | | [[wiki/homelab/_index\|homelab/]] | Self-hosted infra: Proxmox install, IOMMU/PCI passthrough, hypervisor setup, budget builds, HP Elitedesk G3, Homarr API + Apps + Boards + Certificates + Integrations + Settings + Tasks + AdGuard + Clock + Docker Stats + Docker Integration + Download Client + Firewall + Proxmox Integration + Radarr + Readarr + Sonarr + Bookmarks + Calendar + Icons + App Widget + Weather + GitHub + Nextcloud + qBittorrent + RSS Feed + Speedtest Tracker + System Health Monitoring + System Resources + Services Map + Media Stack | 42 | | [[wiki/web-agency/_index\|web-agency/]] | AI-assisted website building & selling: Claude Code, Nanobanana 2, 
Kling, LaunchPath MCP | 9 | diff --git a/wiki/concepts/_index.md b/wiki/concepts/_index.md index eb7380c..1a63d02 100644 --- a/wiki/concepts/_index.md +++ b/wiki/concepts/_index.md @@ -100,5 +100,9 @@ | [[wiki/concepts/celery-queue-worker-specialization]] | Named Celery queues: only the container consuming that queue processes tasks — fix bugs in specialised workers by rebuilding THAT container | daily/2026-04-30.md | 2026-04-30 | | [[wiki/concepts/gcs-resumable-upload-pattern]] | Browser → backend creates GCS Resumable Session URI → browser uploads chunks directly to GCS, bypassing LB/Apache; 8 MB chunks, 308=continue, resume via Range header | daily/2026-04-30.md | 2026-04-30 | +| [[wiki/concepts/celery-prefork-pool-startup-memory]] | Celery prefork forks ALL CONCURRENCY workers at startup — CONCURRENCY=20 × 120 MB = 2.4 GB before first task; OOM before any work | daily/2026-04-30.md | 2026-04-30 | +| [[wiki/concepts/sudo-git-clone-root-ownership]] | `sudo git clone` makes all files root-owned — subsequent user `git pull` fails with Permission denied on .git/FETCH_HEAD; fix: chown -R | daily/2026-04-30.md | 2026-04-30 | +| [[wiki/concepts/python-fastapi-module-level-singletons]] | `settings = Settings()` at module import level crashes pytest when env vars aren't set — guard with `@lru_cache` function or lazy `@property` | daily/2026-04-30.md | 2026-04-30 | + diff --git a/wiki/concepts/celery-prefork-pool-startup-memory.md b/wiki/concepts/celery-prefork-pool-startup-memory.md new file mode 100644 index 0000000..1d50c90 --- /dev/null +++ b/wiki/concepts/celery-prefork-pool-startup-memory.md @@ -0,0 +1,82 @@ +--- +title: "Celery Prefork Pool — All Workers Fork at Startup" +aliases: + - celery-prefork-startup-memory + - celery-concurrency-oom +tags: + - celery + - python + - docker + - memory + - worker +sources: + - "daily/2026-04-30.md" +created: 2026-04-30 +updated: 2026-04-30 +--- + +# Celery Prefork Pool — All Workers Fork at Startup + +Celery's default 
`prefork` pool forks **all** `CONCURRENCY` worker processes immediately at startup, not lazily on first task. Each forked process loads the full Python interpreter plus all imports. With `CONCURRENCY=20` and 120 MB per process, that is 2.4 GB of RAM consumed before a single task is processed — enough to OOM-kill a container and stall a pipeline for 15+ minutes while the cause is invisible in application logs. + +## Key Points + +- `prefork` (the default pool type) forks N processes at `celery worker` start time +- Each process is a full Python interpreter with all imports loaded +- Memory = `CONCURRENCY × per_worker_MB` consumed before any task runs +- OOM manifests as the container being killed, not a Python exception +- `CONCURRENCY=1` is safe but eliminates parallelism — tune with the formula below + +## Details + +### Safe Concurrency Formula + +``` +CONCURRENCY = floor(container_memory_MB / per_worker_MB) +``` + +Measure `per_worker_MB` with: + +```bash +# Start one worker, then read the RSS column (6th column of `ps aux`, in KiB) +celery -A app worker --concurrency=1 & +ps aux | grep celery +``` + +Typical per-worker baselines: +- Pure Python FastAPI worker with no heavy ML models: ~60–80 MB +- Worker that imports `faster_whisper`: ~400–800 MB per worker (model loaded per process) +- Worker that imports `torch`: 300–500 MB baseline + +### Alternative Pool Types + +| Pool | Startup behaviour | Use case | +|------|------------------|----------| +| `prefork` (default) | All N processes fork immediately | CPU-bound tasks | +| `solo` | Single-process, no fork | Dev / low-memory containers | +| `gevent` / `eventlet` | Green threads, shared process | I/O-bound tasks | +| `threads` | OS threads, shared process | I/O-bound, simpler than gevent | + +Switch via `CELERY_POOL=solo` or `--pool=gevent`. + +### Stacking with ML Libraries + +If a worker imports a model library at module level (e.g. `faster_whisper`, `torch`), that model is loaded into **every** forked process. 
With `CONCURRENCY=4` and a 400 MB model, startup RAM = 1.6 GB before any inference runs. See the connection article. + +### Symptoms + +- Container killed within 10–30 seconds of `docker compose up` +- No Python traceback — OOM killer logs in `dmesg` / `docker events` +- `docker stats` shows memory spike to container limit then drop (restart) +- Tasks never start processing; queue builds up + +## Related Concepts + +- [[wiki/concepts/faster-whisper-startup-memory]] — model loads at startup in each worker process +- [[wiki/connections/celery-prefork-faster-whisper-memory-stacking]] — the combined effect when both apply +- [[wiki/concepts/docker-compose-cpu-limits-env]] — memory limits in Compose override files +- [[wiki/concepts/celery-queue-worker-specialization]] — specialised workers, smaller CONCURRENCY per service + +## Sources + +- [[daily/2026-04-30.md]] — Session 21:37, ffmpeg-worker OOM diagnosis; CONCURRENCY=20, 2.4 GB pre-task RAM, 15-minute pipeline stall diff --git a/wiki/concepts/python-fastapi-module-level-singletons.md b/wiki/concepts/python-fastapi-module-level-singletons.md new file mode 100644 index 0000000..e5725cd --- /dev/null +++ b/wiki/concepts/python-fastapi-module-level-singletons.md @@ -0,0 +1,97 @@ +--- +title: "Module-Level Singletons Break pytest — Use Lazy Initialisation" +aliases: + - module-level-settings-pytest + - lazy-singleton-fastapi + - settings-import-time-instantiation +tags: + - python + - fastapi + - pytest + - testing + - pydantic +sources: + - "daily/2026-04-30.md" +created: 2026-04-30 +updated: 2026-04-30 +--- + +# Module-Level Singletons Break pytest — Use Lazy Initialisation + +Instantiating `Settings()`, `SomeService()`, or any object that reads environment variables at **module import time** causes pytest to fail when those env vars are not set in the test environment — even for tests that never call that module's functions. 
Python imports all referenced modules on `import`, so `settings = Settings()` at the top of `config.py` runs as soon as any test file imports anything from that package. + +## Key Points + +- `Settings()` at module level runs at `import` time, not at call time +- pytest imports modules eagerly — a test for `routes/health.py` may trigger `config.py` → `Settings()` → `ValidationError` +- The failure looks like a config error, not a test design problem +- Fix: wrap in `@lru_cache` function or `@property` so instantiation is deferred to first use +- Pydantic `BaseSettings` validation runs in `__init__` — there is no "lazy" mode + +## Details + +### Anti-Pattern + +```python +# config.py ← runs at import time +settings = Settings() # crashes if MONGO_URL not set in test env + +# service.py +db_service = DatabaseService() # same problem +``` + +### Fix 1 — `@lru_cache` function (recommended for FastAPI) + +```python +from functools import lru_cache + +@lru_cache +def get_settings() -> Settings: + return Settings() + +# Use as FastAPI dependency +@router.get("/") +async def handler(settings: Settings = Depends(get_settings)): + ... +``` + +Tests can override with `app.dependency_overrides[get_settings] = lambda: FakeSettings()`. 
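
The deferral that Fix 1 relies on can be seen without FastAPI or pydantic. The following is a minimal sketch using only the standard library; the `Settings` dataclass, `from_env`, and `reset_settings_cache` are hypothetical stand-ins for illustration, not names from the project.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Settings:
    """Hypothetical stand-in for a pydantic BaseSettings class."""

    mongo_url: str

    @classmethod
    def from_env(cls) -> "Settings":
        # Raises KeyError when the variable is missing, playing the role of
        # the validation error raised inside BaseSettings.__init__.
        return cls(mongo_url=os.environ["MONGO_URL"])


@lru_cache
def get_settings() -> Settings:
    # Nothing is read at import time; the environment is consulted on the
    # first call only, and later calls return the cached object.
    return Settings.from_env()


def reset_settings_cache() -> None:
    # Hypothetical helper: call between tests that change the environment.
    get_settings.cache_clear()
```

Importing a module written this way succeeds even when `MONGO_URL` is unset, because no `Settings` object exists until `get_settings()` is first called. The cache also means a test that changes the environment has to clear it, which is why a reset helper of this kind is usually worth having.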
+ +### Fix 2 — `@property` on a config holder + +```python +class _Config: + _settings: Settings | None = None + + @property + def settings(self) -> Settings: + if self._settings is None: + self._settings = Settings() + return self._settings + +config = _Config() # safe — no Settings() call yet +``` + +### Fix 3 — pytest `monkeypatch` / `.env` file + +For tests that genuinely need the real Settings, provide env vars via a `conftest.py` fixture. Fixtures run after test modules are imported, so this works only once `Settings()` is no longer created at import time: + +```python +import pytest + +@pytest.fixture(autouse=True) +def env_vars(monkeypatch): + monkeypatch.setenv("MONGO_URL", "mongodb://localhost:27017/test") + monkeypatch.setenv("SECRET_KEY", "test-secret") +``` + +### Why Python 3.14 Makes This Worse + +When a Rust-extension package (`pydantic-core`, `cryptography`) has no pre-built wheel for the interpreter in use, as can happen on a newly released Python such as 3.14, the installer falls back to a source build that needs a Rust toolchain and typically fails without one. These packages have no pure-Python fallback. Note that `python = "^3.11"` in `pyproject.toml` allows any 3.x release from 3.11 upwards, so it does not keep 3.14 out; constrain the range (for example `>=3.11,<3.12`) and run tests in Docker matching the production Python version. 
+ +## Related Concepts + +- [[wiki/concepts/poetry-docker-version-mismatch]] — Poetry / Python version mismatch causing silent failures +- [[wiki/concepts/time-sleep-blocks-asyncio]] — another class of import-time footgun in async FastAPI + +## Sources + +- [[daily/2026-04-30.md]] — Session 13:36, test suite fixes; module-level Settings() crashes, aiohttp mock pattern diff --git a/wiki/concepts/sudo-git-clone-root-ownership.md b/wiki/concepts/sudo-git-clone-root-ownership.md new file mode 100644 index 0000000..8e87d87 --- /dev/null +++ b/wiki/concepts/sudo-git-clone-root-ownership.md @@ -0,0 +1,74 @@ +--- +title: "sudo git clone Makes Files Root-Owned — User git Pull Fails" +aliases: + - sudo-git-clone-root-files + - git-permission-denied-fetch-head +tags: + - git + - linux + - server + - permissions +sources: + - "daily/2026-04-30.md" +created: 2026-04-30 +updated: 2026-04-30 +--- + +# sudo git clone Makes Files Root-Owned — User git Pull Fails + +Running `sudo git clone` on a server 
creates every file and directory — including the entire `.git/` folder — owned by `root`. Any subsequent `git pull` or `git fetch` run as a regular user fails with `Permission denied` on `.git/FETCH_HEAD` (or similar index files), even though the user can read the working tree. + +## Key Points + +- `sudo git clone` → all files owned by `root:root` +- `git pull` as a non-root user hits a write permission error on `.git/FETCH_HEAD` +- The error message looks like a network or credential issue but is purely a filesystem ownership problem +- Fix: `sudo chown -R $USER:$USER /opt/project` +- Prevention: never use `sudo` for `git clone` unless the repo must be root-owned + +## Details + +### The Error + +``` +error: cannot open .git/FETCH_HEAD: Permission denied +``` + +or + +``` +fatal: Unable to create '/opt/project/.git/index.lock': Permission denied +``` + +### Fix + +```bash +sudo chown -R $USER:$USER /opt/project +# Verify +ls -la /opt/project/.git/ +``` + +### Prevention + +If deploying to `/opt/` or `/srv/` (root-owned dirs), create the directory first, then clone as the service user: + +```bash +sudo mkdir -p /opt/project +sudo chown deploy:deploy /opt/project +# Run the clone as the deploy user, without sudo +git clone git@github.com:org/project.git /opt/project +``` + +Or use `sudo -u deploy git clone ...` to clone as the deploy user directly. + +### Why This Happens + +`sudo` switches the effective UID to root. `git clone` creates all files with the current effective UID as owner. There is no `--chown` flag on `git clone`, unlike the Dockerfile `COPY --chown` instruction. 
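
Because the symptom only appears on the next `git pull`, it can help to check ownership right after a deploy. The sketch below is one way to do that on a Unix host; the function name `foreign_owned_paths` and its parameters are hypothetical, chosen for illustration.

```python
import os
from typing import List, Optional


def foreign_owned_paths(repo_root: str, uid: Optional[int] = None) -> List[str]:
    """List paths under repo_root (including .git/) not owned by uid.

    uid defaults to the user running the script. Unix only.
    """
    expected_uid = os.getuid() if uid is None else uid
    mismatched = []
    for dirpath, dirnames, filenames in os.walk(repo_root):
        # os.walk does not descend into symlinked directories, so check
        # those entries here together with the regular files.
        symlinked_dirs = [
            d for d in dirnames if os.path.islink(os.path.join(dirpath, d))
        ]
        candidates = [dirpath] + [
            os.path.join(dirpath, name) for name in filenames + symlinked_dirs
        ]
        for path in candidates:
            # lstat() reports the owner of a symlink itself, not its target.
            if os.lstat(path).st_uid != expected_uid:
                mismatched.append(path)
    return mismatched
```

An empty result for the deploy user suggests `git pull` will not trip over ownership; any entry under `.git/` in the result is a candidate for the `chown -R` fix described above.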
+ +## Related Concepts + +- [[wiki/concepts/monorepo-deploy-script-pitfall]] — another class of silent git failure during deploys +- [[wiki/concepts/python-service-deployment-dotenv]] — deploy checklist for Python services + +## Sources + +- [[daily/2026-04-30.md]] — Session 12:11, re-deploy after project folder deletion; sudo git clone footgun discovered diff --git a/wiki/connections/_index.md b/wiki/connections/_index.md index 9852317..4176a0d 100644 --- a/wiki/connections/_index.md +++ b/wiki/connections/_index.md @@ -15,5 +15,7 @@ | [[wiki/connections/box-api-hotfolder-pattern]] | Box API ↔ hotfolder daemon — always paired; archive pattern prevents double-processing | 2026-04-27 | 2026-04-27 | | [[wiki/connections/docker-dns-adguard-split-horizon]] | Docker DNS ↔ AdGuard split-horizon — Docker containers inherit router DNS, not AdGuard; explicit dns: config required | daily/2026-04-28.md | 2026-04-28 | +| [[wiki/connections/celery-prefork-faster-whisper-memory-stacking]] | Celery prefork fork-all ↔ faster_whisper model-at-startup — CONCURRENCY × model_size GB consumed before first task | daily/2026-04-30.md | 2026-04-30 | + diff --git a/wiki/connections/celery-prefork-faster-whisper-memory-stacking.md b/wiki/connections/celery-prefork-faster-whisper-memory-stacking.md new file mode 100644 index 0000000..53e916c --- /dev/null +++ b/wiki/connections/celery-prefork-faster-whisper-memory-stacking.md @@ -0,0 +1,86 @@ +--- +title: "Connection: Celery Prefork × faster_whisper — Memory Stacking" +connects: + - "concepts/celery-prefork-pool-startup-memory" + - "concepts/faster-whisper-startup-memory" +sources: + - "daily/2026-04-30.md" +created: 2026-04-30 +updated: 2026-04-30 +--- + +# Connection: Celery Prefork × faster_whisper — Memory Stacking + +## The Connection + +Two independent startup-memory behaviours combine multiplicatively when `faster_whisper` is imported inside a Celery worker module: + +1. 
**Celery prefork** forks ALL `CONCURRENCY` worker processes at `celery worker` start — each is a full Python interpreter with all imports loaded. +2. **faster_whisper** loads the full transcription model into RAM at import time (when `WhisperModel(...)` is called at module level or in a module-level `@worker_init` signal handler). + +Result: `CONCURRENCY=4` with a 400 MB Whisper model = **1.6 GB** consumed before the first transcription task is dequeued. + +## Key Insight + +> Neither behaviour is a bug in isolation — the danger is invisible until they are combined in the same container. + +The `faster-whisper-startup-memory` article documents the per-container model loading cost. The `celery-prefork-pool-startup-memory` article documents the per-worker process forking cost. When they stack, the formula becomes: + +``` +total_startup_RAM = CONCURRENCY × (base_worker_MB + model_size_MB) +``` + +Example with `large-v3` model (~1.5 GB) and `CONCURRENCY=4`: + +``` +4 × (80 MB interpreter + 1500 MB model) = 6.3 GB before first task +``` + +A container with a 4 GB memory limit is OOM-killed before it processes anything. + +## Evidence + +- Session 21:37 (2026-04-30): ffmpeg-worker with `CONCURRENCY=20`, ~120 MB/process → 2.4 GB, container OOM-killed, 15-minute pipeline stall +- The stall was compounded because Celery silently retries tasks that were in-flight when the worker died, creating a second wave of OOM on restart + +## Solutions + +### Option A — Reduce concurrency to match model size + +``` +CONCURRENCY = floor(container_memory_MB / (base_MB + model_MB)) +``` + +### Option B — Separate transcription into its own single-worker container + +Keep `CONCURRENCY=1` for the whisper worker, scale by adding containers, not by increasing CONCURRENCY. Each container has exactly one model copy. 
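
The two formulas in this article, `total_startup_RAM` and the Option A concurrency bound, can be written as a small calculator. This is a sketch under the article's own simplification that every process holds a full private copy of the interpreter and the model; the function names are hypothetical, and no headroom is reserved for memory used while tasks run.

```python
import math


def startup_ram_mb(concurrency: int, base_mb: float, model_mb: float = 0.0) -> float:
    """total_startup_RAM = CONCURRENCY x (base_worker_MB + model_size_MB)."""
    return concurrency * (base_mb + model_mb)


def safe_concurrency(container_mb: float, base_mb: float, model_mb: float = 0.0) -> int:
    """CONCURRENCY = floor(container_memory_MB / (base_MB + model_MB)).

    A result of 0 means not even one worker fits in the container.
    """
    per_worker_mb = base_mb + model_mb
    if per_worker_mb <= 0:
        raise ValueError("per-worker memory must be positive")
    return max(0, math.floor(container_mb / per_worker_mb))
```

With the article's figures, `startup_ram_mb(4, 80, 1500)` gives 6320 MB (the 6.3 GB above) and `safe_concurrency(4096, 80, 1500)` gives 2 for a 4 GB container. In practice a lower value than the bound is prudent, since tasks allocate memory of their own.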
+ +### Option C — Load model lazily (inside the task, not at import) + +```python +from faster_whisper import WhisperModel + +_model = None # loaded on first task, once per worker process + +@app.task # `app` is the project's Celery instance +def transcribe(audio_path: str) -> str: + global _model + if _model is None: + _model = WhisperModel("large-v3") + # transcribe() returns a lazy segment generator plus metadata; + # join the text so the task result is serialisable + segments, _info = _model.transcribe(audio_path) + return " ".join(segment.text.strip() for segment in segments) +``` + +Downside: first task in each process pays the load latency (~5–15 s). Subsequent tasks in the same process reuse the loaded model. + +### Option D — Use `solo` or `threads` pool + +`CELERY_POOL=solo` runs tasks in the main process with no forking — only one model copy regardless of logical concurrency. Appropriate for GPU workers where parallelism is handled at the GPU level. 
+ +## Related Concepts + +- [[wiki/concepts/celery-prefork-pool-startup-memory]] — Celery fork-all-at-startup behaviour +- [[wiki/concepts/faster-whisper-startup-memory]] — model loaded at container start +- [[wiki/concepts/celery-queue-worker-specialization]] — isolating whisper work to dedicated containers +- [[wiki/concepts/docker-compose-cpu-limits-env]] — setting memory limits in Compose + +## Sources + +- [[daily/2026-04-30.md]] — Session 21:37, Celery ffmpeg-worker OOM; identified as combined prefork + model-loading issue diff --git a/wiki/log.md b/wiki/log.md index 6800109..c361579 100644 --- a/wiki/log.md +++ b/wiki/log.md @@ -1,6 +1,12 @@ # Build Log +## [2026-04-30T23:30:00+01:00] compile | 2026-04-30.md (pass 2) +- Source: daily/2026-04-30.md +- Articles created: [[wiki/concepts/celery-prefork-pool-startup-memory]], [[wiki/concepts/sudo-git-clone-root-ownership]], [[wiki/concepts/python-fastapi-module-level-singletons]], [[wiki/connections/celery-prefork-faster-whisper-memory-stacking]] +- Articles updated: (none) +- Index updates: [[wiki/concepts/_index]] (86→89); [[wiki/connections/_index]] (9→10); [[wiki/_master-index]] (concepts 86→89, connections 9→10) + ## [2026-04-30T21:00:00+01:00] compile | 2026-04-30.md - Source: daily/2026-04-30.md - Articles created: [[wiki/concepts/pydub-ffmpeg-silent-dependency]], 
[[wiki/concepts/lameenc-bytearray-gcs-upload]], [[wiki/concepts/apache-mod-alias-proxy-priority]], [[wiki/concepts/faster-whisper-startup-memory]], [[wiki/concepts/celery-redis-queue-flush-on-deterministic-error]], [[wiki/concepts/cline-lm-studio-openai-compatible]], [[wiki/concepts/celery-queue-worker-specialization]], [[wiki/concepts/gcs-resumable-upload-pattern]]