obsidian/wiki/concepts/celery-prefork-pool-startup-memory.md
2026-05-01 09:38:54 +01:00

title: Celery Prefork Pool — All Workers Fork at Startup
aliases: celery-prefork-startup-memory, celery-concurrency-oom
tags: celery, python, docker, memory, worker
sources: daily/2026-04-30.md
created: 2026-04-30
updated: 2026-04-30

Celery Prefork Pool — All Workers Fork at Startup

Celery's default prefork pool forks all CONCURRENCY worker processes immediately at startup, not lazily on first task. Each forked process loads the full Python interpreter plus all imports. With CONCURRENCY=20 and 120 MB per process, that is 2.4 GB of RAM consumed before a single task is processed — enough to OOM-kill a container and stall a pipeline for 15+ minutes while the cause is invisible in application logs.

Key Points

  • prefork (the default pool type) forks N processes at celery worker start time
  • Each process is a full Python interpreter with all imports loaded
  • Memory = CONCURRENCY × per_worker_MB consumed before any task runs
  • OOM manifests as the container being killed, not a Python exception
  • CONCURRENCY=1 is safe but eliminates parallelism — tune with the formula below
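
The memory arithmetic in the key points can be sketched as a small check. This is a minimal illustration, not part of any Celery API: the function names and the 2048 MB container limit are hypothetical, and only the 20 × 120 MB figures come from this note.

```python
def startup_memory_mb(concurrency: int, per_worker_mb: float) -> float:
    """Estimate RAM used by the prefork pool before any task runs."""
    return concurrency * per_worker_mb


def fits_in_container(concurrency: int, per_worker_mb: float, limit_mb: float) -> bool:
    """True if the estimated startup footprint stays within the container limit."""
    return startup_memory_mb(concurrency, per_worker_mb) <= limit_mb


# Figures from this note: 20 workers at ~120 MB each.
print(startup_memory_mb(20, 120))  # 2400

# limit_mb=2048 is a hypothetical container limit, not a value from the incident.
print(fits_in_container(20, 120, limit_mb=2048))  # False
```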

Details

Safe Concurrency Formula

CONCURRENCY = floor(container_memory_MB / per_worker_MB)
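
A minimal sketch of the formula as a helper. The `reserve_mb` parameter is an assumption that is not in the original formula: it leaves room for the parent worker process and anything else in the container, and at its default of 0 the function reduces to the formula above. The result is clamped to 1 because a worker needs at least one process, which does not guarantee that one process actually fits.

```python
import math


def safe_concurrency(container_memory_mb: float, per_worker_mb: float,
                     reserve_mb: float = 0.0) -> int:
    """floor((container_memory_MB - reserve_MB) / per_worker_MB), never below 1."""
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    # reserve_mb is a hypothetical headroom allowance, not part of the note's formula.
    usable_mb = max(container_memory_mb - reserve_mb, 0.0)
    return max(1, math.floor(usable_mb / per_worker_mb))


# Hypothetical 2048 MB container with the ~120 MB workers described in this note.
print(safe_concurrency(2048, 120))                  # 17
print(safe_concurrency(2048, 120, reserve_mb=256))  # 14
```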

Measure per_worker_MB with:

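# Start one worker in the background and wait for it to finish importing
celery -A app worker --concurrency=1 &
sleep 10
# RSS is column 6 of `ps aux`, in KiB; the [c] stops grep from matching itself.
# Expect two rows: the parent process and its one pool child.
ps aux | grep '[c]elery'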
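
For a scripted measurement, the same RSS figure can be read from `/proc` on Linux. This is a sketch under that assumption, and the helper names are hypothetical. Note that RSS includes pages shared between forked processes, so multiplying it by CONCURRENCY may overestimate the real total.

```python
import re
from pathlib import Path


def parse_vm_rss_mb(status_text: str) -> float:
    """Extract VmRSS from the text of /proc/<pid>/status and return it in MiB."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1)) / 1024


def rss_mb(pid: int) -> float:
    """Read the resident set size of a live process (Linux only)."""
    return parse_vm_rss_mb(Path(f"/proc/{pid}/status").read_text())


# Sample status text for illustration; real input comes from /proc/<pid>/status.
sample = "Name:\tcelery\nVmPeak:\t  300000 kB\nVmRSS:\t  122880 kB\nThreads:\t1\n"
print(parse_vm_rss_mb(sample))  # 120.0
```

Pass the PID of the pool child (not the parent) to `rss_mb` to get a value for per_worker_MB.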

Common per-worker baselines (approximate):

  • Pure Python FastAPI worker: ~60-80 MB
  • Worker that imports faster_whisper: ~400-800 MB per worker (model loaded per process)
  • Worker that imports torch: ~300-500 MB baseline

Alternative Pool Types

| Pool | Startup behaviour | Use case |
|---|---|---|
| prefork (default) | All N processes fork immediately | CPU-bound tasks |
| solo | Single-process, no fork | Dev / low-memory containers |
| gevent / eventlet | Green threads, shared process | I/O-bound tasks |
| threads | OS threads, shared process | I/O-bound, simpler than gevent |

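Switch pools with the --pool option of celery worker (for example --pool=solo or --pool=gevent), or with CELERY_POOL=solo where the deployment passes that variable through to the worker command.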
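
One way an entrypoint might turn those variables into worker flags is sketched below. It assumes that CELERY_POOL and CONCURRENCY are deployment-specific environment variables rather than settings Celery reads itself, that the app module is named `app` as in the measurement command, and that a missing CONCURRENCY should fall back to a conservative 1. The helper name is hypothetical; `--pool` and `--concurrency` are real `celery worker` options.

```python
def build_worker_argv(env: dict[str, str]) -> list[str]:
    """Translate CELERY_POOL / CONCURRENCY variables into `celery worker` flags."""
    pool = env.get("CELERY_POOL", "prefork")    # prefork is Celery's default pool
    concurrency = env.get("CONCURRENCY", "1")   # assumed conservative fallback
    argv = ["celery", "-A", "app", "worker", f"--pool={pool}"]
    # The solo pool runs tasks in the worker process itself, so no concurrency flag.
    if pool != "solo":
        argv.append(f"--concurrency={int(concurrency)}")
    return argv


print(" ".join(build_worker_argv({"CELERY_POOL": "gevent", "CONCURRENCY": "50"})))
# celery -A app worker --pool=gevent --concurrency=50
```

In a container entrypoint the function would be called with `dict(os.environ)` and the result passed to `os.execvp`.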

Stacking with ML Libraries

If a worker imports a model library at module level (e.g. faster_whisper, torch), that model is loaded into every forked process. With CONCURRENCY=4 and a 400 MB model, startup RAM = 1.6 GB before any inference runs. See the connection article.

Symptoms

  • Container killed within 10-30 seconds of docker compose up
  • No Python traceback — OOM killer logs in dmesg / docker events
  • docker stats shows memory spike to container limit then drop (restart)
  • Tasks never start processing; queue builds up
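
Because no traceback is produced, confirming an OOM kill usually means asking Docker. The sketch below parses the JSON printed by `docker inspect <container>` and checks the `State.OOMKilled` field. The function name is hypothetical, and the flag may no longer reflect the previous run once the container has been restarted, so treat a False result with caution during a restart loop.

```python
import json


def was_oom_killed(inspect_output: str) -> bool:
    """Return True if `docker inspect` output reports State.OOMKilled."""
    containers = json.loads(inspect_output)  # docker inspect prints a JSON array
    if not containers:
        raise ValueError("docker inspect returned no containers")
    state = containers[0].get("State", {})
    return bool(state.get("OOMKilled", False))


# Trimmed sample for illustration; real output contains many more fields.
sample = '[{"State": {"Status": "exited", "OOMKilled": true, "ExitCode": 137}}]'
print(was_oom_killed(sample))  # True
```

The input can be captured with `subprocess.run(["docker", "inspect", name], capture_output=True, text=True).stdout`, where `name` is the container name or ID.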

Real Incident (2026-04-30)

The ffmpeg-worker container was configured with CONCURRENCY=20 at ~120 MB per forked process, so the pool needed about 2.4 GB at startup, before any task was processed. The container hit its memory limit and was OOM-killed by Docker within seconds of docker compose up. The pipeline stalled for 15 minutes because the cause was invisible in application logs: there was no Python traceback, only a container restart loop. Diagnosis: docker stats showed memory spiking to the limit and then dropping immediately, repeating every ~30 seconds. Fix: reduce CONCURRENCY to floor(container_memory_MB / per_worker_MB).

Sources

  • daily/2026-04-30.md — Session 21:37, ffmpeg-worker OOM diagnosis; CONCURRENCY=20, 2.4 GB pre-task RAM, 15-minute pipeline stall