obsidian/wiki/concepts/systemd-memory-oom-policy.md
2026-04-26 21:10:46 +01:00

title: systemd — MemoryMax, OOMPolicy, and Auto-Restart for Memory-Constrained Services
aliases: systemd-oom, systemd-memory-limit, systemd-oom-policy, oom-continue-restart
tags: systemd, linux, oom, memory, lxc, homelab, proxmox, service-management
sources: daily/2026-04-24.md
created: 2026-04-24
updated: 2026-04-24

systemd — MemoryMax, OOMPolicy, and Auto-Restart for Memory-Constrained Services

When the Linux OOM (Out-Of-Memory) killer kills a process of a systemd-managed service, the defaults (OOMPolicy=stop, Restart=no) make systemd stop the unit, mark it failed, and leave it down. A memory spike that would otherwise be transient then results in a dead service until someone intervenes manually. Adding three directives (MemoryMax, OOMPolicy=continue, and RestartSec) alongside Restart=always turns this into automatic recovery.

Key Points

  • Default OOMPolicy=stop: when the OOM killer kills any process of the service, systemd logs the event and stops the whole unit, which ends in failed state (result oom-kill); whether it is started again depends on Restart=, which defaults to no
  • OOMPolicy=continue: systemd logs the OOM kill but leaves the unit's remaining processes running; if the main process was the one killed, the service exits and Restart= decides whether it is started again
  • MemoryMax=: sets a hard memory ceiling for the service's cgroup; when usage reaches the limit and cannot be reclaimed, the OOM killer picks its victim from this cgroup, instead of a system-wide OOM event that may kill processes of other services
  • RestartSec=10: waits 10 seconds before each restart (the default is 100 ms), which slows down restart loops if the OOM condition persists
  • This pattern is safe for idempotent services (web servers, API backends, daemons) that can be restarted without data loss

Details

The Three-Directive Pattern

[Service]
# Hard memory ceiling — OOM fires within the service's cgroup
MemoryMax=7G

# Log the OOM kill but do not stop the unit; if the main process died, Restart= brings it back
OOMPolicy=continue

# Always restart if the process exits for any reason
Restart=always

# Wait 10 seconds before restart — prevents hot restart loops
RestartSec=10

MemoryMax vs MemoryHigh:

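  • MemoryHigh is a soft limit: once usage goes above it, the kernel throttles the service's processes and reclaims their memory aggressively, but does not kill them
  • MemoryMax is a hard limit: if usage reaches it and cannot be reduced by reclaim, the OOM killer is invoked inside the service's cgroup
  • Use MemoryMax alone when you want predictable behavior at a ceiling; set MemoryHigh below MemoryMax as well for graduated throttling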
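The graduated-throttling combination can be sketched as a small generator for a drop-in. This is only an illustration: the helper name `render_memory_dropin` and the 6G soft limit are assumptions, not values from the original unit. MemoryHigh= and MemoryMax= are the actual systemd directives.

```shell
#!/bin/sh
# Hypothetical helper: print a [Service] drop-in that throttles at a soft
# limit (MemoryHigh) before the hard ceiling (MemoryMax) is reached.
# Usage: render_memory_dropin <MemoryHigh> <MemoryMax>
render_memory_dropin() {
    high="$1"
    max="$2"
    printf '[Service]\n'
    printf '# Soft limit: throttling and heavy reclaim above this value\n'
    printf 'MemoryHigh=%s\n' "$high"
    printf '# Hard limit: the OOM killer runs inside the cgroup at this value\n'
    printf 'MemoryMax=%s\n' "$max"
}

# Example values are assumptions; keep MemoryHigh somewhat below MemoryMax.
render_memory_dropin 6G 7G
```

To apply it, the output could be redirected into a drop-in such as /etc/systemd/system/immich-web.service.d/memory.conf, followed by systemctl daemon-reload.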

OOMPolicy options:

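| Value | Behavior |
|---|---|
| stop | systemd logs the OOM kill and stops the whole unit, which ends in failed state (default) |
| continue | systemd logs the OOM kill and leaves the unit running; a killed main process is handled per the Restart= policy |
| kill | The kernel kills all remaining processes in the unit's cgroup (memory.oom.group=1), and the unit ends in failed state |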

Applying to an Existing Unit

Prefer a drop-in override to modifying the unit file directly — easier to revert and survives package upgrades:

# Create a drop-in directory for the service
mkdir -p /etc/systemd/system/immich-web.service.d/

# Write the override
cat > /etc/systemd/system/immich-web.service.d/memory.conf << 'EOF'
[Service]
MemoryMax=7G
OOMPolicy=continue
Restart=always
RestartSec=10
EOF

# Apply
systemctl daemon-reload
systemctl restart immich-web

Verifying the Directives Loaded

systemctl show immich-web --property=MemoryMax,OOMPolicy,Restart,RestartUSec
# Expected:
# MemoryMax=7516192768  (7 * 1024^3)
# OOMPolicy=continue
# Restart=always
# RestartUSec=10s  (systemctl show prints time spans in human-readable form)

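If MemoryMax=infinity is shown instead (older systemd versions print 18446744073709551615, the max uint64), the directive didn't load. Check that daemon-reload ran and that the drop-in file is in the directory matching the unit name.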
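The check above can be scripted. The sketch below assumes the Key=Value output format of systemctl show; both helper names (`gib_to_bytes`, `property_is`) are hypothetical.

```shell
#!/bin/sh
# Convert a GiB count to bytes, matching how systemd reads the "G" suffix
# (base 1024).
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Hypothetical helper: read "Key=Value" lines (as printed by `systemctl show`)
# on stdin and succeed only if one line is exactly KEY=WANT.
# Usage: systemctl show UNIT --property=KEY | property_is KEY WANT
property_is() {
    key="$1"
    want="$2"
    grep -qxF -- "${key}=${want}"
}

# Example (needs a host where the unit exists):
#   systemctl show immich-web --property=MemoryMax \
#       | property_is MemoryMax "$(gib_to_bytes 7)" && echo "limit loaded"
```

Comparing whole lines with grep -x avoids a false match on a value that merely starts with the expected digits.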

Diagnosing an OOM Kill

# Check if a service was OOM killed
journalctl -u immich-web -n 50 | grep -i oom
# Look for: "A process of this unit has been killed by the OOM killer" or "Failed with result 'oom-kill'"

# System-level OOM events
dmesg | grep -i "oom\|killed process"

# Check current memory usage of the service
systemctl status immich-web
# Shows e.g.: Memory: X.XG (max: 7.0G ...); older systemd versions print (limit: 7.0G)
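On hosts using cgroup v2, the kernel also keeps a per-cgroup oom_kill counter in the memory.events file. The sketch below reads it. It assumes the unified hierarchy is mounted at /sys/fs/cgroup and that the unit lives in system.slice; the helper name `oom_kill_count` is hypothetical. The counter belongs to the current cgroup, so it likely starts again from zero after the service is restarted.

```shell
#!/bin/sh
# Hypothetical helper: print the oom_kill counter from a cgroup v2
# memory.events file (lines look like "oom_kill 2"). Prints 0 if the key
# is absent.
oom_kill_count() {
    awk '$1 == "oom_kill" { n = $2 } END { print n + 0 }' "$1"
}

# Assumed path: cgroup v2 at /sys/fs/cgroup, unit placed in system.slice.
events=/sys/fs/cgroup/system.slice/immich-web.service/memory.events
if [ -r "$events" ]; then
    echo "OOM kills recorded for this cgroup: $(oom_kill_count "$events")"
fi
```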

When to Use This Pattern

Appropriate for:

  • Web servers / API backends — stateless, can restart cleanly
  • Photo services (Immich) — ML processing creates RAM spikes; restart is safe
  • Background daemons — should be self-healing in production

Not appropriate for:

  • Databases — OOM mid-write can corrupt data; use MemoryMax + OOMPolicy=stop + alerting instead
  • Services holding locks — restart may leave stale locks
  • Services where a restart loop would cause an outage (consider StartLimitIntervalSec and StartLimitBurst to cap restart attempts)
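For the database case, the MemoryMax + OOMPolicy=stop + alerting alternative might look like the sketch below. OnFailure= is a real [Unit] directive, but the template unit oom-alert@.service, the helper name `render_no_restart_dropin`, and the 4G limit are all assumptions made for illustration; the alert unit would have to be written separately.

```shell
#!/bin/sh
# Hypothetical helper: print a drop-in for a service that should stay down
# after an OOM kill and trigger an alert unit instead of restarting.
# "oom-alert@.service" is an assumed template unit, not shipped with systemd.
# Usage: render_no_restart_dropin <MemoryMax>
render_no_restart_dropin() {
    max="$1"
    cat <<EOF
[Unit]
OnFailure=oom-alert@%n.service

[Service]
MemoryMax=${max}
OOMPolicy=stop
Restart=no
EOF
}

# The 4G ceiling is an example value only.
render_no_restart_dropin 4G
```

Here %n is a systemd specifier that expands to the full name of the failing unit, so the alert unit knows which service went down.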

Capping Restart Attempts

To prevent infinite restart loops when the OOM condition is persistent:

[Service]
OOMPolicy=continue
Restart=always
RestartSec=30

[Unit]
# Allow at most 3 starts within 5 minutes before giving up
StartLimitIntervalSec=300
StartLimitBurst=3

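Once the unit has been started 3 times within 5 minutes, the next start attempt is refused and the unit stays in failed state, signaling that the root cause needs attention rather than perpetual restarts. After fixing the cause, systemctl reset-failed immich-web clears the failed state and the start counter.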
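One pitfall with these settings: if RestartSec multiplied by StartLimitBurst is not smaller than StartLimitIntervalSec, restarts are spaced too far apart for the limit to trip, and the loop continues indefinitely. The sketch below checks this under the simplifying assumption that the service dies immediately after each start; the helper name `start_limit_reachable` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if back-to-back restarts can exhaust the
# start limit. BURST starts are allowed per INTERVAL; the next start comes
# no earlier than BURST * RESTART_SEC seconds after the first one.
# Simplification: ignores how long the service runs before each crash.
# Usage: start_limit_reachable <RestartSec> <StartLimitBurst> <StartLimitIntervalSec>
start_limit_reachable() {
    [ $(( $1 * $2 )) -lt "$3" ]
}

if start_limit_reachable 30 3 300; then
    echo "limit can trip: 3 x 30s = 90s fits inside the 300s window"
fi
```

With the values used above (RestartSec=30, StartLimitBurst=3, StartLimitIntervalSec=300) the check passes; with RestartSec=120 it would not, because 3 x 120s exceeds the 300s window.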

Sources

  • daily/2026-04-24.md — Immich OOM recovery: immich-web.service killed at 5.1/6 GB; MemoryMax=7G, OOMPolicy=continue, RestartSec=10 added to unit; pattern generalized here for reuse across other memory-constrained homelab services