2022-05-30
Systemd memory limits and strict memory overcommit
We run some of our servers with strict overcommit handling for
total virtual memory, which periodically causes us heartburn
because an increasing number of important things run as non-root
users and so can't be protected from overcommit. For instance, the
SLURM control daemon on our SLURM cluster's master node runs as the
'slurm' user instead of root, and so one day we had slurmctld die
on us as a result of the system being run into its overcommit
memory limit.
(This was not an out-of-memory kill; instead, it seems to have been
that slurmctld was the unlucky party that next tried to allocate
more memory once the system had hit the limit. The allocation failed
deep inside POSIX thread creation, the error propagated back to
slurmctld, and it wound up exiting.)
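For reference, strict overcommit on Linux means setting
vm.overcommit_memory to 2, at which point the kernel's commit limit
is all of swap plus a vm.overcommit_ratio percentage of RAM (or a
fixed vm.overcommit_kbytes amount). A sketch of a typical sysctl
setup, with the default ratio rather than our actual numbers:

    # eg in /etc/sysctl.d/99-overcommit.conf
    vm.overcommit_memory = 2
    # commit limit = swap + this percent of RAM; 50 is the default
    vm.overcommit_ratio = 50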
We also use systemd resource controls
to limit how much memory (and CPU) each person can use on shared
systems. The obvious thing recently occurred to me: we can use the
same systemd resource controls to limit how much memory user
sessions can use in total, thereby possibly fencing off things
like the SLURM daemon so that they wouldn't be driven out of memory
just because someone had logged in and was running a big process.
This is possible for user sessions because systemd puts everything
into a hierarchy, where all user sessions go under user.slice; if
you put a memory limit on user.slice, it applies to everyone
collectively.
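As an illustration, a collective limit on user.slice can be set on
the fly with systemctl or made persistent with a drop-in; the 48G
here is a made-up example value, not our actual setting:

    # on the fly (--runtime means it lasts until reboot)
    systemctl set-property --runtime user.slice MemoryMax=48G

    # persistent: /etc/systemd/system/user.slice.d/50-memory.conf
    [Slice]
    MemoryMax=48G

(MemoryMax= is the cgroup v2 name; under cgroup v1 the equivalent
setting is MemoryLimit= instead.)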
After thinking about this a bit more, I realized that this is not
quite the same thing. The fundamental issue is that what we can
limit through systemd (and through cgroups in general) is actual
RAM usage, but what strict overcommit cares about is committed
address space. If you allocate memory and then immediately use it,
the two are more or less the same, but as we've seen, not all
programs do this. The important consequence is that a systemd
memory limit on user.slice doesn't stop people from running a
strict overcommit system out of memory. More exactly, it doesn't
prevent them from pushing the system into a state where nobody can
allocate any more memory and processes start failing as a result.
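You can see the difference directly by allocating a big chunk of
address space and never touching it; the system-wide Committed_AS
figure jumps while the process's resident set stays tiny. A minimal
sketch of this (Linux-specific, and the 1 GiB size is arbitrary):

    import mmap, re

    def meminfo(field):
        # Read a field like 'Committed_AS' from /proc/meminfo, in kB.
        with open("/proc/meminfo") as f:
            return int(re.search(rf"^{field}:\s+(\d+) kB",
                                 f.read(), re.M).group(1))

    def vmrss():
        # Our own resident set size from /proc/self/status, in kB.
        with open("/proc/self/status") as f:
            return int(re.search(r"^VmRSS:\s+(\d+) kB",
                                 f.read(), re.M).group(1))

    before = meminfo("Committed_AS")
    # Map 1 GiB of anonymous memory and never touch it. All of it
    # counts as committed address space immediately, but almost none
    # of it ever becomes resident RAM, so a cgroup memory limit
    # never sees it.
    m = mmap.mmap(-1, 1 << 30)
    print("Committed_AS grew by about",
          meminfo("Committed_AS") - before, "kB")
    print("but our VmRSS is only", vmrss(), "kB")

Under strict overcommit, that mmap() counts fully against the commit
limit; against a systemd MemoryMax= it counts for almost nothing.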
I suspect that what we should do is pick one or the other; either
we use strict memory overcommit or we use systemd limits, but we
don't try to mix both. A lot of the time I suspect that we should
use systemd memory limits, not strict overcommit, because memory
limits give us much finer control over what gets limited and what
doesn't. On Ubuntu 22.04, where systemd normally uses cgroup v2,
we may experiment with setting a MemoryMin on system.slice to more
or less reserve a certain amount of memory for system services,
instead of trying to limit user.slice to most but not all of the
system's memory.
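A sketch of that reservation as a drop-in, again with a made-up
number rather than anything we've actually settled on:

    # /etc/systemd/system/system.slice.d/50-memory.conf
    [Slice]
    # try to keep at least 4G of RAM for system services; memory
    # below this amount isn't reclaimed from them under pressure
    MemoryMin=4G

MemoryMin= maps to the cgroup v2 'memory.min' protection, which is
why this needs cgroup v2 in the first place.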
(If we're concerned about the possibility of swap thrashing, systemd with cgroup v2 also allows setting a strict limit on the amount of swap that can be used, for example by all of a person's processes together.)
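The setting for this is MemorySwapMax=. A hypothetical per-person
version, using a drop-in for the user-.slice template so it applies
to each user's slice separately (the 1G is an arbitrary example):

    # /etc/systemd/system/user-.slice.d/50-swap.conf
    [Slice]
    # all of one person's processes together get at most 1G of swap
    MemorySwapMax=1G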