Systemd memory limits and strict memory overcommit

May 30, 2022

We run some of our servers with strict overcommit handling for total virtual memory, which unfortunately causes us periodic heartburn because an increasing number of important things run as non-root users and so don't benefit from the memory the kernel reserves for root under strict overcommit. For instance, the SLURM control daemon on our SLURM cluster's master node runs as the 'slurm' user instead of root, and so one day we had slurmctld die on us as a result of the system being run into its overcommit memory limits.
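(For concreteness, strict overcommit on Linux is the vm.overcommit_memory=2 sysctl mode, where the kernel accounts committed address space against a fixed limit. A sketch of the relevant sysctl settings, with illustrative numbers rather than our actual values:

    # Never overcommit; account committed address space strictly.
    vm.overcommit_memory = 2
    # The commit limit is swap plus this percentage of RAM.
    vm.overcommit_ratio = 80
    # Free memory reserved for root-owned processes, which doesn't
    # help daemons like slurmctld that run as non-root users.
    vm.admin_reserve_kbytes = 131072

You can see the resulting CommitLimit and the current Committed_AS in /proc/meminfo.)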

(This was not an out of memory kill; instead, it seems to have been that slurmctld was the unlucky party that next tried to allocate more memory once the system had hit the limit. The allocation failed deep inside POSIX thread creation, the error propagated back to slurmctld, and it wound up exiting.)

We also use systemd resource controls to limit how much memory (and CPU) each person can use on shared systems. Recently the obvious thing occurred to me: we can use the same systemd resource controls to limit how much memory user sessions can use in total, thereby possibly fencing off things like the SLURM daemon so that they wouldn't be driven out of memory just because someone had logged in and was running a big process. This is possible for user sessions because systemd organizes everything into a hierarchy in which all user sessions sit under user.slice; if you put a memory limit on user.slice, it applies to everyone collectively.
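(As an illustration of the mechanics, not our actual configuration: a collective cap on user.slice could be a drop-in like /etc/systemd/system/user.slice.d/90-memory.conf:

    [Slice]
    # Hard cap on the RAM all user sessions can use together.
    # 48G is an illustrative number, not a real setting of ours.
    MemoryMax=48G

To experiment without writing a file, 'systemctl set-property --runtime user.slice MemoryMax=48G' does the same thing until the next reboot.)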

After thinking about this a bit more, I realized that this is not quite the same thing. The fundamental issue is that what we can limit through systemd (and through cgroups in general) is actual RAM usage, but what strict overcommit cares about is committed address space. If you allocate memory and then immediately use it, the two are more or less the same, but as we've seen, not all programs do this. The important consequence is that a systemd memory limit on user.slice doesn't stop people from running a strict overcommit system out of memory. More exactly, it doesn't prevent them from pushing the system into a state where nobody can allocate any more memory and processes start failing as a result.
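(You can watch the two quantities diverge. Assuming cgroup v2 is mounted at /sys/fs/cgroup, as it is on Ubuntu 22.04:

    # What strict overcommit accounts and limits (system-wide):
    grep -E '^Commit' /proc/meminfo
    # What a systemd memory limit on user.slice actually constrains:
    cat /sys/fs/cgroup/user.slice/memory.current

A program that allocates a huge amount of memory and never touches it drives up Committed_AS, and can run into CommitLimit, while barely moving memory.current.)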

I suspect that what we should do is pick one or the other: either we use strict memory overcommit or we use systemd limits, but we don't try to mix both. A lot of the time we should probably use systemd memory limits rather than strict overcommit, because memory limits give us much finer control over what gets limited and what doesn't. On Ubuntu 22.04, where systemd normally uses cgroup v2, we may experiment with setting a MemoryMin on system.slice to more or less reserve a certain amount of memory for system services, instead of trying to limit user.slice to most but not all of the system's memory.
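(A sketch of what that experiment might look like, again with an illustrative number: a drop-in such as /etc/systemd/system/system.slice.d/90-memory.conf containing:

    [Slice]
    # Under memory pressure, try to keep this much RAM for
    # system services; 4G is an illustrative figure.
    MemoryMin=4G

MemoryMin is a cgroup v2 only setting, which is part of why this waits for Ubuntu 22.04 for us.)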

(If we're concerned about the possibility of swap thrashing, systemd with cgroup v2 also allows setting a strict limit on the amount of swap that can be used, for example by all of a person's processes together.)
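(Illustratively, a per-person swap cap could go in a drop-in that applies to every user's slice, for example /etc/systemd/system/user-.slice.d/90-swap.conf:

    [Slice]
    # Cap how much swap each person's processes can use
    # collectively; 1G is an illustrative figure.
    MemorySwapMax=1G

This uses systemd's support for drop-ins on truncated unit names, so it applies to each user-UID.slice and thus to each person separately.)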
