2024-12-27
Cgroup V2 memory limits and their potential for thrashing
Recently I read 32 MiB Working Sets on a 64 GiB machine (via), which recounts how, in some situations, Windows could limit the working set ('resident set') of programs to 32 MiB, resulting in a lot of CPU time being spent on soft (or 'minor') page faults. On Linux, you can do similar things to limit the memory usage of a program or an entire cgroup, for example through systemd, and it occurred to me to wonder if you can get the same thrashing effect with cgroup V2 memory limits. Broadly, I believe that the answer depends on what you're using the memory for and what you use to set limits, and it's certainly possible to wind up setting limits so that you get thrashing.
(As a result, this is now something that I'll want to think about when setting cgroup memory limits, and maybe watch out for.)
Cgroup V2 doesn't have anything that directly limits a cgroup's working set (what is usually called the 'resident set size' (RSS) on Unix systems). The closest it has is memory.high, which throttles a cgroup's memory usage and puts it under heavy memory reclaim pressure when it hits this high limit.
What happens next depends on what sort of memory pages are being
reclaimed from the process. If they are backed by files (for example,
they're pages from the program, shared libraries, or memory mapped
files), they will be dropped from the process's resident set but
may stay in memory so it's only a soft page fault when they're next
accessed. However, if they're anonymous pages of memory the process
has allocated, they must be written to swap (if there's room for
them) and I don't know if the original pages stay in memory afterward
(and so are eligible for a soft page fault when next accessed). If
the process keeps accessing anonymous pages that were previously
reclaimed, it will thrash on either soft or hard page faults.
(The memory.high limit is set by systemd's MemoryHigh= setting.)
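As a concrete illustration, here's a minimal sketch in Python of setting memory.high by writing directly to the cgroup v2 filesystem; the cgroup path is a made-up example, and you need enough privileges to write the cgroup's control files. (With systemd you'd instead set MemoryHigh= in a unit file or use 'systemctl set-property'.)

    #!/usr/bin/env python3
    # Minimal sketch: set a cgroup v2 memory.high limit by writing the
    # cgroup's control file directly. The cgroup path is hypothetical;
    # writing these files normally requires root.
    from pathlib import Path

    CGROUP = Path("/sys/fs/cgroup/myservice.slice")  # hypothetical cgroup

    def set_memory_high(limit: str) -> None:
        # memory.high accepts a byte count (with optional K/M/G suffixes)
        # or the literal string "max" to remove the limit.
        (CGROUP / "memory.high").write_text(limit + "\n")

    if __name__ == "__main__":
        set_memory_high("512M")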
However, the memory usage of a cgroup is not necessarily in ordinary process memory that counts toward RSS; it can be in all sorts of kernel caches and structures. The memory.high limit affects all of them and will generally shrink all of them, so in practice what it actually limits depends partly on what the processes in the cgroup are doing and what sorts of memory that activity allocates. Some of this memory can thrash the way user memory does (for example, memory for disk cache), but some won't necessarily (I believe shrinking some sorts of memory usage discards the memory outright).
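To see where a cgroup's memory usage actually is, you can look at the cgroup's memory.stat file, which breaks the usage down into anonymous memory, file-backed pages, kernel slab memory, and so on. A quick sketch (again with a hypothetical cgroup path):

    #!/usr/bin/env python3
    # Sketch: summarize the major components of a cgroup's memory usage
    # from its memory.stat file, which is 'name value' pairs, one per line.
    from pathlib import Path

    CGROUP = Path("/sys/fs/cgroup/myservice.slice")  # hypothetical cgroup

    def memory_breakdown() -> dict:
        stats = {}
        for line in (CGROUP / "memory.stat").read_text().splitlines():
            name, value = line.split()
            stats[name] = int(value)
        return stats

    if __name__ == "__main__":
        stats = memory_breakdown()
        # anon, file, slab, kernel_stack, sock, and shmem are standard
        # cgroup v2 memory.stat fields.
        for field in ("anon", "file", "slab", "kernel_stack", "sock", "shmem"):
            print(f"{field}: {stats.get(field, 0) / (1024 * 1024):.1f} MiB")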
Since memory.high is to a certain degree advisory and doesn't guarantee that the cgroup never goes over this memory usage, I think people more commonly use memory.max (for example, via the systemd MemoryMax= setting). This is a hard limit, and the kernel will kill programs in the cgroup if they push hard on going over it; however, it will first try to reduce usage with other measures, including pushing pages into swap space. In theory this could result in either swap thrashing or soft page fault thrashing, if the memory usage was just right.
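One way to see whether a cgroup is merely being throttled at memory.high or is actually running into memory.max is its memory.events file, which has counters for (among other things) how many times each limit was hit and how many OOM kills happened. A sketch of reading it:

    #!/usr/bin/env python3
    # Sketch: read a cgroup's memory.events counters. 'high' and 'max'
    # count how often the cgroup hit memory.high and memory.max, while
    # 'oom' and 'oom_kill' count OOM events and actual process kills.
    from pathlib import Path

    CGROUP = Path("/sys/fs/cgroup/myservice.slice")  # hypothetical cgroup

    def memory_events() -> dict:
        events = {}
        for line in (CGROUP / "memory.events").read_text().splitlines():
            name, value = line.split()
            events[name] = int(value)
        return events

    if __name__ == "__main__":
        ev = memory_events()
        print(f"hit memory.high {ev.get('high', 0)} times, "
              f"hit memory.max {ev.get('max', 0)} times, "
              f"{ev.get('oom_kill', 0)} OOM kills")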
However, in our environments, cgroups that hit memory.max generally wind up having programs killed rather than sitting there thrashing (at least not for very long). This is probably partly because we don't configure much swap space on our servers, so there's not much room between hitting memory.max with swap still available and exhausting the swap space.
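If you want to know how much room a particular cgroup has left in swap, cgroup v2 exposes memory.swap.current and memory.swap.max. Here's a sketch of reporting the two (memory.swap.max reads as the literal string 'max' when the cgroup has no swap limit of its own):

    #!/usr/bin/env python3
    # Sketch: report a cgroup's swap usage against its per-cgroup swap
    # limit. memory.swap.current is bytes of swap in use; memory.swap.max
    # is a byte limit, or the literal string "max" when unlimited.
    from pathlib import Path

    CGROUP = Path("/sys/fs/cgroup/myservice.slice")  # hypothetical cgroup

    def swap_headroom() -> str:
        current = int((CGROUP / "memory.swap.current").read_text())
        limit = (CGROUP / "memory.swap.max").read_text().strip()
        if limit == "max":
            return f"{current} bytes of swap used, no per-cgroup swap limit"
        return f"{current} of {limit} bytes of swap used"

    if __name__ == "__main__":
        print(swap_headroom())

Of course this only covers the cgroup's own limit; the system can still run out of swap globally.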
My view is that this killing behavior generally makes it better to set memory.max than memory.high. If you have a cgroup that overruns whatever limit you're setting, using memory.high is much more likely to cause some sort of thrashing, because it never kills processes (the kernel documentation even tells you that memory.high should be used with some sort of monitoring to 'alleviate heavy reclaim pressure', ie to either raise the limit or actually kill things). In a past entry I set MemoryHigh= to a bit less than my MemoryMax= setting, but I don't think I'll do that in the future; any gap between memory.high and memory.max is an opportunity for thrashing through that 'heavy reclaim pressure'.
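If you do use memory.high, the sort of monitoring the kernel documentation has in mind could be as simple as watching the cgroup's memory.pressure file (the PSI interface), which reports what fraction of the time tasks in the cgroup were stalled waiting on memory. Here's a sketch of polling it; the cgroup path and the alert threshold are both made-up examples:

    #!/usr/bin/env python3
    # Sketch: watch a cgroup's memory PSI for signs of heavy reclaim
    # pressure (ie, potential thrashing). memory.pressure has 'some' and
    # 'full' lines in the form:
    #   some avg10=1.23 avg60=0.50 avg300=0.10 total=12345
    import time
    from pathlib import Path

    CGROUP = Path("/sys/fs/cgroup/myservice.slice")  # hypothetical cgroup
    THRESHOLD = 20.0  # made-up example: alert at 20% 'some' stall over 10s

    def some_avg10() -> float:
        for line in (CGROUP / "memory.pressure").read_text().splitlines():
            if line.startswith("some"):
                fields = dict(f.split("=") for f in line.split()[1:])
                return float(fields["avg10"])
        return 0.0

    if __name__ == "__main__":
        while True:
            pct = some_avg10()
            if pct > THRESHOLD:
                print(f"memory pressure is high: some avg10={pct:.2f}%")
            time.sleep(10)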