Wandering Thoughts archives

2024-12-27

Cgroup V2 memory limits and their potential for thrashing

Recently I read '32 MiB Working Sets on a 64 GiB machine', which recounts how in some situations Windows could limit the working set ('resident set') of programs to 32 MiB, resulting in a lot of CPU time being spent on soft (or 'minor') page faults. On Linux, you can do similar things to limit the memory usage of a program or an entire cgroup, for example through systemd, and it occurred to me to wonder if you can get the same thrashing effect with cgroup V2 memory limits. Broadly, I believe that the answer depends on what you're using the memory for and what you use to set limits, and it's certainly possible to wind up setting limits so that you get thrashing.

(As a result, this is now something that I'll want to think about when setting cgroup memory limits, and maybe watch out for.)

Cgroup V2 doesn't have anything that directly limits a cgroup's working set (what is usually called the 'resident set size' (RSS) on Unix systems). The closest it has is memory.high, which throttles a cgroup's memory usage and puts it under heavy memory reclaim pressure when it hits this high limit. What happens next depends on what sort of memory pages are being reclaimed from the process. If they are backed by files (for example, they're pages from the program, shared libraries, or memory mapped files), they will be dropped from the process's resident set but may stay in memory so it's only a soft page fault when they're next accessed. However, if they're anonymous pages of memory the process has allocated, they must be written to swap (if there's room for them) and I don't know if the original pages stay in memory afterward (and so are eligible for a soft page fault when next accessed). If the process keeps accessing anonymous pages that were previously reclaimed, it will thrash on either soft or hard page faults.
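One rough way to tell which kind of page faults a process is taking is to watch the minor and major fault counters that Linux exposes in /proc/<pid>/stat. The following is a minimal sketch, not a full monitoring tool; the function names and the sample line are made up for illustration, and the field positions are the ones documented in proc(5).

```python
# Sketch: pull the minor (soft) and major (hard) page fault counters for a
# process out of /proc/<pid>/stat. Function names here are hypothetical.


def parse_fault_counts(stat_line):
    """Return (minflt, majflt) from the text of /proc/<pid>/stat."""
    # The command name is in parentheses and may itself contain spaces or
    # ')' characters, so split on the last ')' rather than on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); minflt is field 10 and majflt is field 12.
    return int(rest[7]), int(rest[9])


def read_fault_counts(pid="self"):
    """Read the counters for a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_fault_counts(f.read())


# A made-up stat line, truncated after majflt, for illustration only.
sample = "1234 (my prog) S 1 1234 1234 0 -1 4194304 52000 0 17 0"
print(parse_fault_counts(sample))
```

The counters are cumulative, so what matters for thrashing is how fast they grow between two samples, not their absolute values.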

(The memory.high limit is set by systemd's MemoryHigh=.)

However, the memory usage of a cgroup is not necessarily in ordinary process memory that counts for RSS; it can be in all sorts of kernel caches and structures. The memory.high limit affects all of them and will generally shrink all of them, so in practice what it actually limits depends partly on what the processes in the cgroup are doing and what sort of memory that allocates. Some of this memory can also thrash like user memory does (for example, memory for disk cache), but some won't necessarily (I believe shrinking some sorts of memory usage discards the memory outright).
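To see what sort of memory a cgroup's usage is actually made of, you can look at its memory.stat file, which is a list of 'key value' lines. Here is a small sketch that parses that format and reports the largest categories; the function names are hypothetical and the sample numbers are invented, although the keys are real memory.stat fields.

```python
# Sketch: break a cgroup's memory.stat down into its biggest consumers.
# The sample values below are made up; only the key names are real.

SAMPLE_MEMORY_STAT = """\
anon 734003200
file 209715200
kernel_stack 1048576
pagetables 4194304
sock 0
shmem 8388608
slab_reclaimable 52428800
slab_unreclaimable 10485760
"""

# 'shmem' is left out because it is already counted as part of 'file'.
CATEGORIES = (
    "anon", "file", "kernel_stack", "pagetables", "sock",
    "slab_reclaimable", "slab_unreclaimable",
)


def parse_flat_keyed(text):
    """Parse 'key value' lines (the cgroup v2 flat-keyed format) to a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def largest_consumers(stats, top=3):
    """Return the `top` largest (category, bytes) pairs, biggest first."""
    picked = [(key, stats[key]) for key in CATEGORIES if key in stats]
    return sorted(picked, key=lambda kv: kv[1], reverse=True)[:top]


for name, size in largest_consumers(parse_flat_keyed(SAMPLE_MEMORY_STAT)):
    print(f"{name:20s} {size / 2**20:8.1f} MiB")
```

On a real system you would read the text from something like /sys/fs/cgroup/<your cgroup>/memory.stat instead of the sample string.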

Since memory.high is to a certain degree advisory and doesn't guarantee that the cgroup never goes over this memory usage, I think people more commonly use memory.max (for example, via the systemd MemoryMax= setting). This is a hard limit and will kill programs in the cgroup if they push hard on going over it; however, the memory system will try to reduce usage with other measures, including pushing pages into swap space. In theory this could result in either swap thrashing or soft page fault thrashing, if the memory usage was just right. However, in our environments cgroups that hit memory.max generally wind up having programs killed rather than sitting there thrashing (at least for very long). This is probably partly because we don't configure much swap space on our servers, so there's not much room between hitting memory.max with swap available and exhausting the swap space too.
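A cgroup's memory.events file records how often each of these outcomes has happened, which gives a rough way to tell 'throttled at memory.high' apart from 'pushed against memory.max' and 'had something killed'. The sketch below assumes the standard 'high', 'max' and 'oom_kill' counters; the function names and the wording of the summaries are made up for illustration.

```python
# Sketch: summarize a cgroup's memory.events counters. The counters are
# cumulative since the cgroup was created, so a real monitor would compare
# two samples taken some time apart rather than look at one reading.


def parse_flat_keyed(text):
    """Parse 'key value' lines (the cgroup v2 flat-keyed format) to a dict."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            events[parts[0]] = int(parts[1])
    return events


def describe_events(events):
    """Return the most severe outcome the counters show (hypothetical labels)."""
    if events.get("oom_kill", 0) > 0:
        return "processes were OOM-killed"
    if events.get("max", 0) > 0:
        return "usage hit memory.max and was reclaimed"
    if events.get("high", 0) > 0:
        return "throttled at memory.high"
    return "no limit events"


# Made-up sample contents, for illustration only.
sample = "low 0\nhigh 42\nmax 0\noom 0\noom_kill 0\n"
print(describe_events(parse_flat_keyed(sample)))
```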

My view is that this generally makes it better to set memory.max than memory.high. If you have a cgroup that overruns whatever limit you're setting, using memory.high is much more likely to cause some sort of thrashing because it never kills processes (the kernel documentation even tells you that memory.high should be used with some sort of monitoring to 'alleviate heavy reclaim pressure', i.e. either raise the limit or actually kill things). In a past entry I set MemoryHigh= to a bit less than my MemoryMax= setting, but I don't think I'll do that in the future; any gap between memory.high and memory.max is an opportunity for thrashing through that 'heavy reclaim pressure'.
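If you want to check existing cgroups for such a gap, the two limit files each contain either a byte count or the literal word 'max' (meaning no limit). The following is a small sketch of computing the window between them; the function names are hypothetical and the example values (900 MiB and 1 GiB) are made up.

```python
# Sketch: compute the window between memory.high and memory.max in which a
# cgroup is under heavy reclaim pressure but not yet eligible to be killed.
import math


def parse_limit(text):
    """Return a limit in bytes, or None if the file says 'max' (no limit)."""
    value = text.strip()
    return None if value == "max" else int(value)


def reclaim_window(high_text, max_text):
    """Bytes between memory.high and memory.max.

    Returns 0 if memory.high is unset (there is no throttling point) and
    math.inf if memory.high is set but memory.max is not.
    """
    high = parse_limit(high_text)
    hard = parse_limit(max_text)
    if high is None:
        return 0
    if hard is None:
        return math.inf
    return max(hard - high, 0)


# Example with made-up limits: memory.high = 900 MiB, memory.max = 1 GiB.
print(reclaim_window("943718400\n", "1073741824\n") // 2**20, "MiB")
```

Setting only memory.max corresponds to the first case, where the window is zero.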

linux/CgroupV2MemoryLimitsAndThrashing written at 23:10:34;

