
2022-06-15

Understanding some peculiarities of per-cgroup memory usage accounting

Linux distributions that use systemd probably have cgroup memory accounting turned on (this is system.conf's DefaultMemoryAccounting setting, which has defaulted to on since systemd 238). Memory accounting is handy because it gives you a hierarchical per-cgroup breakdown of memory usage for your services, user sessions, scopes, and so on, which you can see with tools like systemd-cgtop or perhaps your own tools. However, if you look at this you can get some surprises, such as a virtual machine that you configured with 4 GB of RAM but that systemd-cgtop says is using 14 GB of RAM. Or you could have no virtual machines running, yet systemd-cgtop says machine.slice is using 24.8 GB.
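
As an illustration of where these numbers come from, here is a minimal Python sketch that prints each top-level cgroup's memory.current figure (its total charged memory), which is more or less what systemd-cgtop is summarizing. It assumes a unified cgroup v2 hierarchy mounted at /sys/fs/cgroup, which is what modern systemd setups give you:

import pathlib

# A minimal sketch, not a real tool. Assumes a unified cgroup v2
# hierarchy mounted at /sys/fs/cgroup (the modern systemd default).
CGROOT = pathlib.Path("/sys/fs/cgroup")

def human(nbytes):
    # Crude human-readable formatting of a byte count.
    for unit in ("B", "K", "M", "G"):
        if nbytes < 1024:
            return f"{nbytes:.1f}{unit}"
        nbytes /= 1024
    return f"{nbytes:.1f}T"

for cg in sorted(CGROOT.iterdir()):
    curfile = cg / "memory.current"
    if curfile.is_file():
        # memory.current is the cgroup's total charged memory in bytes.
        print(f"{cg.name:<30} {human(int(curfile.read_text()))}")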

What's going on is visible in the cgroup memory.stat file, which shows a detailed breakdown of what the memory charged to a cgroup is actually being used for. Here is a processed version from that 24.8 GB machine.slice:

file 24.2G
inactive_file 21.8G
active_file 2.5G
slab 625.6M
slab_reclaimable 625.6M
[...]
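
Producing a breakdown like this is straightforward; here is a sketch in Python of how you might pull these fields out yourself, assuming cgroup v2 and the standard path for machine.slice. The field names are the kernel's own, straight from memory.stat:

# A sketch of extracting selected fields from a cgroup's memory.stat.
# Assumes cgroup v2; the path is the standard one for machine.slice.
STATFILE = "/sys/fs/cgroup/machine.slice/memory.stat"

stats = {}
with open(STATFILE) as f:
    for line in f:
        field, value = line.split()
        stats[field] = int(value)

# All memory.stat values are byte counts.
for field in ("file", "inactive_file", "active_file",
              "slab", "slab_reclaimable"):
    print(f"{field:<20} {stats.get(field, 0) / 2**30:.1f}G")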

What's happening here is twofold. First, the Linux kernel is doing its usual thing of using otherwise unused RAM as a filesystem cache. Second, it's allocating this filesystem cache RAM to a cgroup and a cgroup hierarchy on some basis (here, this is probably mostly filesystem cache of VM disk images).

When the specific virtual machine that was using those disk images was active, it would have had a cgroup of its own under machine.slice, and that cgroup would have been charged for all of this RAM. When the virtual machine was shut down, its cgroup went away, but the RAM was still accounted to machine.slice in general because of the hierarchical nature of cgroup resource accounting.

In the larger scale of things this is what you want. A cgroup should be charged for the RAM it uses inside the kernel as well as its user-level RAM, because there are various ways of tying up RAM inside the kernel. In the specific case of filesystem cache, and especially inactive filesystem cache, it can look a little odd. Virtual machines mostly access unique filesystem data in the form of VM disk images, but there might be cases where processes in several different cgroups are all reading the same large chunks of data on disk and causing it to go into the filesystem cache. Which cgroup the memory gets charged to may feel a bit arbitrary, and I don't even know how the kernel decides.

(For instance, if some file cache memory becomes inactive file cache in one cgroup and then is accessed from another one, it's not clear if cgroup accounting will now charge it to the second cgroup as active file cache. The cgroup v2 documentation on memory ownership is not particularly specific, and in fact says the behavior is non-deterministic.)

The larger lesson I draw from this is that the numbers shown in things like systemd-cgtop are not necessarily a reflection of how much memory a systemd service, user session, or whatever is actually using. Machine.slice has been charged for almost 25 GB of RAM, but it's using essentially none of it at the moment (there are no active processes under machine.slice on that machine right now).

If I write a serious cgroup memory usage program (instead of quick script hacks), I'll probably want to do something about this. If nothing else, I'll want to subtract the inactive_file number from the nominal total RAM usage. Otherwise I get a pretty misleading picture of what I'm really interested in, which is something more or less like 'what is the cgroup's RSS'.

(Given the explicit cgroup v2 non-determinism for memory accessed by multiple cgroups, such as filesystem cache, perhaps I should subtract all 'file' memory from the current memory usage.)
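
As a sketch of what that adjustment might look like, with the choice between subtracting only inactive_file or all of 'file' left as a parameter, since it's a policy decision rather than anything the kernel dictates:

import pathlib

def adjusted_usage(cgpath, subtract_all_file=False):
    # An 'RSS-like' figure for a cgroup: memory.current minus file
    # cache. Subtracting all of 'file' rather than just inactive_file
    # is the more pessimistic choice, per the caveat above.
    cg = pathlib.Path(cgpath)
    current = int((cg / "memory.current").read_text())
    stats = {}
    for line in (cg / "memory.stat").read_text().splitlines():
        field, value = line.split()
        stats[field] = int(value)
    key = "file" if subtract_all_file else "inactive_file"
    return current - stats[key]

print(adjusted_usage("/sys/fs/cgroup/machine.slice"))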

linux/CgroupsMemoryUsageAccounting written at 20:36:29

