Understanding some peculiarities of per-cgroup memory usage accounting
Linux distributions that use systemd probably have cgroup memory accounting
turned on (this is system.conf's DefaultMemoryAccounting= setting,
which has defaulted to on since systemd 238). Memory accounting
is handy because it will give you a hierarchical per-cgroup breakdown
of memory usage for your services, user sessions, scopes, and so
on, which you can see with tools like systemd-cgtop or
perhaps your own tools.
However, if you look at this you can get some surprises, such as a virtual
machine that you configured for 4 GB of RAM but systemd-cgtop says is
using 14 GB of RAM. Or you could have no virtual machines running yet
systemd-cgtop says machine.slice is using 24.8 GB.
What is going on is visible in the cgroup's memory.stat file,
which shows a detailed breakdown of what the memory charged to a
cgroup is actually being used for. Here is a processed version from
that 24.8 GB machine.slice:
file               24.2G
inactive_file      21.8G
active_file         2.5G
slab              625.6M
slab_reclaimable  625.6M
[...]
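A processed listing like this can be produced with a few lines of Python. This is only a minimal sketch: it assumes a cgroup v2 unified hierarchy mounted at /sys/fs/cgroup and the usual 'key value in bytes' layout of memory.stat, and the helper names (parse_memory_stat, humanize, breakdown) are made up for illustration.

```python
from pathlib import Path

# Assumption: cgroup v2 unified hierarchy mounted at the usual place.
CGROUP_ROOT = Path("/sys/fs/cgroup")


def parse_memory_stat(text):
    """Turn the 'key value' lines of a memory.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            stats[fields[0]] = int(fields[1])
    return stats


def humanize(nbytes):
    """Format a byte count with one decimal and a B/K/M/G suffix (1024-based)."""
    value = float(nbytes)
    for unit in ("B", "K", "M"):
        if value < 1024:
            return f"{value:.1f}{unit}"
        value /= 1024
    return f"{value:.1f}G"


def breakdown(cgroup, keys=("file", "inactive_file", "active_file",
                            "slab", "slab_reclaimable")):
    """Return (key, human-readable size) pairs for one cgroup, e.g. 'machine.slice'."""
    text = (CGROUP_ROOT / cgroup / "memory.stat").read_text()
    stats = parse_memory_stat(text)
    # Skip keys that this kernel's memory.stat does not report.
    return [(key, humanize(stats[key])) for key in keys if key in stats]
```

Calling breakdown("machine.slice") and printing each pair gives a listing in roughly the shape shown above. The values depend on the machine, and the set of keys depends on the kernel version.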
What's happening here is twofold. First, the Linux kernel is doing its usual thing of using otherwise unused RAM as a filesystem cache. Second, it's charging this filesystem cache RAM to a cgroup, and through it to that cgroup's ancestors in the hierarchy, on some basis (here, this is probably mostly filesystem cache of VM disk images).
When the specific virtual machine that was using its virtual disks was active, it would have had a cgroup of its own under machine.slice and that cgroup would have been charged for all of this RAM. When the virtual machine was shut down, its cgroup went away, but the RAM was still accounted to machine.slice in general because of the hierarchical nature of all cgroup resource accounting.
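One way to see this leftover memory is to compare a parent cgroup's usage with the sum of its current children; whatever is left over is charged to the parent itself, which includes anything inherited from cgroups that no longer exist. The following is a rough sketch, assuming cgroup v2 (where each cgroup directory has a memory.current file that includes its descendants). The function names are hypothetical.

```python
from pathlib import Path


def residual_usage(parent_current, child_currents):
    """Memory charged to the parent that no existing child accounts for."""
    return parent_current - sum(child_currents)


def read_current(cgroup_dir):
    """Read memory.current (a single byte count) from a cgroup directory."""
    return int((Path(cgroup_dir) / "memory.current").read_text())


def residual_for(cgroup_dir):
    """Parent usage minus its direct children, e.g. for /sys/fs/cgroup/machine.slice."""
    cgroup_dir = Path(cgroup_dir)
    children = [
        read_current(child)
        for child in cgroup_dir.iterdir()
        # Child cgroups are subdirectories; only count ones with memory accounting.
        if child.is_dir() and (child / "memory.current").exists()
    ]
    return residual_usage(read_current(cgroup_dir), children)
```

The files are read one after another rather than atomically, so on a busy system the result is only approximate and can even come out slightly negative.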
In the larger scale of things this is what you want. A cgroup should be charged for the RAM it uses inside the kernel as well as its user level RAM, because there are various ways of tying up RAM inside the kernel. In the specific case of filesystem cache and especially inactive filesystem cache it can look a little bit odd. Virtual machines mostly access unique filesystem data in the form of VM disk images, but there might be cases where processes in several different cgroups are all reading the same large chunks of data on disk and causing it to go into the filesystem cache. Which cgroup the memory gets charged to may feel a bit arbitrary, and I don't even know how the kernel decides.
(For instance, if some file cache memory becomes inactive file cache in one cgroup and then is accessed from another one, it's not clear if cgroup accounting will now charge it to the second cgroup as active file cache. The cgroup v2 documentation on memory ownership is not particularly specific, and in fact describes which cgroup gets charged as 'in-deterministic'.)
The larger lesson I draw from this is that the numbers shown in things like systemd-cgtop are not necessarily a reflection of how much memory a systemd service, user session, or whatever is actually using. Here, machine.slice has been charged for almost 25 GB of RAM, but it's using essentially none of it at the moment (there are no active processes under machine.slice right now on that machine).
If I write a serious cgroup memory usage program (instead of quick script hacks), I'll probably want to do something about this. If nothing else, I probably want to subtract the inactive_file number from the nominal total RAM usage. Otherwise I get a pretty misleading picture of what I'm really interested in, which is something more or less like 'what is the cgroup's RSS'.
(Given the explicit cgroup v2 indeterminism for memory accessed by multiple cgroups, such as filesystem cache, perhaps I should subtract all 'file' memory from the current memory usage.)
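Either adjustment is a small calculation on top of memory.current and memory.stat. Here is a sketch of what it might look like, again assuming cgroup v2 file names and layout; adjusted_usage and its subtract parameter are names invented for this example, not anything a real tool provides.

```python
from pathlib import Path


def parse_memory_stat(text):
    """Turn the 'key value' lines of a memory.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            stats[fields[0]] = int(fields[1])
    return stats


def adjusted_usage(current, stats, subtract="inactive_file"):
    """Nominal usage minus one memory.stat field.

    subtract="inactive_file" drops only inactive filesystem cache;
    subtract="file" drops all filesystem cache.
    """
    # Clamp at zero: the two files are read at different moments,
    # so the field can briefly exceed the total.
    return max(0, current - stats.get(subtract, 0))


def cgroup_adjusted_usage(cgroup_dir, subtract="inactive_file"):
    """Read a cgroup directory (e.g. /sys/fs/cgroup/machine.slice) and adjust."""
    cgroup_dir = Path(cgroup_dir)
    current = int((cgroup_dir / "memory.current").read_text())
    stats = parse_memory_stat((cgroup_dir / "memory.stat").read_text())
    return adjusted_usage(current, stats, subtract)
```

Even with all 'file' memory subtracted, the result still includes kernel memory such as slab, so it is not quite RSS. The 'anon' field in memory.stat may be a closer match if RSS is what you are after.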