Wandering Thoughts archives

2023-12-06

Understanding another piece of per-cgroup memory usage accounting

A while back I wrote a program I call 'memdu' to report a du-like hierarchical summary of how much memory is being used by each logged-in user and each system service, based on systemd's MemoryAccounting setting and the general Linux cgroup (v2) memory accounting. Cgroups expose a number of pieces of information about this, starting with memory.current, the current amount of memory 'being used by' the cgroup and its descendants. What 'being used by' means here is that the kernel has attributed this memory to the cgroup, and the number counts all memory usage attributed to the cgroup, both user-level and in the kernel. As I very soon found out, this number can be misleading if what you're really interested in is how much user-level memory the cgroup is actively using.
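As a rough illustration of the raw inputs involved (this is a sketch, not memdu's actual code), the following reads memory.current and memory.stat for one cgroup. It assumes the cgroup v2 hierarchy is mounted at /sys/fs/cgroup and that memory accounting is enabled for the cgroup; the function names are made up for the example.

```python
from pathlib import Path

# Assumption: the usual cgroup v2 mount point.
CGROUP_ROOT = Path("/sys/fs/cgroup")


def parse_flat_keyed(text):
    """Parse 'key value' lines (the memory.stat format) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def read_cgroup_memory(cgroup, root=CGROUP_ROOT):
    """Return (memory.current, memory.stat as a dict) for a cgroup.

    'cgroup' is a path relative to the cgroup root, e.g. 'machine.slice'.
    Both values are in bytes and include all descendant cgroups.
    """
    cgdir = Path(root) / cgroup
    current = int((cgdir / "memory.current").read_text().strip())
    stat = parse_flat_keyed((cgdir / "memory.stat").read_text())
    return current, stat
```

On a systemd machine this could be called as read_cgroup_memory("machine.slice"), or with the slice of a particular user.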

My first encounter with this was for a bunch of memory used by the kernel filesystem cache, which was attributed first to a running virtual machine and then to the general 'machine.slice' cgroup when the virtual machine was shut down and its cgroup went away. (Well, it was always attributed to machine.slice as well as the individual virtual machine, but when the virtual machine existed you could see that a lot of machine.slice's memory usage was from the child VM.)
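One way to see this sort of leftover attribution is to compare a parent cgroup's memory.current with the sum of its children's. The sketch below does that under the assumption that each child directory with a memory.current file is a live child cgroup; the function names are hypothetical. Since the files are read at slightly different moments, the result is only approximate and can even come out slightly negative on a busy system.

```python
from pathlib import Path


def read_current(cgdir):
    """Read memory.current (bytes) from a cgroup directory."""
    return int((Path(cgdir) / "memory.current").read_text().strip())


def usage_not_in_children(cgdir):
    """Memory attributed to this cgroup that no existing child accounts for.

    This covers memory charged directly to the cgroup, plus memory that
    was charged to children that have since gone away.
    """
    cgdir = Path(cgdir)
    total = read_current(cgdir)
    in_children = 0
    for child in cgdir.iterdir():
        # Children without the memory controller have no memory.current.
        if child.is_dir() and (child / "memory.current").exists():
            in_children += read_current(child)
    return total - in_children
```

A large result for something like /sys/fs/cgroup/machine.slice after all virtual machines have shut down would match the situation described above.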

As I recently discovered, another source of this is reclaimable (kernel) slab memory. It's possible to have an essentially inactive user cgroup with small process memory usage but gigabytes of memory attributed to it, as reported in memory.stat's 'slab_reclaimable'. At some point this slab memory was actively used, but it no longer is, and presumably it lingers around mostly because the overall system hasn't been under enough memory pressure to trigger reclaiming it. Having my memdu program report the memory usage of the cgroup including this memory is in one sense honest, but it's not usually useful and it can be alarming.
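The kind of adjustment this suggests can be sketched as a small function that subtracts selected memory.stat fields from memory.current. This is an illustration rather than memdu's real logic; the function name is invented, and which fields belong in the discount list is a judgment call (only slab_reclaimable comes from the text above).

```python
def adjusted_usage(current, stat, discount=("slab_reclaimable",)):
    """Return memory.current minus the memory.stat fields we choose to ignore.

    'current' is memory.current in bytes and 'stat' is a dict of
    memory.stat fields. Fields missing from 'stat' count as zero, and the
    result is clamped at zero because the two files are not read atomically.
    """
    ignored = sum(stat.get(key, 0) for key in discount)
    return max(current - ignored, 0)


GIB = 1024 ** 3
# A mostly idle cgroup: 3 GiB attributed, 2 GiB of it reclaimable slab.
print(adjusted_usage(3 * GIB, {"slab_reclaimable": 2 * GIB}) // GIB, "GiB")
```

Passing a longer discount tuple is how further heuristics would be added without changing the function.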

(According to the documentation, you can manually trigger a kernel reclaim against the cgroup by writing an amount to 'memory.reclaim'. But if there's no general memory pressure, I think the only reason to do this is aesthetics.)
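For completeness, here is a minimal sketch of triggering such a reclaim from Python. It assumes a kernel new enough to have memory.reclaim and sufficient privileges to write to it; the function name is hypothetical. As I understand the kernel documentation, the write fails with EAGAIN when less than the requested amount could be reclaimed, so that case is reported as False instead of raising.

```python
import errno
from pathlib import Path


def request_reclaim(cgdir, nbytes):
    """Ask the kernel to reclaim up to nbytes from the cgroup at cgdir.

    Returns True if the write succeeded and False if the kernel reported
    EAGAIN (it reclaimed less than was asked for). Other errors, such as
    missing permissions or no memory.reclaim file support, are raised.
    """
    try:
        with open(Path(cgdir) / "memory.reclaim", "w") as f:
            f.write(str(nbytes))
    except OSError as e:
        if e.errno == errno.EAGAIN:
            return False
        raise
    return True
```

A partial reclaim is not really a failure here, since the goal is only to shrink the reported number.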

If I knew enough about the kernel memory systems in practice, I could probably read through the documentation about the cgroup memory.stat file and work out what things I wanted to remove from memory.current to get more or less 'current directly and indirectly used user memory'. As it is, I don't have that knowledge so I suspect that I'm going to find more cases like this over time.

(How I find these is that someday I run my memdu program and it reports an absurd looking number for some cgroup, so I investigate and then fix it up with more heuristics. These days the program is in Python so it's pretty easy to add another case.)

I suspect that one of the general issues I'm running into is that what I want from my 'memdu' program isn't well specified and may not be something that the kernel can really give me. The question of how much memory a cgroup is using depends on what I mean by 'using' and what sort of memory I care about. The kernel is only really set up to tell me how much memory has been attributed to a cgroup, and how that memory breaks down into the potentially overlapping categories in memory.stat.

(I assume that memory.stat is comprehensive, so all memory in memory.current is accounted for somewhere in memory.stat, but I'm not sure of that.)
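One crude way to test that assumption is to add up a set of memory.stat fields believed not to overlap and compare the total with memory.current. The sketch below does this; the default field list (anon, file, kernel, sock) is purely an assumption about which fields are disjoint top-level categories, the 'kernel' field only exists on newer kernels, and the function name is made up.

```python
def unaccounted(current, stat, top_level=("anon", "file", "kernel", "sock")):
    """Return memory.current minus the sum of the chosen memory.stat fields.

    'top_level' is an ASSUMED set of non-overlapping categories; adjust it
    if it turns out to be wrong. A result near zero suggests those fields
    cover memory.current; a large positive result suggests something is
    missing, and a large negative one suggests the fields overlap.
    """
    accounted = sum(stat.get(key, 0) for key in top_level)
    return current - accounted
```

Because the two files are read at different moments, small nonzero results are expected even if the categories are exactly right.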

linux/CgroupsMemoryUsageAccountingII written at 23:28:05;


