2023-12-06
Understanding another piece of per-cgroup memory usage accounting
A while back I wrote a program I call 'memdu' to report a du-like
hierarchical summary of how much memory is being used by each logged
in user and each system service, based on systemd's MemoryAccounting
setting and the general Linux cgroup (v2) memory accounting.
Cgroups expose a number of pieces of information about this, starting
with memory.current, the current amount of memory 'being used by'
the cgroup and its descendants. What 'being used by' means here is
that the kernel has attributed this memory to the cgroup, and it
counts all memory usage attributed to the cgroup, both user level
and in the kernel. As I very soon found out, this number can be misleading if
what you're really interested in is how much user level memory the
cgroup is actively using.
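As a rough illustration of where this number comes from (this is not memdu's actual code; the function names and the example path in the comment are assumptions made for the sketch), reading memory.current for a cgroup and all of its descendants can look like this:

```python
import os

def read_current(cgdir):
    """Return the memory.current value (in bytes) for one cgroup directory."""
    with open(os.path.join(cgdir, "memory.current")) as f:
        return int(f.read().strip())

def walk_usage(root):
    """Yield (cgroup directory, bytes) for root and each descendant cgroup.

    Directories without a memory.current file (such as the root cgroup)
    are skipped.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit child cgroups in a stable order
        if "memory.current" in filenames:
            yield dirpath, read_current(dirpath)

# Hypothetical usage on a systemd machine with cgroup v2:
#   for path, used in walk_usage("/sys/fs/cgroup/user.slice"):
#       print(f"{used / 2**20:10.1f} MiB  {path}")
```

Each parent's number already includes its children, so the values are not meant to be summed. A cgroup can also disappear between being listed and being read, which a real tool would need to tolerate.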
My first encounter with this was for a bunch of memory used by the kernel filesystem cache, which was attributed first to a running virtual machine and then to the general 'machine.slice' cgroup when the virtual machine was shut down and its cgroup went away. (Well, it was always attributed to machine.slice as well as the individual virtual machine, but when the virtual machine existed you could see that a lot of machine.slice's memory usage was from the child VM.)
As I recently discovered, another source of this is reclaimable
(kernel) slab memory. It's possible to have an essentially inactive
user cgroup with small process memory usage but gigabytes of memory
attributed to it from memory.stat's 'slab_reclaimable'. At
some point this slab memory was actively used, but it's now not,
and presumably it lingers around mostly because the overall system
hasn't been under enough memory pressure to trigger reclaiming it.
Having my memdu program report the memory usage of the cgroup
including this memory is in one sense honest, but it's not usually
useful and it can be alarming.
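One possible shape for such an adjustment, as a minimal sketch rather than what memdu really does: parse memory.stat (one 'name value' pair per line) and subtract the fields you have decided not to count. The function names and the default choice of discounting only slab_reclaimable are assumptions for illustration.

```python
def parse_memory_stat(text):
    """Parse memory.stat contents ('name value' per line) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        if value:
            stats[name] = int(value)
    return stats

def adjusted_usage(current, stats, discount=("slab_reclaimable",)):
    """Return memory.current minus the named memory.stat fields.

    Which fields to discount is a heuristic chosen by the caller; the
    result is clamped so it never goes below zero.
    """
    lingering = sum(stats.get(name, 0) for name in discount)
    return max(current - lingering, 0)
```

With this, a mostly idle cgroup whose memory.current is dominated by reclaimable slab would be reported at something much closer to its process memory usage.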
(According to the documentation,
you can manually trigger a kernel reclaim against the cgroup by
writing an amount to 'memory.reclaim'. But if there's no general
memory pressure, I think the only reason to do this is aesthetics.)
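For completeness, here is a small sketch of triggering that reclaim from Python. The function name is hypothetical, the write needs sufficient privileges on the cgroup, and it assumes the kernel in use provides memory.reclaim at all.

```python
import os

def request_reclaim(cgdir, nbytes):
    """Ask the kernel to try to reclaim nbytes from the cgroup at cgdir.

    The amount is written as a plain byte count. If the kernel rejects
    the request or cannot reclaim what was asked for, the write fails
    and Python raises OSError.
    """
    with open(os.path.join(cgdir, "memory.reclaim"), "w") as f:
        f.write(str(nbytes))
```

Comparing slab_reclaimable in memory.stat before and after the write is one way to see whether the reclaim made any difference.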
If I knew enough about the kernel memory systems in practice, I could probably read through the documentation about the cgroup memory.stat file and work out what things I wanted to remove from memory.current to get more or less 'current directly and indirectly used user memory'. As it is, I don't have that knowledge, so I suspect that I'm going to find more cases like this over time.
(How I find these is that someday I run my memdu program and it reports an absurd looking number for some cgroup, so I investigate and then fix it up with more heuristics. These days the program is in Python so it's pretty easy to add another case.)
I suspect that one of the general issues I'm running into is that what I want from my 'memdu' program isn't well specified and may not be something that the kernel can really give me. The question of how much memory a cgroup is using depends on what I mean by 'using' and what sort of memory I care about. The kernel is only really set up to tell me how much memory has been attributed to a cgroup, and where it is in potentially overlapping categories in memory.stat.
(I assume that memory.stat is comprehensive, so all memory in
memory.current is accounted for somewhere in memory.stat, but
I'm not sure of that.)