2024-01-17
Some interesting metrics you can get from cgroup V2 systems
In my roundup of what Prometheus exporters we use, I mentioned that we didn't have a way of generating resource usage metrics for systemd services, which in practice means unified cgroups (cgroup v2). This raises the good question of what resource usage and performance metrics are available in cgroup v2 that one might be interested in collecting for systemd services.
You may want to know about the resource usage of systemd services (or more generally, systemd units) for a variety of reasons. Our reason is generally to find out what specifically is using up some resource on a server, and more broadly to have some information on how much of an impact a service is having. I'm also going to assume that all of the relevant cgroup resource controllers are enabled, which is increasingly the case on systemd-based systems.
In each cgroup, you get the following:
- pressure stall information for CPU,
memory, IO, and these days IRQs. This should give you a good idea of
where contention is happening for these resources.
- CPU usage information,
primarily the classical count of user, system, and total usage.
- IO statistics (if you have the right things enabled),
which are enabled on some but not all of our systems. For us, this
appears to have the drawback that it doesn't capture information
for NFS IO, only local disk IO, and it needs decoding to create
useful information (ie, information associated with a named device;
you find the mappings for this in /proc/partitions and
/proc/self/mountinfo, as in the sketch after this list).
(This might be more useful for virtual machine slices, where it will probably give you an indication of how much IO the VM is doing.)
- memory usage information, giving both a simple amount assigned to
that cgroup ('memory.current') and a relatively detailed breakdown of
how much of what sorts of memory has been assigned to the cgroup
('memory.stat'). As I've found out repeatedly, the simple number can
be misleading depending on what you really want to know, because it
includes things like inactive file cache and inactive, reclaimable
kernel slab memory. (You also get swap usage, in 'memory.swap.current',
and there's also 'memory.zswap.current'.) In a Prometheus exporter, I
might simply report all of the entries in memory.stat and sort it out
later; this would have the drawback of creating a bunch of time series,
but probably not an overwhelming number of them.
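
As an illustration of what a collector has to do with these files,
here is a minimal Python sketch that reads some of them for one
systemd service. The particular service ('cron.service' under
system.slice) and the /sys/fs/cgroup mount point are assumptions
about your setup, and the parsing is only as thorough as this example
needs:

    #!/usr/bin/env python3
    # A minimal sketch of reading cgroup v2 statistics for one systemd
    # service.  The service and the mount point are assumptions; adjust
    # them for your own setup.
    from pathlib import Path

    CGROOT = Path("/sys/fs/cgroup")
    SERVICE = CGROOT / "system.slice" / "cron.service"

    def read_kv(path):
        # Parse 'name value' files such as cpu.stat and memory.stat.
        stats = {}
        for line in path.read_text().splitlines():
            name, value = line.split(None, 1)
            stats[name] = int(value)
        return stats

    def read_pressure(path):
        # Parse PSI files such as cpu.pressure, whose lines look like
        # 'some avg10=0.00 avg60=0.00 avg300=0.00 total=12345'.
        psi = {}
        for line in path.read_text().splitlines():
            kind, *fields = line.split()
            vals = {}
            for field in fields:
                key, val = field.split("=")
                vals[key] = float(val) if key.startswith("avg") else int(val)
            psi[kind] = vals
        return psi

    def device_names():
        # Map 'major:minor' pairs to device names via /proc/partitions,
        # which has a two-line header before the actual table.
        names = {}
        for line in Path("/proc/partitions").read_text().splitlines()[2:]:
            if line.strip():
                major, minor, _blocks, name = line.split()
                names[f"{major}:{minor}"] = name
        return names

    def read_iostat(path):
        # Parse io.stat, whose lines look like
        # '8:0 rbytes=... wbytes=... rios=... wios=... dbytes=... dios=...',
        # and label each line with a device name where we can.
        devs = device_names()
        stats = {}
        for line in path.read_text().splitlines():
            if not line.strip():
                continue
            devno, *fields = line.split()
            stats[devs.get(devno, devno)] = dict(
                (k, int(v)) for k, v in (f.split("=") for f in fields))
        return stats

    if __name__ == "__main__":
        print("cpu.stat:", read_kv(SERVICE / "cpu.stat"))
        print("memory.current:", int((SERVICE / "memory.current").read_text()))
        print("memory.stat:", read_kv(SERVICE / "memory.stat"))
        for psi in ("cpu.pressure", "memory.pressure", "io.pressure"):
            print(psi + ":", read_pressure(SERVICE / psi))
        if (SERVICE / "io.stat").exists():
            print("io.stat:", read_iostat(SERVICE / "io.stat"))

A real exporter would turn these dictionaries into labelled metrics
instead of printing them, but the file formats are simple enough that
this is most of the work.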
Although a cgroup doesn't directly tell you how many processes
and threads it contains, you can read 'cgroup.procs' and
'cgroup.threads' and count how many entries they have. It's
probably worth reporting this information.
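
A sketch of doing that count, with the same assumed paths as above:

    # Count processes and threads in a cgroup by counting the entries
    # in cgroup.procs and cgroup.threads (one PID or TID per line).
    from pathlib import Path

    svc = Path("/sys/fs/cgroup/system.slice/cron.service")
    nprocs = len((svc / "cgroup.procs").read_text().split())
    nthreads = len((svc / "cgroup.threads").read_text().split())
    print(nprocs, "processes,", nthreads, "threads")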
The root cgroup has some or many of these files, depending on your
setup. Interestingly, in Fedora and Ubuntu 22.04, it seems to have
an 'io.stat' even when other cgroups don't have it, although I'm
not sure how useful this information is for the root cgroup.
Were I to write a systemd cgroup metric collector, I'd probably
only have it report on first level and second level units (so
'system.slice' and then 'cron.service' under system.slice). Going
deeper than that doesn't seem likely to be very useful in most cases
(and if you go into user.slice, you have cardinality issues). I
would probably skip 'io.stat' for the first version and leave it
until later.
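
The directory walk for such a collector is straightforward. Here is a
sketch of limiting it to the first two levels, again assuming the
usual /sys/fs/cgroup mount point:

    # Enumerate first and second level cgroups under the root of the
    # unified hierarchy; these are the units a limited collector would
    # report on.
    from pathlib import Path

    CGROOT = Path("/sys/fs/cgroup")

    def is_cgroup(p):
        # Every cgroup v2 directory has a cgroup.procs file.
        return p.is_dir() and (p / "cgroup.procs").exists()

    for first in sorted(CGROOT.iterdir()):
        if not is_cgroup(first):
            continue
        print(first.name)                 # eg system.slice, user.slice
        for second in sorted(first.iterdir()):
            if is_cgroup(second):
                print(" ", second.name)   # eg cron.service

From there you'd read the per-cgroup files as in the earlier sketch
and attach the cgroup's name as a label on the resulting metrics.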
PS: I believe that some of this information can be visualized live through systemd-cgtop. This may be useful to see if your particular set of systemd services and so on even have useful information here.