Some interesting metrics you can get from cgroup V2 systems

January 17, 2024

In my roundup of what Prometheus exporters we use, I mentioned that we didn't have a way of generating resource usage metrics for systemd services, which in practice means unified cgroups (cgroup v2). This raises the good question of what resource usage and performance metrics are available in cgroup v2 that one might be interested in collecting for systemd services.

You can want to know about resource usage of systemd services (or more generally, systemd units) for a variety of reasons. Our reason is generally to find out what specifically is using up some resource on a server, and more broadly to have some information on how much of an impact a service is having. I'm also going to assume that all of the relevant cgroup resource controllers are enabled, which is increasingly the case on systemd based systems.

In each cgroup, you get the following:

  • pressure stall information for CPU, memory, IO, and these days IRQs (in 'cpu.pressure', 'memory.pressure', 'io.pressure', and 'irq.pressure'). This should give you a good idea of where contention is happening for these resources.

  • CPU usage information (in 'cpu.stat'), primarily the classical count of user, system, and total usage.

  • IO statistics ('io.stat', if you have the right things enabled), which are present on some but not all of our systems. For us this appears to have the drawback that it doesn't capture information for NFS IO, only local disk IO, and it needs decoding to produce useful information (ie, information associated with a named device; you find out the device number to name mappings from /proc/partitions and /proc/self/mountinfo).

    (This might be more useful for virtual machine slices, where it will probably give you an indication of how much IO the VM is doing.)

  • memory usage information, giving both a simple amount assigned to that cgroup ('memory.current') and a relatively detailed breakdown of how much of what sorts of memory has been assigned to the cgroup ('memory.stat'). As I've found out repeatedly, the simple number can be misleading depending on what you really want to know, because it includes things like inactive file cache and inactive, reclaimable kernel slab memory.

    (You also get swap usage, in 'memory.swap.current', and there's also 'memory.zswap.current'.)

    In a Prometheus exporter, I might simply report all of the entries in memory.stat and sort it out later. This would have the drawback of creating a bunch of time series, but it's probably not an overwhelming number of them. (There's a sketch of reading these per-cgroup files just after this list.)
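
As a concrete illustration, here's a rough sketch (in Python) of reading the per-cgroup files mentioned above for a single cgroup. The 'system.slice/cron.service' path is just a hypothetical example, and this assumes the relevant controllers are enabled so the files exist:

  # Sketch: read PSI, CPU, and memory information for one cgroup v2 cgroup.
  import os.path

  CGROOT = "/sys/fs/cgroup"

  def read_kv(path):
      # Parse 'name value' lines (cpu.stat, memory.stat) into a dict of ints.
      d = {}
      with open(path) as f:
          for line in f:
              name, val = line.split()
              d[name] = int(val)
      return d

  def read_pressure(path):
      # Parse a PSI file: 'some avg10=... avg60=... avg300=... total=...'
      # (and possibly a 'full' line as well).
      d = {}
      with open(path) as f:
          for line in f:
              fields = line.split()
              kind = fields[0]
              for kv in fields[1:]:
                  k, v = kv.split("=")
                  d[(kind, k)] = float(v)
      return d

  def cgroup_metrics(cg):
      cgdir = os.path.join(CGROOT, cg)
      metrics = {}
      # Pressure stall information; irq.pressure only exists on newer kernels.
      for res in ("cpu", "memory", "io", "irq"):
          p = os.path.join(cgdir, res + ".pressure")
          if os.path.exists(p):
              metrics[res + "_pressure"] = read_pressure(p)
      # CPU usage in microseconds (usage_usec, user_usec, system_usec).
      metrics["cpu"] = read_kv(os.path.join(cgdir, "cpu.stat"))
      # Total memory charged to the cgroup, plus the detailed breakdown.
      with open(os.path.join(cgdir, "memory.current")) as f:
          metrics["memory_current"] = int(f.read())
      metrics["memory"] = read_kv(os.path.join(cgdir, "memory.stat"))
      return metrics

  if __name__ == "__main__":
      import pprint
      # Hypothetical example unit; any .service or .slice cgroup works.
      pprint.pprint(cgroup_metrics("system.slice/cron.service"))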

Although the cgroup doesn't directly tell you how many processes and threads it contains, you can read 'cgroup.procs' and 'cgroup.threads' to count how many entries they have. It's probably worth reporting this information.
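
Counting them is just counting non-empty lines; a minimal sketch, with a hypothetical cgroup path:

  import os.path

  def count_entries(cgdir, fname):
      # Each non-empty line is one PID (cgroup.procs) or TID (cgroup.threads).
      with open(os.path.join(cgdir, fname)) as f:
          return sum(1 for line in f if line.strip())

  cg = "/sys/fs/cgroup/system.slice/cron.service"   # hypothetical example
  nprocs = count_entries(cg, "cgroup.procs")
  nthreads = count_entries(cg, "cgroup.threads")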

The root cgroup has some or many of these files, depending on your setup. Interestingly, in Fedora and Ubuntu 22.04, it seems to have an 'io.stat' even when other cgroups don't have it, although I'm not sure how useful this information is for the root cgroup.
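
If you do report 'io.stat' from some cgroup, the decoding I mentioned might look something like the following sketch, which maps the major:minor device numbers in io.stat to names using /proc/partitions (mapping devices back to filesystems through /proc/self/mountinfo is left out here):

  import os.path

  def partition_names():
      # Map 'major:minor' -> device name from /proc/partitions.
      names = {}
      with open("/proc/partitions") as f:
          for line in f:
              fields = line.split()
              # Skip the header line and the blank line after it.
              if len(fields) != 4 or fields[0] == "major":
                  continue
              major, minor, _blocks, name = fields
              names["%s:%s" % (major, minor)] = name
      return names

  def read_io_stat(cgdir):
      # io.stat lines look like '8:0 rbytes=... wbytes=... rios=... wios=...'.
      devnames = partition_names()
      stats = {}
      with open(os.path.join(cgdir, "io.stat")) as f:
          for line in f:
              fields = line.split()
              if not fields:
                  continue
              dev = devnames.get(fields[0], fields[0])
              stats[dev] = {k: int(v) for k, v in
                            (kv.split("=") for kv in fields[1:])}
      return stats

  # The root cgroup, if it has an io.stat at all on your system.
  print(read_io_stat("/sys/fs/cgroup"))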

Were I to write a systemd cgroup metric collector, I'd probably only have it report on first level and second level units (so 'system.slice' and then 'cron.service' under system.slice). Going deeper than that doesn't seem likely to be very useful in most cases (and if you go into user.slice, you have cardinality issues). I would probably skip 'io.stat' for the first version and leave it until later.
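
For what it's worth, walking just those two levels is straightforward; a sketch of the sort of enumeration I mean (it only looks at directory names under /sys/fs/cgroup, nothing systemd-specific):

  import os

  CGROOT = "/sys/fs/cgroup"

  def units_to_report():
      # Yield first level cgroups (eg 'system.slice') and the cgroups
      # directly inside them (eg 'system.slice/cron.service'), nothing deeper.
      for top in sorted(os.listdir(CGROOT)):
          topdir = os.path.join(CGROOT, top)
          if not os.path.isdir(topdir):
              continue
          yield top
          for sub in sorted(os.listdir(topdir)):
              if os.path.isdir(os.path.join(topdir, sub)):
                  yield os.path.join(top, sub)

  for cg in units_to_report():
      print(cg)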

PS: I believe that some of this information can be visualized live through systemd-cgtop. This may be useful to see if your particular set of systemd services and so on even have useful information here.
