Some practical notes on the systemd cgroups/units hierarchies

December 28, 2022

Systemd organizes everything on your system into a hierarchy of cgroups or, if you prefer, a hierarchy of units that happen to be implemented with cgroups. However, what this hierarchy is (or is going to be) isn't always obvious, and sometimes what shows up matters, for example because you're generating per-cgroup metrics and might hit a cardinality explosion. So here are some notes on things you may see in, for example, systemd-cgls or 'systemctl status', or when you're writing something to dump cgroup memory usage.

At the top level, systemd has a -.slice (the root slice or cgroup). Underneath that are up to three slices: user.slice, for all user sessions, system.slice, for all system services, and machine.slice, for your virtual machines that are started in ways that systemd knows about (for example, libvirt). You'll probably always have a system.slice and usually a user.slice if you're looking at a machine, but many of your machines may not have a machine.slice. There's also an init.scope, which has PID 1 in it, and possibly some essentially empty .mount cgroups that systemd-cgls won't bother showing you.

In a virtualization environment using libvirt, machine.slice will have a 'machine-qemu-<N>-<name>.scope' for every virtual machine, except that everything after the 'machine-qemu' bit will be partly hex-escaped, with (for example) '\x2d' standing in for each '-'. Under each active VM are some libvirt-created cgroups under 'libvirt', which isn't a systemd unit (I'm going to skip inventorying them, since I don't feel qualified to comment). If you've started some virtual machines and then shut them all down again, 'systemd-cgls' probably won't show you machine.slice any more, but it's still there as a cgroup and may well have some amount of RAM usage still charged to it.
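As an illustration of the hex encoding, here is a minimal Python sketch that turns '\xNN' escapes in a cgroup or unit name back into plain characters. The function name unescape_unit_name is made up for this example, and it only assumes that every escape has the two-hex-digit '\xNN' form seen above.

```python
import re

# One escape is a backslash, an 'x', and exactly two hex digits (eg \x2d).
_HEX_ESCAPE = re.compile(rb"\\x([0-9a-fA-F]{2})")


def unescape_unit_name(name: str) -> str:
    """Replace each \\xNN escape in a unit or cgroup name with its byte.

    Hypothetical helper for illustration. Escapes are handled per byte and
    the result is decoded as UTF-8, so a multi-byte character that was
    escaped byte by byte is put back together.
    """
    raw = _HEX_ESCAPE.sub(
        lambda m: bytes([int(m.group(1), 16)]),
        name.encode("utf-8"),
    )
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # 'testvm' is an invented VM name.
    print(unescape_unit_name(r"machine-qemu\x2d1\x2dtestvm.scope"))
```

Names without any escapes pass through unchanged, so this should be safe to apply to every directory name you read out of /sys/fs/cgroup.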

Under user.slice, there will normally be a hierarchy for any individual user login that I'm going to present in a text diagram form (from systemd-cgls):

│ └─user-<UID>.slice
│   ├─user@<UID>.service
│   │ ├─session.slice
│   │ │ ├─dbus-broker.service
│   │ │ └─pipewire.service
│   │ └─init.scope
│   └─session-<NNN>.scope

Depending on the system setup, things may also be in an 'app.slice' and a 'background.slice' instead of a session.slice; see Desktop Environment Integration. What units you see started in the session and app slices depends on your system and how you're logging in to it (and you may be surprised by what gets started for an SSH login, even on a relatively basic server install).

(The init.scope for a user contains their systemd user instance.)

Under system.slice, you will normally see a whole succession of '<thing>.service' cgroups, one for every active systemd service. You can also see a two-level hierarchy for some things, such as templated systemd services:

  │ └─serial-getty@ttyS0.service 
  │ └─getty@tty1.service 
  │ └─postfix@-.service 

Templated systemd socket service units (with their long names) will show up (possibly very briefly) under a .slice unit for them, eg 'system-oidentd.slice'. This slice won't necessarily show in 'systemd-cgls' unless there's an active socket connection at the moment, but systemd seems to leave it there in /sys/fs/cgroup/system.slice even when it's inactive.
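Since templated units are where the instance names multiply, it can help to split a unit name into its template and its instance. What follows is a rough Python sketch, assuming the usual 'prefix@instance.suffix' shape of the names shown above; split_template is a hypothetical helper, not anything systemd provides.

```python
from typing import Optional, Tuple


def split_template(unit: str) -> Tuple[str, Optional[str]]:
    """Split 'prefix@instance.suffix' into ('prefix@.suffix', 'instance').

    Units that are not template instances come back unchanged, with None
    as the instance.
    """
    # The unit type is whatever follows the last dot.
    name, dot, suffix = unit.rpartition(".")
    if not dot:
        return unit, None
    # The instance is everything after the first '@' in the rest.
    prefix, at, instance = name.partition("@")
    if not at or not instance:
        return unit, None
    return f"{prefix}@.{suffix}", instance


if __name__ == "__main__":
    for unit in ("serial-getty@ttyS0.service", "postfix@-.service", "cron.service"):
        print(unit, "->", split_template(unit))
```

Grouping by the first element of the result collapses all instances of a template into one name, which is roughly what the per-template .slice cgroup gives you for free.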

You can also get nested system.slice cgroups for dbus services:

  │ └─dbus-:1.14-org.freedesktop.problems@0.service

Inspecting the actual cgroups in /sys/fs/cgroup may also show you <thing>.mount, <thing>.socket, and <thing>.swap cgroups. Under rare circumstances you may also see a 'system-systemd\x2dfsck.slice' cgroup with one or more .service cgroups for fscks of specific devices.

Now that I've looked at all of this, my view is that if I'm generating resource usage metrics, I want to stop one level down from the top level user and system slices in the cgroup hierarchy (which means I will get 'system-oidentd.slice' but not the individually named socket activations). This captures most everything interesting and mostly doesn't risk cardinality explosions from templated units. Virtual machines under machine.slice need extra handling for cardinality, because the '<N>' in 'machine-qemu-<N>-[...]' is a constantly incrementing sequence number; I'll need to take that out somehow.
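One possible shape for this is sketched below in Python. It is only an illustration under some assumptions: the function name metric_cgroup is invented, cgroup paths are taken relative to /sys/fs/cgroup, and the libvirt scope names are assumed to follow the 'machine-qemu-<N>-<name>.scope' pattern described earlier once their '\xNN' escapes are undone.

```python
import re

_HEX_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{2})")
# Assumed shape of an unescaped libvirt scope: machine-qemu-<N>-<name>.scope
_QEMU_SCOPE = re.compile(r"^machine-qemu-\d+-(.+)\.scope$")


def metric_cgroup(path: str) -> str:
    """Reduce a cgroup path to a lower-cardinality name for metrics.

    Hypothetical helper: keeps one level under system.slice and user.slice,
    and drops the sequence number from libvirt VM scopes in machine.slice.
    """
    parts = [p for p in path.split("/") if p]
    if not parts:
        return "-.slice"  # the root slice
    top = parts[0]
    if top in ("system.slice", "user.slice"):
        # Stop one level below the top-level slice.
        return "/".join(parts[:2])
    if top == "machine.slice" and len(parts) > 1:
        # Undo the \xNN escapes (ASCII only here), then drop the <N>.
        name = _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), parts[1])
        m = _QEMU_SCOPE.match(name)
        if m:
            name = f"machine-qemu-{m.group(1)}.scope"
        return f"{top}/{name}"
    return top


if __name__ == "__main__":
    # 'testvm' and the session number are invented examples.
    print(metric_cgroup(r"/machine.slice/machine-qemu\x2d7\x2dtestvm.scope/libvirt/emulator"))
    print(metric_cgroup("/user.slice/user-1000.slice/session-3.scope"))
```

Dropping the sequence number means that two VMs with the same name started at different times share a metric label, which is probably what you want for long-term graphs but is a choice, not a requirement.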

If I'm reporting on the fly on resource usage, it's potentially interesting to break user slices down into each session scope and then the user@<UID>.service. Being detailed under the user service runs into issues because there's so much potential variety in how processes are broken up into cgroups. I'd definitely want to be selective about what cgroups I report on so that only ones with interesting resource usage show up in the report.
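A selective on-the-fly report along these lines could be sketched as follows. This is a hedged example, not a finished tool: interesting_cgroups, min_bytes and max_depth are names invented here, and it assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup where each cgroup directory may have a 'memory.current' file holding its usage in bytes.

```python
from pathlib import Path


def interesting_cgroups(root="/sys/fs/cgroup", min_bytes=64 * 1024 * 1024, max_depth=3):
    """Return (relative path, bytes) pairs for cgroups using >= min_bytes.

    Walks at most max_depth levels below root; with the default of 3 that
    reaches session-<NNN>.scope and user@<UID>.service but goes no deeper.
    Results are sorted with the largest usage first.
    """
    root = Path(root)
    found = []

    def walk(cg, depth):
        try:
            usage = int((cg / "memory.current").read_text())
        except (OSError, ValueError):
            usage = None  # no memory accounting here, or the cgroup went away
        if usage is not None and usage >= min_bytes:
            found.append((str(cg.relative_to(root)), usage))
        if depth >= max_depth:
            return
        try:
            children = sorted(p for p in cg.iterdir() if p.is_dir())
        except OSError:
            return
        for child in children:
            walk(child, depth + 1)

    walk(root, 0)
    return sorted(found, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, usage in interesting_cgroups():
        print(f"{usage / 2**20:10.1f} MiB  {name}")
```

Note that usage is hierarchical, so a parent slice's number includes everything charged to its children; a real report would have to decide whether to show both or only the leaves.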

Sidebar: User cgroups on GNOME and perhaps KDE desktops

You may remember my bad experience with systemd-oomd, where it killed my entire desktop session. Apparently one reason for systemd-oomd's behavior is that on a modern GNOME desktop, a lot of applications are confined into separate cgroups, so if (for example) your Firefox runs away with memory, systemd-oomd will only kill its cgroup, not your entire session-<NNN>.scope cgroup. On Fedora 36, this appears to look like this:

 │ ├─app.slice
 │ │ ├─app-cgroupify.slice
 │ │ │ └─cgroupify@app-gnome-firefox-2838.scope.service
 │ │ │   └─ 2845 /usr/libexec/cgroupify app-gnome-firefox-2838.scope
 │ │ ├─app-gnome-firefox-2838.scope
 │ │ │ ├─3028
 │ │ │ │ └─ 3028 /usr/lib64/firefox/firefox -contentproc [...]
 │ │ │ ├─3024
 │ │ │ │ └─ 3024 /usr/lib64/firefox/firefox -contentproc [...]

GNOME Terminal sessions also have a complex structure:

 │ │ ├─app-org.gnome.Terminal.slice (#10028)
 │ │ │ ├─vte-spawn-04ae3315-d673-47fc-a31e-f657648a0146.scope (#10774)
 │ │ │ │ ├─ 2625 bash
 │ │ │ │ ├─ 2654 systemd-cgls
 │ │ │ │ └─ 2655 less
 │ │ │ └─gnome-terminal-server.service (#10508)
 │ │ │   └─ 2478 /usr/libexec/gnome-terminal-server

And then there's:

 │ │ ├─app-gnome\x2dsession\x2dmanager.slice (#5885)
 │ │ │ └─gnome-session-manager@gnome.service
 │ │ │   └─ 1663 /usr/libexec/gnome-session-binary [...]

So a GNOME desktop can have a lot of nested things in a session or under an app.slice.

The Fedora 36 Cinnamon desktop doesn't seem to go as far as this, with a bunch of things still running in the 'session-NNN.scope' unit, but it does seem to do some things to split Firefox and other processes off into their own systemd units and cgroups.

(Cgroupify apparently comes from the uresourced RPM package.)

Written on 28 December 2022.