Notes on cgroups and systemd's interaction with them as of Ubuntu 16.04

August 12, 2017

I wrote recently on putting temporary CPU and memory limits on a user, using cgroups and systemd's features to fiddle around with them on Ubuntu 16.04. In the process I wound up confused about various aspects of how things work today. Since then I've done a bit of digging and I want to write down what I've learned before I forget it again.

The overall cgroup experience on Linux is currently a bit confusing because there are now two versions of cgroups, the original ('v1') and the new version ('v2'). The kernel people consider v1 cgroups to be obsolete and I believe that the systemd people do as well, but in practice Ubuntu 16.04 (and even Fedora 25) use cgroup v1, not v2. You can find out which cgroup version your system is using by looking at /proc/mounts to see what sort of cgroup filesystems are mounted. With cgroup v1, you'll see multiple mounts under /sys/fs/cgroup with filesystem type cgroup and various cgroup controllers specified as mount options, eg:

cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,[...],cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,[...],pids 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,[...],net_cls,net_prio 0 0

According to the current kernel v2 documentation, a system using v2 cgroups would instead have a single mount with the filesystem type cgroup2. The current systemd.resource-control manpage discusses the systemd differences between v1 and v2 cgroups, and in the process mentions that v2 cgroups are incomplete because the kernel people can't agree on how to implement bits of them.
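If you'd rather not eyeball /proc/mounts, the check can be sketched in a few lines of Python. This is a minimal sketch, not anything systemd or the kernel provides; the function name cgroup_versions is my own invention, and it decides purely on the filesystem type field as described above.

```python
def cgroup_versions(mounts_text):
    """Return the set of cgroup versions ('v1', 'v2') seen in /proc/mounts text."""
    versions = set()
    for line in mounts_text.splitlines():
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        fields = line.split()
        if len(fields) < 3:
            continue
        fstype = fields[2]
        if fstype == "cgroup":
            versions.add("v1")
        elif fstype == "cgroup2":
            versions.add("v2")
    return versions


# Sample text modelled on the mount lines quoted above (options elided there too).
SAMPLE_MOUNTS = """\
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,[...],cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,[...],pids 0 0
"""

print(sorted(cgroup_versions(SAMPLE_MOUNTS)))
```

On a real machine you would pass in open("/proc/mounts").read() instead of the sample text. The function returns a set instead of a single answer because nothing in this check rules out both filesystem types showing up at once.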

In my first entry, I wondered in an aside how you could tell if per-user fair share scheduling was on. The answer is that it depends on how processes are organized into cgroup hierarchies. You can see this for a particular process by looking at /proc/<pid>/cgroup:


What this means is documented in the cgroups(7) manpage. The important thing for us is the interaction between the second field (the controller) and the path in the third field. Here we see that for the CPU time controller (cpu,cpuacct), my process is under my user-NNN.slice slice, not just systemd's overall user.slice. That means that I'm subject to per-user fair share scheduling on this system. On another system, the result is:


Here I'm not subject to per-user fair share scheduling, because I'm only under user.slice and I'm thus not separated out from processes that other users are running.
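The same check can be done mechanically. What follows is a rough sketch that assumes the hierarchy-ID:controller-list:path line format that cgroups(7) documents for /proc/<pid>/cgroup on v1. The function names are hypothetical and the sample lines (with UID 1000) are made up for illustration; they are not the output from my systems.

```python
import re

# Matches a per-user slice directly under user.slice in a cgroup path.
USER_SLICE_RE = re.compile(r"/user\.slice/user-(\d+)\.slice(/|$)")


def cgroup_path(cgroup_text, controller):
    """Return the path for a v1 controller from /proc/<pid>/cgroup text, or None."""
    for line in cgroup_text.splitlines():
        # Each line is hierarchy-ID:controller-list:path
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        if controller in parts[1].split(","):
            return parts[2]
    return None


def per_user_uid(cgroup_text, controller="cpu"):
    """Return the UID (as a string) of the user-NNN.slice the process is in
    for this controller, or None if it is not in a per-user slice."""
    path = cgroup_path(cgroup_text, controller)
    if path is None:
        return None
    match = USER_SLICE_RE.search(path)
    return match.group(1) if match else None


# Made-up examples of the two situations described above.
IN_USER_SLICE = "4:cpu,cpuacct:/user.slice/user-1000.slice\n"
ONLY_USER_SLICE = "4:cpu,cpuacct:/user.slice\n"

print(per_user_uid(IN_USER_SLICE))
print(per_user_uid(ONLY_USER_SLICE))
```

For a live process you would feed in the contents of /proc/<pid>/cgroup, and you can pass controller="memory" to ask the same question about the memory controller.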

You can get a rough idea of the overall state of things by looking at what's in the /sys/fs/cgroup/cpu,cpuacct/user.slice directory. If there are a whole bunch of user-NNN.slice directories, processes of those users are at least potentially subject to fair share scheduling. If there aren't any, no user's processes are. Similar things apply to other controllers, such as memory.

(The presence of a user-915.slice subdirectory doesn't mean that all of my processes are subject to fair share scheduling, but it does mean that some of them are. On the system I'm taking this /proc/self/cgroup output from, there are a number of people's processes that are only in user.slice in the CPU controller; these processes would not be subject to per-user fair share scheduling, even though other processes of the same user would be.)
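Here is a small sketch of that directory inspection, again in Python and again with function names that are purely my own. It assumes the cgroup v1 layout described above, plus the standard v1 cgroup.procs file, which lists the PIDs that sit directly in a cgroup and not in one of its children.

```python
import os
import re

SLICE_DIR_RE = re.compile(r"^user-(\d+)\.slice$")
DEFAULT_DIR = "/sys/fs/cgroup/cpu,cpuacct/user.slice"


def user_slices(user_slice_dir=DEFAULT_DIR):
    """Return the sorted UIDs that have a user-NNN.slice directory here."""
    try:
        names = os.listdir(user_slice_dir)
    except OSError:
        return []
    uids = []
    for name in names:
        match = SLICE_DIR_RE.match(name)
        if match and os.path.isdir(os.path.join(user_slice_dir, name)):
            uids.append(int(match.group(1)))
    return sorted(uids)


def direct_pids(cgroup_dir=DEFAULT_DIR):
    """Return the PIDs that are directly in this cgroup, not in a child of it."""
    try:
        with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
            return [int(line) for line in f if line.strip()]
    except OSError:
        return []


if __name__ == "__main__":
    print("per-user slices for UIDs:", user_slices())
    print("PIDs only in user.slice:", direct_pids())
```

A non-empty result from direct_pids() corresponds to the situation in the aside above: processes that are only in user.slice for this controller, and so are outside per-user fair share scheduling even if their owner has a user-NNN.slice directory.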

If you want a full overview of how everything is structured for a particular cgroup controller, you can use systemd-cgls to see this information all accumulated in one spot. You have to ask for a particular controller specifically, for example 'systemd-cgls /sys/fs/cgroup/cpu,cpuacct', and obviously it's only really useful if there actually is a hierarchy (ie, there are some subdirectories under the controller's user.slice directory). Unfortunately, as far as I know there's no way to get systemd-cgls to tell you the user of a particular process if it hasn't already been put under a user-NNN.slice slice; you'll have to grab the PID and then use another tool like ps.

For setting temporary systemd resource limits on slices, it's important to know that systemd completely removes those user-NNN.slice slices when a user logs out from all of their sessions, and as part of this forgets about your temporary resource limit settings (as far as I know). This may make them more temporary than you expected. I'm not sure if trying to set persistent resource limits with 'systemctl set-property user-NNN.slice ...' actually works; my results have been inconsistent, and since this doesn't work on user.slice I suspect it doesn't work here either.

(As far as I can tell, temporary limits created with 'systemctl --runtime set-property' work in part by writing files to /run/systemd/system/user-NNN.slice.d. When a user fully logs out and their user-NNN.slice is removed, systemd appears to delete the corresponding /run directory, thereby tossing out your temporary limits.)
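To see whether any runtime settings are still around for a user, you can look in that directory. This is a minimal sketch that only assumes the /run/systemd/system/user-NNN.slice.d path mentioned above; the function name is hypothetical, and I make no assumptions about what the files inside are called.

```python
import os


def runtime_dropins(uid, root="/run/systemd/system"):
    """Return the sorted file names in the runtime drop-in directory for
    user-<uid>.slice, or an empty list if the directory is not there."""
    dropin_dir = os.path.join(root, "user-%d.slice.d" % uid)
    try:
        return sorted(os.listdir(dropin_dir))
    except OSError:
        return []


if __name__ == "__main__":
    # UID 1000 is just an example value.
    print(runtime_dropins(1000))
```

An empty list after the user has fully logged out would be consistent with systemd having thrown away the temporary limits.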

Although you can ask systemd what it thinks the resource limits imposed on a slice are (with 'systemctl show ...'), the ultimate authority is the cgroup control files in /sys/fs/cgroup/<controller>/<path>. If in doubt, I would look there; the systemd.resource-control manpage will tell you what cgroup attribute is used for which systemd resource limit. Of course you also need to make sure that the runaway process you want to limit has actually been placed in the right spot in the hierarchy of the relevant cgroup controller, by checking /proc/<pid>/cgroup.
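Reading a control file directly is simple enough to sketch. The helper below is hypothetical and just builds the /sys/fs/cgroup/<controller>/<path>/<attribute> file name. The attribute names in the usage lines (cpu.shares and memory.limit_in_bytes) are the usual cgroup v1 ones for CPU shares and the memory limit, but check the manpage for the setting you actually used; UID 1000 is only an example.

```python
import os


def read_cgroup_value(controller, cgroup_path, attribute, root="/sys/fs/cgroup"):
    """Return the stripped contents of a cgroup v1 control file, or None
    if the file does not exist or cannot be read."""
    path = os.path.join(root, controller, cgroup_path.lstrip("/"), attribute)
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


if __name__ == "__main__":
    slice_path = "/user.slice/user-1000.slice"
    print(read_cgroup_value("cpu,cpuacct", slice_path, "cpu.shares"))
    print(read_cgroup_value("memory", slice_path, "memory.limit_in_bytes"))
```

A None result means the cgroup directory or the control file isn't there, which usually tells you that the slice doesn't exist in that controller's hierarchy at the moment.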

(Yes, this whole thing is a complicated mess. Slogging through it all has at least given me a better idea of what's going on and how to inspect it, though. For example, until I started writing this entry I hadn't spotted that systemd-cgls could show you a specific cgroup controller's hierarchy.)

Last modified: Sat Aug 12 00:04:29 2017