Notes on cgroups and systemd's interaction with them as of Ubuntu 16.04
I wrote recently on putting temporary CPU and memory limits on a user, using cgroups and systemd's features to fiddle around with them on Ubuntu 16.04. In the process I wound up confused about various aspects of how things work today. Since then I've done a bit of digging and I want to write down what I've learned before I forget it again.
The overall cgroup experience is currently a bit confusing on Linux
because there are now two versions of cgroups, the original ('v1')
and the new version ('v2'). The kernel people consider v1 cgroups
to be obsolete and I believe that the systemd people do as well,
but in practice Ubuntu 16.04 (and even Fedora 25) use cgroup v1,
not v2. You can find out which cgroup version your system is using by
looking at /proc/mounts
to see what sort of cgroup filesystem(s) are
mounted. With cgroup v1, you'll see multiple mounts in /sys/fs/cgroup
with filesystem type cgroup
and various cgroup controllers specified
as mount options, eg:
[...]
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,[...],cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,[...],pids 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,[...],net_cls,net_prio 0 0
[...]
According to the current kernel v2 documentation, with v2 cgroups
you would have a single mount with the filesystem type cgroup2.
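One quick way to see this for yourself is to pull the cgroup mounts
out of /proc/mounts by filesystem type; this is only a small sketch,
and simply reading /proc/mounts by eye works just as well:
# Print all cgroup v1 and cgroup v2 mounts, going by the
# filesystem type in the third field of /proc/mounts:
awk '$3 == "cgroup" || $3 == "cgroup2"' /proc/mounts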
The current systemd.resource-control
manpage
discusses the systemd differences between v1 and v2 cgroups,
and in the process mentions that v2 cgroups are incomplete
because the kernel people can't agree on how to implement
bits of them.
In my first entry, I wondered in an
aside how you could tell if per-user fair share scheduling was on.
The answer is that it depends on how processes are organized into
cgroup hierarchies. You can see this for a particular process by
looking at /proc/<pid>/cgroup:
11:devices:/user.slice
10:memory:/user.slice/user-915.slice
9:pids:/user.slice/user-915.slice
8:hugetlb:/
7:blkio:/user.slice/user-915.slice
6:perf_event:/
5:freezer:/
4:cpu,cpuacct:/user.slice/user-915.slice
3:net_cls,net_prio:/
2:cpuset:/
1:name=systemd:/user.slice/user-915.slice/session-c188763.scope
What this means is documented in the cgroups(7)
manpage. The important
thing for us is the interaction between the second field (the
controller) and the path in the third field. Here we see that for
the CPU time controller (cpu,cpuacct), my process is under my
user-NNN.slice slice, not just systemd's overall user.slice.
That means that I'm subject to per-user fair share scheduling on
this system. On another system, the result is:
[...]
5:cpu,cpuacct:/user.slice
[...]
Here I'm not subject to per-user fair share scheduling, because
I'm only under user.slice
and I'm thus not separated out from
processes that other users are running.
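A quick way to check this for your own shell is to pull the CPU
controller line out of its /proc/<pid>/cgroup entry; as a small
sketch, something like this works:
# Show the CPU controller cgroup for the current shell; a path
# with a user-NNN.slice component means per-user fair share
# scheduling applies to it:
grep ':cpu,cpuacct:' /proc/$$/cgroup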
You can somewhat estimate the overall state of things by looking
at what's in the /sys/fs/cgroup/cpu,cpuacct/user.slice
directory.
If there are a whole bunch of user-NNN.slice
directories, processes
of those users are at least potentially subject to fair share
scheduling. If there aren't, processes from a user definitely aren't.
Similar things apply to other controllers, such as memory.
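As a small sketch of this check (the path here is the cgroup v1
layout on these systems):
# List the per-user slices currently present under the CPU
# controller; no output means no user is split out there:
ls -d /sys/fs/cgroup/cpu,cpuacct/user.slice/user-*.slice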
(The presence of a user-915.slice
subdirectory doesn't mean that
all of my processes are subject to fair share scheduling, but it
does mean that some of them are. On the system I'm taking this
/proc/self/cgroup
output from, there are a number of people's
processes that are only in user.slice
in the CPU controller; these
processes would not be subject to per-user fair share scheduling,
even though other processes of the same user would be.)
If you want a full overview of how everything is structured for a
particular cgroup controller, you can use systemd-cgls
to see
this information all accumulated in one spot. You have to ask for
a particular controller specifically, for example 'systemd-cgls
/sys/fs/cgroup/cpu,cpuacct
', and obviously it's only really useful
if there actually is a hierarchy (ie, there are some subdirectories
under the controller's user.slice
directory). Unfortunately, as
far as I know there's no way to get systemd-cgls
to tell you the
user of a particular process if it hasn't already been put under a
user-NNN.slice
slice; you'll have to grab the PID and then use
another tool like ps.
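As a sketch of doing this (the PID is a made-up example):
# Show the whole hierarchy for the CPU controller:
systemd-cgls /sys/fs/cgroup/cpu,cpuacct
# Then look up the owner of a process that systemd-cgls showed
# directly under user.slice (12345 is a hypothetical PID):
ps -o user,pid,args -p 12345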
For setting temporary systemd resource limits on slices, it's
important to know that systemd completely removes those user-NNN.slice
slices when a user logs out from all of their sessions, and as part
of this forgets about your temporary resource limit settings (as
far as I know). This may make them more temporary than you expected.
I'm not sure if trying to set persistent resource limits with
'systemctl set-property user-NNN.slice ...
' actually works; my
results have been inconsistent, and since this doesn't work on
user.slice
I suspect it doesn't work
here either.
(As far as I can tell, temporary limits created with 'systemctl
--runtime set-property
' work in part by writing files to
/run/systemd/system/user-NNN.slice.d
. When a user fully logs out
and their user-NNN.slice
is removed, systemd appears to delete
the corresponding /run
directory, thereby tossing out your temporary
limits.)
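As a concrete sketch of all this (the slice name and the CPUQuota
value are purely illustrative; see systemd.resource-control for the
full list of properties):
# Impose a temporary, runtime-only CPU limit on one user's slice:
systemctl --runtime set-property user-915.slice CPUQuota=50%
# The limit lives in a drop-in under /run, which disappears when
# the slice itself is removed:
ls /run/systemd/system/user-915.slice.d/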
Although you can ask systemd what it thinks the resource limits
imposed on a slice are (with 'systemctl show ...
'), the ultimate
authority is the cgroup control files in
/sys/fs/cgroup/<controller>/<path>
. If in doubt, I would look
there; the systemd.resource-control
manpage will tell you
what cgroup attribute is used for which systemd resource limit. Of
course you need to make sure that the actual runaway process you
want to be limited has actually been placed in the right spot in
the hierarchy of the relevant cgroup controller, by checking
/proc/<pid>/cgroup.
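As an illustrative sketch of this cross-checking (the slice name
and the PID are made up; cpu.shares is the cgroup v1 attribute
behind systemd's CPUShares setting):
# What systemd thinks the CPU shares limit on the slice is:
systemctl show -p CPUShares user-915.slice
# What the kernel actually has for that cgroup:
cat /sys/fs/cgroup/cpu,cpuacct/user.slice/user-915.slice/cpu.shares
# And whether the process you care about is really in that cgroup:
grep ':cpu,cpuacct:' /proc/12345/cgroup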
(Yes, this whole thing is a complicated mess. Slogging through it
all has at least given me a better idea of what's going on and how
to inspect it, though. For example, until I started writing this
entry I hadn't spotted that systemd-cgls
could show you a specific
cgroup controller's hierarchy.)