2022-06-01
Setting up Linux fair share CPU scheduling with systemd and cgroup v2
These days, modern versions of systemd on modern Linuxes, including the recently released Ubuntu 22.04, are using unified cgroups (cgroup v2). How to enable fair share CPU scheduling in this environment is different than how it used to work with systemd using cgroup v1. How this works currently on Ubuntu 22.04 with systemd 249 is sufficiently 'clever' that it may well change in the future.
In cgroup v2, fair share CPU scheduling for a cgroup is enabled
by enabling the 'cpu' controller in that cgroup. However, systemd doesn't provide any
good direct way to enable specific cgroup controllers; instead it
seems to enable them when it thinks that it needs them due to some
property that you set. In the case of the cpu controller, you get
it enabled in a specific cgroup by setting CPUWeight to some value
on a child unit. Normally you'll want to set CPUWeight to the default
value of '100', so that the child unit and all of its peers get the
same, predictable value for cpu.weight.
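One way to check whether this has taken effect is to look at the parent cgroup's cgroup.subtree_control file, which lists the controllers that are enabled for its children. Here is a minimal Python sketch of that check. It assumes cgroup v2 is mounted at the usual /sys/fs/cgroup, and the function names are my own for illustration, not anything that systemd provides.

```python
from pathlib import Path

# Assumption: the unified cgroup hierarchy is mounted at the usual place.
CGROUP_ROOT = Path("/sys/fs/cgroup")


def controllers_in(subtree_control_text):
    """Parse the space-separated controller names from cgroup.subtree_control."""
    return set(subtree_control_text.split())


def cpu_enabled_for_children(cgroup, root=CGROUP_ROOT):
    """Report whether children of `cgroup` (eg 'user.slice') have the cpu controller."""
    text = (root / cgroup / "cgroup.subtree_control").read_text()
    return "cpu" in controllers_in(text)
```

For example, cpu_enabled_for_children("user.slice") should report True once some user-<uid>.slice has had CPUWeight set on it.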
If you want to enable fair share scheduling across users, you need
to set CPUWeight on some user-<uid>.slice so that the user.slice
cgroup gets the cpu controller enabled. Of course, this requires such a
user-<uid>.slice to exist in the first place, which generally means that
you're going to need to hook into session setup, for example through
pam_exec.
As before, I believe that if everyone logs off, user.slice itself will
disappear and so you'll have to re-establish this setting the next time
around. Otherwise, as long as user.slice persists, the cpu controller
stays enabled (as far as I know). Locally, we are using 'systemctl
--runtime set-property ...' to set CPUWeight only non-permanently.
Otherwise I fear we would wind up with a thicket of permanent settings
for whichever users happened to be the first ones to log in each time
around.
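As an illustration of what a session setup hook along these lines might look like, here is a sketch in Python. This is not our actual script. It assumes the hook is run by pam_exec as a session module (so it sees PAM_USER and PAM_TYPE in its environment) and that the user's slice exists by the time the command runs, which depends on how your PAM stack is ordered. The function names and the choice of 100 as the weight are mine.

```python
import pwd

DEFAULT_WEIGHT = 100  # systemd's default CPUWeight


def set_weight_command(uid, weight=DEFAULT_WEIGHT):
    """Build the systemctl argv that sets a runtime-only CPUWeight on a user slice."""
    return [
        "systemctl", "--runtime", "set-property",
        f"user-{uid}.slice", f"CPUWeight={weight}",
    ]


def command_for_session(environ):
    """Return the argv to run for this pam_exec invocation, or None to do nothing."""
    # pam_exec is also invoked when sessions close; only act on session open.
    if environ.get("PAM_TYPE") != "open_session":
        return None
    user = environ.get("PAM_USER")
    if not user:
        return None
    # Raises KeyError if the login name is unknown to the system.
    uid = pwd.getpwnam(user).pw_uid
    return set_weight_command(uid)
```

A real hook would call command_for_session(os.environ) and, if it gets a command back, hand it to subprocess.run(). Because of --runtime, nothing here survives a reboot.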
(In earlier versions of systemd on cgroup v1, it was sufficient to turn CPU accounting on on some user slice, or sometimes a few of them. These days CPU accounting seems to default to on, without enabling the 'cpu' controller, and turning it on again doesn't do anything. Possibly this is because cgroup v2 seems to track CPU usage of cgroups even if the 'cpu' controller isn't enabled, so systemd decides to say that CPU accounting is always on.)
If for some reason you want to enable fair share scheduling across
system services, you can pick one that's always going to be there and
set CPUWeight=100 permanently on it. I don't know how you'd arrange to
set up fair share scheduling for virtual machines and containers (under
machine.slice); possibly you could set permanent CPUWeight properties
on all of your long-term VMs and containers, so that at least one of
them would be active and trigger machine.slice having the cpu controller
enabled on it.
(It would be cleaner if systemd provided a direct way to enable a particular resource controller in a unit like system.slice, user.slice, or machine.slice, but so it goes.)
If you're enabling fair share scheduling for users (ie, for children
of user.slice) and you want system services to get CPU priority
instead of everything under system.slice being fair share scheduled
against everything under user.slice (ie, each collectively getting
half of the available CPU), then you'll need to set an explicit
CPUWeight for either system.slice or user.slice. It's probably
easier to do this for system.slice, since it's always going to be
there. I'm not sure what value I'd set.
(I suppose you can think of it in terms of how many of the machine's
CPUs you want all users to be able to use under high CPU contention.
For example, if your machine has four CPUs and you'd like system
services to collectively get three of them under load, you can set
system.slice's CPUWeight to 300 (against user.slice's default of
100). This assumes you don't have VMs
that are also contending for CPU under machine.slice.)
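The arithmetic here is just proportional sharing between sibling cgroups: under full contention, each one gets its weight divided by the sum of all the contending weights. A small Python sketch of the calculation (the function name is made up, and it only models the case where every listed slice wants more CPU than it would get):

```python
def cpus_under_contention(weights, ncpus):
    """Split ncpus among sibling cgroups in proportion to their CPU weights.

    Only meaningful when every listed cgroup is trying to use more CPU
    than its share; idle cgroups don't take part in the split.
    """
    total = sum(weights.values())
    return {name: ncpus * weight / total for name, weight in weights.items()}


shares = cpus_under_contention({"system.slice": 300, "user.slice": 100}, 4)
print(shares)  # {'system.slice': 3.0, 'user.slice': 1.0}
```

If machine.slice is also busy at the default weight of 100, the total weight becomes 500 and system.slice's share drops to 2.4 CPUs, which is why the VM caveat matters.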