2022-05-31
The basics of Linux fair share CPU scheduling in cgroup v2 ('unified cgroups')
Linux has long been able to do fair share CPU scheduling, where CPU time is evenly divided between, for example, the various users. Originally this took some manual work, but systemd wound up making it fairly easy. This is done through the Linux kernel 'cgroup' feature, which comes in two versions: the now-old cgroup (v1) and the no-longer-so-new unified cgroups (cgroup v2). When using systemd, these turn out to be different in some important ways. To understand implementing fair share CPU scheduling with systemd in a cgroup v2 world, I want to start by understanding how cgroup v2 works and does fair share scheduling.
In cgroup v2, there is a single ('unified') hierarchy of cgroups and processes in them. However, a given spot in the hierarchy may not have all of the available resource controllers enabled in it, as covered in the "enabling and disabling" section of the documentation. Enabling a controller is potentially important because, as described there:
Enabling a controller in a cgroup indicates that the distribution of the target resource across its immediate children will be controlled. [...]
The inverse is also true; if a controller is not enabled, the distribution of the resource to the cgroup's children is not being controlled.
The list of controllers that are being applied to immediate children is in cgroup.subtree_control. The presence of a particular controller in that file causes settings files related to the controller to show up in the immediate children, which means that looking for those settings files in a child (such as 'user.slice/user-1000.slice') is a reliable indicator that the parent (ie, 'user.slice') has the controller enabled.
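As a minimal sketch of both checks, the following reads a cgroup's cgroup.subtree_control and looks for cpu.weight in a child. The function names are made up for illustration, and the example paths assume that cgroup v2 is mounted at /sys/fs/cgroup, which is common but not guaranteed.

```python
from pathlib import Path


def enabled_child_controllers(cgroup_dir):
    """Return the controllers listed in this cgroup's cgroup.subtree_control."""
    text = (Path(cgroup_dir) / "cgroup.subtree_control").read_text()
    # The file is a single line of space-separated controller names.
    return set(text.split())


def child_has_cpu_settings(child_dir):
    """True if the child has a cpu.weight file, ie its parent enabled 'cpu'."""
    return (Path(child_dir) / "cpu.weight").exists()
```

On a real system you might call these as enabled_child_controllers("/sys/fs/cgroup/user.slice") and child_has_cpu_settings("/sys/fs/cgroup/user.slice/user-1000.slice"), assuming that mount point and that a user with UID 1000 is logged in.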
Control of CPU scheduling is handled by the cgroup v2 'cpu' controller. Fair share scheduling is handled through a weight-based CPU time distribution model (although you can also impose usage limits). When the CPU controller is enabled for a cgroup, a cpu.weight file appears in all children (with a default value of 100). When distributing CPU time to the children, all of their cpu.weight values are summed up and then each active child gets CPU in proportion to their weight relative to the total. This means that if all cpu.weight files have the same value, all children will get equal shares of the CPU time. The actual cpu.weight values only matter if they're different; if they're all the same, the value is arbitrary.
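The weight arithmetic can be sketched in a few lines. This is only an illustration of the proportional model described above, not how the kernel implements it; the function name and the child names are hypothetical, and it assumes every listed child is active and the CPU is saturated.

```python
from fractions import Fraction


def cpu_shares(weights):
    """Map each active child to its fraction of CPU time under contention.

    'weights' maps a child name to its cpu.weight value.
    """
    total = sum(weights.values())
    return {name: Fraction(weight, total) for name, weight in weights.items()}


# Equal weights give equal shares, whatever the actual value is.
equal = cpu_shares({"a": 100, "b": 100, "c": 100})

# Doubling one child's weight gives it half, and the others a quarter each.
uneven = cpu_shares({"a": 200, "b": 100, "c": 100})
```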
(All of this is assuming that the CPU is saturated. If not all of the CPU is being used, everyone gets as much of it as they want.)
If you're using systemd and you merely want fair share scheduling between all users, the state (and change) you want is for user.slice to have the 'cpu' controller enabled. Once the controller is enabled, all individual user slices will acquire a default cpu.weight, and since they will all be the same, CPU will be shared evenly across all active users.
However, at least in the cgroup v2 world this has an additional implication. You can't enable a controller in a child without it also being enabled in the parent, so enabling the 'cpu' controller in user.slice implies that it is also enabled in the root cgroup. In turn, this implies that cgroup v2 will do fair share CPU scheduling across all direct children of the root. In a systemd-based system, this means that user.slice as a whole will (by default) be fair share scheduled against system.slice, and also against any virtual machines or containers that you're running (which are under machine.slice). If you have a bunch of users trying to use CPU and also some important CPU-consuming system services (or virtual machines), this may limit the CPU usage of the latter in ways that you don't want. Under contention, the users will always get half the available CPU time (or a third, if users, virtual machines, and system services are all burning up CPU).
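One way to see this top-down requirement on a given machine is to walk from the root cgroup down to the cgroup you care about and report any level that does not list the controller in cgroup.subtree_control. This is a rough sketch; the function name is invented here, and the root path is passed in as a parameter because the mount point (commonly /sys/fs/cgroup) is an assumption.

```python
from pathlib import Path


def cgroups_missing_controller(root, rel_path, controller="cpu"):
    """List cgroups from 'root' down to 'rel_path' that don't enable
    'controller' for their children."""
    current = Path(root)
    chain = [current]
    for part in Path(rel_path).parts:
        current = current / part
        chain.append(current)

    missing = []
    for cgroup in chain:
        listed = (cgroup / "cgroup.subtree_control").read_text().split()
        if controller not in listed:
            missing.append(str(cgroup))
    return missing
```

With the usual mount point, cgroups_missing_controller("/sys/fs/cgroup", "user.slice") would return an empty list only if both the root and user.slice distribute 'cpu' to their children.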
If you don't want this to happen, I believe that your only option is to adjust cpu.weight so that, for example, system.slice has a much higher weight than user.slice. This will tend to give system services more of the CPU when there's contention between them and user processes. Since weighted resources are distributed through the hierarchy, there's no way to put everything in a big pool together to all get equal shares. If you have one VM, two system services, and three users all trying to use all the CPU and you're using default weights in fair share scheduling, the VM will get 1/3rd, the two system services will collectively get 1/3rd (so 1/6th each if you're doing fair share scheduling at that level), and the three users will collectively get 1/3rd (and so 1/9th each).
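The hierarchical arithmetic in that example can be worked through with a small recursive sketch. It models only the proportional split at each level, assuming a saturated CPU, every leaf being busy, and the 'cpu' controller enabled at each level; the leaf names (vm1, svc-a, user-1 and so on) are made up.

```python
from fractions import Fraction


def leaf_shares(children, share=Fraction(1), prefix=""):
    """Split 'share' among 'children' by weight, recursing into sub-cgroups.

    'children' maps a name to (cpu.weight, sub-children dict or None).
    """
    total = sum(weight for weight, _ in children.values())
    result = {}
    for name, (weight, sub) in children.items():
        mine = share * Fraction(weight, total)
        if sub:
            result.update(leaf_shares(sub, mine, prefix + name + "/"))
        else:
            result[prefix + name] = mine
    return result


tree = {
    "machine.slice": (100, {"vm1": (100, None)}),
    "system.slice": (100, {"svc-a": (100, None), "svc-b": (100, None)}),
    "user.slice": (100, {"user-1": (100, None), "user-2": (100, None),
                         "user-3": (100, None)}),
}
shares = leaf_shares(tree)
# vm1 gets 1/3, each service gets 1/6, and each user gets 1/9.
```

Changing the weight on system.slice to 400 in this model gives the two services 1/3 each, leaving 1/6 for the VM and 1/18 for each user.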
Another consequence of all of this is that it's not possible to limit the CPU usage of a cgroup (for example, a systemd service or user) without also enabling fair share scheduling, since both are done by the 'cpu' controller. In an ideal world maybe you could turn fair share scheduling off by writing a 0 to cpu.weight, but currently you can't (the minimum value is 1). This means that if you want to limit the CPU usage of users, you're getting fair share scheduling between users and system services for "free", whether or not you want it (and also fair share scheduling between users, but you probably don't object to that).
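For reference, the usage limit itself lives in the 'cpu' controller's cpu.max file, which holds a quota and a period (in microseconds) in the form "$MAX $PERIOD", with "max" meaning no limit. A small sketch of reading it, with a hypothetical function name:

```python
def parse_cpu_max(text):
    """Turn the contents of cpu.max into a limit in CPUs' worth of time.

    Returns None when the quota is 'max' (no limit).
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


unlimited = parse_cpu_max("max 100000")    # None: no limit set
half_cpu = parse_cpu_max("50000 100000")   # 0.5: half of one CPU
```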