The basics of Linux fair share CPU scheduling in cgroup v2 ('unified cgroups')

May 31, 2022

Linux has long been able to do fair share CPU scheduling, where CPU time is evenly divided between, for example, the various different users. Originally this took some manual work, but systemd wound up making it fairly easy. It's done through the Linux kernel's 'cgroup' feature, which comes in two versions: the now-old cgroup v1, and the no longer all that new unified cgroups (cgroup v2). When using systemd, these turn out to differ in some important ways. To understand how to implement fair share CPU scheduling with systemd in a cgroup v2 world, I want to start by understanding how cgroup v2 works and does fair share scheduling.

In cgroup v2, there is a single ('unified') hierarchy of cgroups and processes in them. However, a given spot in the hierarchy may not have all of the available resource controllers enabled in it, as covered in the "enabling and disabling" section of the documentation. Enabling a controller is potentially important because, as described there:

Enabling a controller in a cgroup indicates that the distribution of the target resource across its immediate children will be controlled. [...]

The inverse is true; if a controller is not enabled, the distribution of the resource to the cgroup's children is not being controlled.

The list of controllers that are being applied to immediate children is in cgroup.subtree_control. The presence of a particular controller in that file causes settings files related to the controller to show up in the immediate children, which means that looking for those settings files in a child (such as 'user.slice/user-1000.slice') is a reliable indicator that the parent (ie, 'user.slice') has the controller enabled.
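As a sketch (my own, not from the cgroup documentation), checking this programmatically just means reading the parent's cgroup.subtree_control and looking at the space-separated controller names. The /sys/fs/cgroup mount point and the slice paths are the usual systemd defaults, but treat them as assumptions:

```python
from pathlib import Path

# Typical cgroup v2 mount point on a systemd system (an assumption here).
CGROUP_ROOT = Path("/sys/fs/cgroup")

def parse_subtree_control(data: str) -> set[str]:
    """Parse the space-separated contents of a cgroup.subtree_control file."""
    return set(data.split())

def controllers_enabled_for_children(cgroup: str) -> set[str]:
    """Return the controllers a cgroup applies to its immediate children."""
    path = CGROUP_ROOT / cgroup / "cgroup.subtree_control"
    return parse_subtree_control(path.read_text())

# E.g. 'cpu' in controllers_enabled_for_children("user.slice") tells you
# whether user.slice is distributing CPU time among the per-user slices.
```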

Control of CPU scheduling is handled by the cgroup v2 'cpu' controller. Fair share scheduling is handled through a weight-based CPU time distribution model (although you can also impose usage limits). When the CPU controller is enabled for a cgroup, a cpu.weight file appears in all children (with a default value of 100). When distributing CPU time to the children, all of their cpu.weight values are summed up and then each active child gets CPU in proportion to their weight relative to the total. This means that if all cpu.weight files have the same value, all children will get equal shares of the CPU time. The actual cpu.weight values only matter if they're different; if they're all the same, the value is arbitrary.
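The weight arithmetic can be sketched in a few lines (this is an illustration of the proportional model, not how the kernel actually computes things; the kernel works per scheduling period, and only among children that are actively runnable):

```python
def cpu_shares(weights: dict[str, int]) -> dict[str, float]:
    """Each active child's share of CPU time is its cpu.weight divided by
    the sum of all active children's cpu.weight values."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# With the default cpu.weight of 100 everywhere, shares come out equal,
# and only the relative values matter:
cpu_shares({"user-1000.slice": 100, "user-1001.slice": 100})
# equivalent to cpu_shares({"user-1000.slice": 1, "user-1001.slice": 1})
```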

(All of this is assuming that the CPU is saturated. If not all of the CPU is being used, everyone gets as much of it as they want.)

If you're using systemd and you merely want fair share scheduling between all users, the state (and change) you want is for user.slice to have the 'cpu' controller enabled. Once the controller is enabled, all individual user slices will acquire a default cpu.weight, and since it will all be the same, CPU will be shared evenly across all active users.
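One common way to get systemd to enable the 'cpu' controller there (an assumption on my part; the details vary with your systemd version, and the drop-in path is an example) is to set a CPU property on the per-user slices, for instance with a drop-in:

```ini
# /etc/systemd/system/user-.slice.d/50-cpuweight.conf (example path)
[Slice]
CPUWeight=100
```

Setting any CPU property on the user slices causes systemd to turn on CPU accounting and control for them, which requires enabling the 'cpu' controller in user.slice.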

However, at least in the cgroup v2 world this has an additional implication. You can't enable a controller in a child without it also being enabled in the parent, so enabling the 'cpu' controller in user.slice implies that it is also enabled in the root cgroup. In turn, this implies that cgroup v2 will do fair-share CPU scheduling across all direct children of the root. In a systemd based system, this means that user.slice as a whole will (by default) be fair share scheduled against system.slice, and also against any virtual machines or containers that you're running (which are under machine.slice). If you have a bunch of users trying to use CPU and also some important CPU-consuming system services (or virtual machines), this may limit the CPU usage of the latter in ways that you don't want, always giving the users half the available CPU time (or a third, if you have all of users, virtual machines, and system services burning up CPU).

If you don't want this to happen, I believe that your only option is to adjust cpu.weight so that, for example, system.slice has a much higher weight than user.slice. This will tend to give system services more of the CPU when there's contention between them and user processes. Since weighted resources are distributed through the hierarchy, there's no way to put everything in a big pool together to all get equal shares. If you have one VM, two system services, and three users all trying to use all the CPU and you're using default weights in fair share scheduling, the VM will get 1/3rd, the two system services will collectively get 1/3rd (so 1/6th each if you're doing fair share scheduling at that level), and the three users will collectively get 1/3rd (and so 1/9th each).
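The hierarchical arithmetic from that example can be sketched as follows: a cgroup's overall share is its share among its siblings multiplied by its parent's overall share. The names and the uniform weights of 100 are the illustrative defaults from the example above:

```python
def overall_shares(tree: dict) -> dict[str, float]:
    """Flatten a tree of {name: weight} leaves and {name: (weight, subtree)}
    interior nodes into each leaf's overall share of total CPU time."""
    def walk(subtree: dict, parent_share: float, out: dict) -> dict:
        total = sum(w if isinstance(w, int) else w[0] for w in subtree.values())
        for name, w in subtree.items():
            if isinstance(w, int):
                out[name] = parent_share * w / total
            else:
                weight, children = w
                walk(children, parent_share * weight / total, out)
        return out
    return walk(tree, 1.0, {})

shares = overall_shares({
    "machine.slice": (100, {"vm1": 100}),
    "system.slice":  (100, {"svc1": 100, "svc2": 100}),
    "user.slice":    (100, {"user1": 100, "user2": 100, "user3": 100}),
})
# The one VM gets 1/3, each service 1/6, and each user 1/9, as described.
```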

Another consequence of all of this is that it's not possible to limit the CPU usage of a cgroup (for example, a systemd service or user) without also enabling fair share scheduling, since both are done by the 'cpu' controller. In an ideal world maybe you could turn fair share scheduling off by writing a 0 to cpu.weight, but currently you can't (the minimum value is 1). This means that if you want to limit the CPU usage of users, you're getting fair share scheduling between users and system services for "free", whether or not you want it (and also fair share scheduling between users, but you probably don't object to that).
