How to get per-user fair share scheduling on Ubuntu 16.04 (with systemd)

August 15, 2017

When I wrote up imposing temporary CPU and memory limits on a user on Ubuntu 16.04, I sort of discovered that I had turned on per-user fair share CPU scheduling as a side effect, although I didn't understand exactly how to do this deliberately. Armed with a deeper understanding of how to tell if fair share scheduling was on, I've now done a number of further experiments and I believe I have definitive answers. This applies only to Ubuntu 16.04 and its version of systemd as configured by Ubuntu; it doesn't seem to apply to, for example, a stock Fedora 26 system.

To enable per user fair share CPU scheduling, it appears that you must do two things:

  • First, set CPUAccounting=true on user.slice. You can do this temporarily with 'systemctl --runtime set-property' or permanently with a plain 'systemctl set-property'.

  • Second, arrange to have CPUAccounting=true set on an active user slice. If you do this temporarily with 'systemctl --runtime', the user must be logged in with some sort of session at the time. If you do this permanently, nothing happens until that user logs in and systemd creates their user-${UID}.slice slice.
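As a concrete sketch of these two steps, assuming a user with UID 1000 is currently logged in (both the UID and the use of --runtime here are illustrative, not specific to any real machine):

```shell
# Step 1: turn on CPU accounting for the parent user.slice.
# (--runtime makes this temporary; drop it for a permanent setting.)
systemctl --runtime set-property user.slice CPUAccounting=true

# Step 2: turn it on for an active per-user slice; 1000 is a
# hypothetical UID of someone who is logged in right now.
systemctl --runtime set-property user-1000.slice CPUAccounting=true
```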

Once you've done both of these, all future (user) sessions from any user will have their processes included in per-user fair share scheduling. If you used 'systemctl --runtime' on a user-${UID}.slice, it doesn't matter if that user logs completely out and their slice goes away; the fair share scheduling sticks despite this. However, fair-share scheduling goes away if all users log out and user.slice is removed by systemd. You need at least one remaining user session at all times to keep user.slice in use (a detached screen session will do).

If you want to force existing processes to be subject to per-user fair share scheduling, you must arrange to set CPUAccounting=true on all current user scopes:

for i in $(systemctl -t scope list-units |
           awk '{print $1}' |
           grep '^session-.*\.scope$'); do
    systemctl --runtime set-property "$i" CPUAccounting=true
done
This creates a slightly different cgroup hierarchy than you'll get from completely proper fair share scheduling, but the differences are probably unimportant in practice. In regular fair share scheduling, all processes from the same user are grouped together under user.slice/user-${UID}.slice, so they contend evenly with each other. When you force scopes this way, processes stay grouped into their scopes, so they go in user.slice/user-${UID}.slice/session-<blah>.scope; as a result, a user's scopes are also fair-share scheduled against each other. This only applies to current processes and scopes; as users log out and then back in again, their new processes will all be grouped together.

If you have a sufficiently small number of users who will log in to your machines and run CPU-consuming things, it's feasible to create permanent settings for each of them with 'systemctl set-property user-${UID}.slice CPUAccounting=true'. If you have lots of users, as we do, this is infeasible; if nothing else, your /etc/systemd/system directory would wind up utterly cluttered. This means that you have to do it on the fly (and then do it again if all user sessions ended and systemd deleted user.slice).

This is where we run into an important limitation of per-user fair share scheduling on a normally configured Ubuntu 16.04. As we've set fair-share scheduling up, it only applies to processes that are under user.slice; system processes are not fair-share scheduled. It turns out that user cron jobs don't run under user.slice and so are not fair-share scheduled. All processes created by user cron entries wind up grouped together under cron.service; there is no per-user separation and nothing is put under user slices.
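You can see this grouping for yourself with systemd-cgls, which lists the processes inside a given control group (assuming Ubuntu's cron.service unit name):

```shell
# List everything under cron.service's control group; on Ubuntu 16.04
# this includes all processes started from user crontabs, which is why
# they escape per-user fair share scheduling.
systemd-cgls /system.slice/cron.service
```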

(It's possible that you can change this with PAM magic, but this is how a normal Ubuntu 16.04 machine behaves.)

I discovered this because I had the clever idea that I could use a root @reboot /etc/cron.d entry to set things on user.slice and user-0.slice shortly after the system booted. Attempting to do this led to the discovery that neither slice actually existed when my @reboot job ran, and that my process was under cron.service instead. As far as I can see there's no way around this; there just doesn't seem to be a systemd command that will run a command for you under a user slice.

(If there was, you could make a root @reboot crontab that ran the necessary systemctl commands and then didn't exit, so there would always be an active user slice so that user.slice wouldn't get removed by systemd.)

PS: My solution was to wrap up all of these steps into a shell script that we can run if we need to turn on fair-share scheduling on some machine because a bunch of users are contending over it. Such an on demand, on the fly solution is good enough for our case (even if it doesn't include crontab jobs, which is a real pity for some machines).

Comments on this page:

From a commenter at 2017-08-16 06:03:40:

Making cron use `pam_systemd` would help with the grouping problem, but OTOH it causes a new session to be created for every cronjob, which is less than optimal. So various distros might be reluctant to use it.

Running periodic stuff via `systemctl --user` timer units might be an alternative (the --user instance has a persistent slice alongside the regular sessions).
