2017-09-03
A fundamental limitation of systemd's per-user fair share scheduling
Up until now, I've been casually talking about systemd supporting
per-user fair share scheduling, when writing about the basic
mechanics and in things like getting
cron jobs to cooperate. But really both of
these point out a fundamental limitation, which is that systemd
doesn't have per-user fair share scheduling; what it really has is
per-slice fair share scheduling. You can create per-user fair
share scheduling from this only to the extent that you can arrange
for a given user's processes to all wind up somewhere under their
user-${UID}
slice. If you can't arrange for all of the significant
processes to get put under user-${UID}.slice, you don't get
complete per-user fair share scheduling; some processes will escape
to be scheduled separately and possibly (very) unfairly.
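To make the distinction concrete, here is what the split looks like on a live system. This is only an illustrative sketch; the PID and the UID in the slice name are made up:

    # Show the whole cgroup tree, with the user.slice versus system.slice split
    systemd-cgls

    # See which cgroup (and thus which slice) one particular process is in;
    # 12345 is a made-up PID
    cat /proc/12345/cgroup

    # Inspect the CPU fair-share settings on one user's slice
    # (user-1000.slice would be the slice for UID 1000)
    systemctl show -p CPUShares -p CPUAccounting user-1000.slice

Fair sharing only happens between siblings at the same level of this tree, which is why a process that winds up under system.slice instead of some user-${UID} slice is outside the per-user arrangement.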
This may sound like an abstract limitation, so let me give you a concrete case where it matters. We run a departmental web server, where users can run processes to handle web requests in various ways, both via CGIs and via user-managed web servers. Both of these can experience load surges of various sorts and sometimes this can result in them eating a bunch of CPU. It would be nice if user processes could have their CPU usage shared fairly among everyone, so that one user with a bunch of CPU-heavy requests wouldn't starve everyone else out of the CPU.
User-managed web servers are started either from cron with @reboot
entries
or manually, by the user logging in and (re)starting them; in both
cases we can arrange for the processes to be under user-${UID}.slice
and so be subject to per-user fair share scheduling. However, user
CGIs are run via suexec and suexec
doesn't use PAM (unlike cron); it just
directly changes UID to the target user. As a result, all suexec
CGI processes are found in apache2.service
under the system slice,
and so will never be part of per-user fair share scheduling.
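As a rough illustration of the difference (user-1000.slice is a made-up example UID; apache2.service is as above), you can see where the two kinds of processes end up:

    # A user's @reboot daemon, started via cron and thus via PAM,
    # shows up somewhere under their slice:
    systemd-cgls /user.slice/user-1000.slice

    # suexec'd CGI processes instead show up under Apache's own service:
    systemd-cgls /system.slice/apache2.service

Those CGI processes get whatever CPU share apache2.service as a whole has, regardless of which user they actually belong to.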
(Even if you could make suexec use PAM and so set up systemd sessions for the CGIs it runs, it's not clear that you'd want to be churning through that many session scopes and perhaps user slice creations and removals. I'm honestly not sure I'd trust systemd to be resilient in the face of creating huge numbers of very short-lived sessions, especially many at once if you get a load surge against some CGIs.)
As far as I can see, there's no way to solve this within the current
state of systemd, especially for the case of CGIs. Systemd would
probably need a whole new raft of features (likely including having
the user-${UID}.slice
linger around even with no processes under
it). Plus we'd need a new version of suexec that explicitly got
systemd to put new processes in the right slices (or used PAM so a
PAM module could do this).
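For what it's worth, systemd-run gives a rough picture of what 'explicitly putting a new process in the right slice' could look like. This is just a sketch of the general mechanism, not something suexec does; the slice name and command are made up, and suexec itself would still have to handle the switch to the target UID:

    # Start a command as a transient scope unit under a specific user's slice;
    # user-1000.slice and /path/to/cgi-handler are illustrative only.
    systemd-run --scope --slice=user-1000.slice /path/to/cgi-handler

(Whether you'd want Apache churning out a transient scope per CGI request is a separate question, as noted above.)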
Sidebar: This is also a general limitation of Linux
Linux has chosen to implement per-user fair share scheduling through a general mechanism to do fair share scheduling of (c)groups. Doing it this way has always required that you somehow arranged for all user processes to wind up in a per-user cgroup (whether through PAM modules, hand manipulation when creating processes, or a daemon that watched for processes that were in the wrong spot and moved them). If and when processes fell through the cracks, they wouldn't be scheduled appropriately. If anything, systemd makes it easier to get close to full per-user fair share scheduling than previous tools did.
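For illustration, the raw kernel mechanism involved is nothing more than putting processes into a cgroup and giving that cgroup a CPU share; roughly the following is what PAM modules, hand manipulation, or a watching daemon ultimately boil down to. The paths assume a cgroup v1 'cpu' controller mount, and the UID and PID are made up:

    # Create a per-user cgroup under the cpu controller (cgroup v1 layout)
    mkdir /sys/fs/cgroup/cpu/user-1000

    # Give it a relative CPU share (1024 is the default weight)
    echo 1024 > /sys/fs/cgroup/cpu/user-1000/cpu.shares

    # Move a process into it; 12345 is a made-up PID
    echo 12345 > /sys/fs/cgroup/cpu/user-1000/cgroup.procs

Any process you fail to move this way stays in whatever cgroup it started in and is scheduled there, which is exactly the falling-through-the-cracks problem.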