cgroups: so close and yet so far away from per-user fair scheduling
June 17, 2011
Suppose, not entirely hypothetically, that you have some shared, multiuser compute servers. Further suppose that sometimes, person A is running a single compute process while person B is running, oh, nine. Since Linux divides CPU time among processes without caring who owns them, this means that person A is getting 1/10th of the CPU while person B is getting 9/10ths of it. This doesn't seem entirely fair; it would be better if A and B split the CPU 50/50 regardless of how many compute jobs each ran.
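The arithmetic behind that 1/10 versus 9/10 split can be made concrete with a small Python sketch. It assumes a single CPU, CPU-bound processes that are always runnable, and equal nice levels; the function names are made up for illustration.

```python
from fractions import Fraction

def per_process_split(jobs):
    """Each user's CPU fraction when every runnable process gets an equal
    slice, which is what happens when the scheduler ignores ownership."""
    total = sum(jobs.values())
    if total == 0:
        return {user: Fraction(0) for user in jobs}
    return {user: Fraction(n, total) for user, n in jobs.items()}

def per_user_split(jobs):
    """Each user's CPU fraction when every user with at least one runnable
    process gets an equal slice, divided among that user's own processes."""
    active = [user for user, n in jobs.items() if n > 0]
    return {user: Fraction(1, len(active)) if user in active else Fraction(0)
            for user in jobs}

# Person A runs one compute job, person B runs nine.
jobs = {"A": 1, "B": 9}
print("per-process:", per_process_split(jobs))
print("per-user:   ", per_user_split(jobs))
```

With per-process splitting, A gets 1/10 and B gets 9/10; with per-user splitting, each gets 1/2, so each of B's nine processes ends up with 1/18 of the CPU.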
(Since these are all multi-CPU machines, the real examples are more complicated. But yes, periodically there are more compute jobs than there are cores.)
Modern Linux kernels come with support for cgroups (control groups), which are designed to enable this sort of thing. I'll cut to the chase: at least in theory, cgroups can do exactly the per-user fair scheduling that we want here. In practice, Linux is let down by the current state of the user-level tools, which lack the features you need to make this feasible.
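How to use cgroups to create per-user fair scheduling is pretty straightforward: you just put each user into their own cgroup (or at least each real user; you might want to do something different with system daemon UIDs) and give each user cgroup the same CPU weight (the cpu controller's cpu.shares setting), so that the scheduler splits CPU time evenly between users instead of between processes.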
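Done by hand, this might look something like the following minimal Python sketch. It assumes the cgroup v1 cpu controller (the interface current in 2011) is mounted at /sys/fs/cgroup/cpu, that the kernel was built with CONFIG_FAIR_GROUP_SCHED, and that it runs as root. The mount point, the users/<name> layout, and the function names are illustrative assumptions, not anything libcg prescribes.

```python
import os

# ASSUMPTION: the cgroup v1 "cpu" controller is mounted here; the real
# mount point varies by distribution (e.g. /cgroup/cpu on some systems).
CPU_CGROUP_ROOT = "/sys/fs/cgroup/cpu"

# 1024 is the kernel's default cpu.shares value. What matters for fairness
# is that every user group gets the same number, not the number itself.
DEFAULT_SHARES = 1024

def user_cgroup_path(username, root=CPU_CGROUP_ROOT):
    # Hypothetical layout: one child group per user under a "users" parent.
    return os.path.join(root, "users", username)

def put_in_user_cgroup(username, pid, shares=DEFAULT_SHARES, root=CPU_CGROUP_ROOT):
    """Create the user's cgroup if needed, set its weight, and move pid into it."""
    path = user_cgroup_path(username, root)
    # On cgroupfs, mkdir creates a new group and the kernel populates its
    # control files (cpu.shares, tasks, ...) automatically.
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "cpu.shares"), "w") as f:
        f.write(str(shares))
    # Writing a PID to "tasks" moves that task into the group; processes it
    # forks afterwards inherit the group, so a login shell covers its session.
    with open(os.path.join(path, "tasks"), "w") as f:
        f.write(str(pid))
    return path
```

Calling put_in_user_cgroup("alice", pid) with a login shell's PID would put the rest of that session in alice's group. Since every user group carries the same weight, two users with runnable jobs should each get roughly half the CPU no matter how many processes each runs.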
The kernel support for all of this is there, as are most of the user-level tools you'd need (in the form of libcg and its associated programs); there's even a PAM module that classifies users into cgroups based on a configuration file or two. What the tools lack is any way to write generic entries in those configuration files, entries that say "create a cgroup for each user and put the user in it". If you want one cgroup per user, you get to write every one of them out explicitly (and then the tools will create them all ahead of time). Oh sure, you can generate the config files with a script, but you also have to poke various daemons every time you want your config file changes to take effect. Things get annoying fast.
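The script-generation workaround might look something like this Python sketch. It emits one cgconfig.conf group stanza and one cgrules.conf rule per user. The syntax is written from the libcgroup man pages and should be checked against the installed version; the UID range used to pick "real" users, the users/ parent group, and all function names are assumptions to adjust locally.

```python
import pwd

def real_users(min_uid=1000, max_uid=60000):
    # ASSUMPTION: "real" users have UIDs in [min_uid, max_uid); this range
    # is a common distribution default, not a universal rule.
    return sorted(p.pw_name for p in pwd.getpwall()
                  if min_uid <= p.pw_uid < max_uid)

def cgconfig_entries(usernames, shares=1024):
    """cgconfig.conf text: a parent group plus one equal-weight group per user."""
    stanzas = ["group users {\n"
               "    cpu {\n"
               f"        cpu.shares = {shares};\n"
               "    }\n"
               "}\n"]
    for name in usernames:
        stanzas.append(
            f"group users/{name} {{\n"
            "    cpu {\n"
            f"        cpu.shares = {shares};\n"
            "    }\n"
            "}\n"
        )
    return "\n".join(stanzas)

def cgrules_entries(usernames):
    """cgrules.conf text: one '<user> <controllers> <destination>' line per user."""
    return "".join(f"{name}\tcpu\tusers/{name}/\n" for name in usernames)

if __name__ == "__main__":
    users = real_users()
    print(cgconfig_entries(users))
    print(cgrules_entries(users))
```

The output still has to be installed as /etc/cgconfig.conf and /etc/cgrules.conf and the relevant services told to reload, and the whole thing rerun whenever accounts are added or removed; generating the text is the easy part.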
(I also wonder how happy the kernel will be with a thousand or so cgroups, almost all of them unused at any given time, since only a handful of our users will be logged on to a compute server at once.)
PS: the tragic thing is that a PAM module that simply hard-coded the one-cgroup-per-user policy would be almost trivial to write (and I've written PAM modules before). But that would mean building and maintaining a custom PAM module, and this issue isn't quite important enough here to justify that.
(Like most sysadmins, we break out in a modest case of hives at locally developed software. The closer we can stay to stock systems, the happier we are, because it means that someone else is maintaining the software.)