Linux has no fair-share scheduling that really works for compute servers

November 20, 2021

I was recently contacted by someone who has a small group of compute servers and wanted a simple way to do some sort of fair share scheduling for them, without the various overheads of an actual job allocation system like SLURM. This person was drawn to me because of my entry on how we do per-user CPU and memory resource limits on Ubuntu 18.04. Unfortunately the real answer to their questions is that you cannot really do useful resource management and fair-share scheduling of compute servers with only standard Linux facilities.

The problem, as always, is memory (specifically RAM). CPU can be dynamically allocated and reallocated in response to things like the number of people trying to use a machine, but memory can't really be in practice (and certainly Linux has no good mechanism to try to do so). Once a program has been allocated memory, it's a more or less done deal. Fair share scheduling requires dynamic flows of resources, not up front allocation, and memory is effectively allocated up front.

The simplest option is to decide that everyone gets to use one Nth of the machine's memory, for some suitable N, and then train your users not to log in to a machine that already has N people using it. The problem with this is people who log on to an otherwise idle machine and are then artificially restricted from using all of its memory, even though the rest of it is idle. This creates unhappy people and leaves your resources under-utilized, unless your machines are so actively used that there are always N people on each of them (with a different set of N people on each machine).

(This solution will work okay if your machines have far more memory than your users need or want for their jobs. At that point you can pick a maximum per-user memory usage that allows for a decent number of people on a machine at once, and that also doesn't constrain your people too much. Unfortunately we are not in this situation; some of our researchers really do want to run jobs that use large amounts of RAM.)
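The underutilization problem with the fixed 1/N split is easy to see with some arithmetic. Here is a minimal sketch, using hypothetical numbers (a 256 GB server and N of 4):

```python
# Sketch of the fixed 1/N memory split described above.
# The machine size and N are hypothetical; the point is that a lone
# user on an otherwise idle machine is still capped at 1/N of the RAM.

def per_user_cap(total_ram_gb: int, n: int) -> int:
    """Each user gets a fixed 1/Nth of RAM, regardless of actual load."""
    return total_ram_gb // n

total_ram = 256  # hypothetical 256 GB compute server
cap = per_user_cap(total_ram, n=4)
print(cap)              # each of up to 4 users may use 64 GB
print(total_ram - cap)  # 192 GB sit idle if only one person is logged in
```

With N fixed at 4, a single user on an idle 256 GB machine is held to 64 GB while 192 GB goes unused, which is exactly the unhappy situation described above.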

The hacky option is probably to have a very large amount of swap space (ideally on NVMe drives) and dynamically adjust the amount of RAM and swap space that people are allowed to use based on how many people are currently using the machine. When no one else is logged in, you get all of the machine's memory and no swap space; when one other person logs in, you get half of the RAM and enough swap space that your programs don't die on the spot, and so on. One problem with this is that for compute jobs that really use all of their memory, you've just made them thrash to death. If your swap space is on NVMe, hopefully you haven't killed the rest of the system in the process.
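The dynamic adjustment could be sketched as follows; this is an illustration with hypothetical numbers, not a working implementation. Each user's RAM share shrinks as people log in, and swap makes up the difference so that their total (RAM plus swap) allowance stays constant and running programs don't get killed outright:

```python
# Sketch of the hacky dynamic RAM/swap scheme described above.
# The machine size is hypothetical, and actually enforcing these
# numbers (e.g. via per-user cgroup limits) is left out.

def split(total_ram_gb: int, active_users: int) -> tuple[int, int]:
    """Return a (ram_cap, swap_cap) per user for the current load."""
    ram_cap = total_ram_gb // max(active_users, 1)
    # Hand back in swap whatever was taken away in RAM, so each
    # user's total (RAM + swap) allowance stays constant.
    swap_cap = total_ram_gb - ram_cap
    return ram_cap, swap_cap

print(split(256, 1))  # (256, 0): alone, you get everything and no swap
print(split(256, 2))  # (128, 128)
print(split(256, 4))  # (64, 192): memory-hungry jobs will now thrash
```

On a cgroup v2 system, one might apply caps like these with something along the lines of `systemctl set-property user-$UID.slice MemoryMax=... MemorySwapMax=...`, but the thrashing problem remains regardless of the enforcement mechanism.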

The good solution is to allow people to reserve the resources they need up front, including memory, and then arrange to not overcommit your compute servers (and to limit people to what they reserved). You can do this with scripts if you want, but a simple implementation doesn't enforce any sort of fairness. To do fairness, you really need some sort of accounting and then a policy about how to assign priority and mediate conflicts. Doing exactly this sort of reservation, accounting, and priority allocation is one of the important jobs of a system like SLURM.
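The core of the reservation approach can be sketched in a few lines. This is a deliberately minimal illustration with hypothetical names; a real system like SLURM adds accounting, priorities, queueing, and enforcement on top of it:

```python
# Minimal sketch of up-front reservation with no overcommit.
# All names and numbers here are hypothetical illustrations.

class Server:
    def __init__(self, ram_gb: int, cpus: int):
        self.free_ram = ram_gb
        self.free_cpus = cpus
        self.reservations = {}  # user -> (ram_gb, cpus)

    def reserve(self, user: str, ram_gb: int, cpus: int) -> bool:
        """Grant the request only if it fits in what's left."""
        if ram_gb > self.free_ram or cpus > self.free_cpus:
            return False  # would overcommit the machine: refuse
        self.free_ram -= ram_gb
        self.free_cpus -= cpus
        self.reservations[user] = (ram_gb, cpus)
        return True

    def release(self, user: str) -> None:
        """Return a finished user's resources to the free pool."""
        ram_gb, cpus = self.reservations.pop(user)
        self.free_ram += ram_gb
        self.free_cpus += cpus

srv = Server(ram_gb=256, cpus=32)
print(srv.reserve("alice", 200, 16))  # True: fits
print(srv.reserve("bob", 100, 8))     # False: only 56 GB left
srv.release("alice")
print(srv.reserve("bob", 100, 8))     # True: resources freed up
```

Note that this sketch only prevents overcommit; it says nothing about who *should* get the machine when requests conflict, which is where the accounting and priority policy come in.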

My personal view is that by the time you're thinking about how to implement the hacky option, you should give up and install SLURM. SLURM is sort of a pain to configure, but once you have it set up it's not too complicated to operate, it works well, and many people are already used to using it.

Sidebar: Our local solution is a hybrid

We have a few general usage compute servers where we use fair-share CPU scheduling but no memory limits, so a single person can use all of the RAM on the machine and effectively block other people. However, most of our compute servers are in our SLURM cluster, where people have to specifically reserve memory up front, can only have so many running jobs at once, and so on. If you want to do something right now and can take your chances, you can use a general compute server and it will probably work out. Otherwise, if you want more resources or more certainty or both, you need to use the SLURM cluster.
