Wandering Thoughts archives

2021-11-20

Linux has no fair-share scheduling that really works for compute servers

I was recently contacted by someone who has a small group of compute servers and wanted a simple way to do some sort of fair share scheduling for them, without the various overheads of an actual job allocation system like SLURM. This person was drawn to me because of my entry on how we do per-user CPU and memory resource limits on Ubuntu 18.04. Unfortunately the real answer to their questions is that you cannot really do useful resource management and fair-share scheduling of compute servers with only standard Linux facilities.

The problem, as always, is memory (specifically RAM). CPU can be dynamically allocated and reallocated in response to things like the number of people trying to use a machine, but memory can't really be in practice (and certainly Linux has no good mechanism to try to do so). Once a program has been allocated memory, it's a more or less done deal. Fair share scheduling requires dynamic flows of resources, not up front allocation, and memory is effectively allocated up front.

The simplest option is to decide that everyone gets to use one Nth of the machine's memory, for some suitable N, and then train your users to not log in to a machine that already has N people using it. The problem with this is people who log on to an otherwise idle machine and then are artificially restricted from using all of its memory, even though the rest of it is idle. This creates unhappy people and leaves your resources under-utilized, unless your machines are so actively used that there are always N people on each of them (and a different N people).
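
As a concrete illustration of the one-Nth approach (a sketch only; the file path and numbers are made up, and this assumes a modern systemd with cgroup v2, where the relevant property is MemoryMax= rather than cgroup v1's MemoryLimit=), a drop-in for the user-.slice template caps every login user at a fixed share of RAM:

    # /etc/systemd/system/user-.slice.d/50-memory.conf (illustrative values)
    [Slice]
    # Each user gets 1/4 of a 256 GB machine.
    MemoryMax=64G
    # Start applying reclaim pressure a bit earlier.
    MemoryHigh=56G

The 'train your users' half of the scheme has no technical enforcement at all; nothing here stops a fifth person from logging in.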

(This solution will work okay if your machines have far more memory than your users need or want for their jobs. At that point you can pick a maximum per-user memory usage that allows for a decent number of people on a machine at once, and that also doesn't constrain your people too much. Unfortunately we are not in this situation; some of our researchers really do want to run jobs that use large amounts of RAM.)

The hacky option is probably to have a very large amount of swap space (ideally on NVMe drives) and dynamically adjust the amount of RAM and swap space that people are allowed to use based on how many people are currently using the machine. When no one else is logged in, you get all of the machine's memory and no swap space; when one other person logs in you get half of the RAM and enough swap space so your programs don't die on the spot, and so on. One problem with this is that for compute jobs that really use all of their memory, you've just made them thrash to death. If your swap space is on NVMe, hopefully you haven't killed the rest of the system in the process.
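
As a sketch of what the dynamic adjustment might look like (purely illustrative, not something we run; it assumes cgroup v2 mounted at /sys/fs/cgroup and systemd's user.slice layout, and the RAM size is hard-coded where a real tool would read /proc/meminfo), a small program can divide the machine's RAM among whoever currently has a user slice:

    // fairmem: divide RAM evenly among the currently active user slices by
    // rewriting memory.max in each user-NNN.slice cgroup. Run as root,
    // periodically (say from a systemd timer). Purely a sketch.
    package main

    import (
        "fmt"
        "log"
        "os"
        "path/filepath"
    )

    const (
        userSliceDir = "/sys/fs/cgroup/user.slice"
        totalRAM     = 256 << 30 // pretend 256 GiB; really read /proc/meminfo
    )

    func main() {
        slices, err := filepath.Glob(filepath.Join(userSliceDir, "user-*.slice"))
        if err != nil || len(slices) == 0 {
            log.Fatalf("no user slices found: %v", err)
        }
        // Everyone currently logged in gets an equal share of RAM.
        share := totalRAM / len(slices)
        for _, s := range slices {
            limit := fmt.Sprintf("%d\n", share)
            // memory.swap.max could be adjusted here in the same way.
            if err := os.WriteFile(filepath.Join(s, "memory.max"), []byte(limit), 0644); err != nil {
                log.Printf("setting limit for %s: %v", s, err)
            }
        }
    }

In practice you would probably go through 'systemctl set-property user-<uid>.slice MemoryMax=...' instead of writing the cgroup files directly, so that systemd doesn't later overwrite your values; either way, the thrashing problem doesn't go away.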

The good solution is to allow people to reserve the resources they need up front, including memory, and then arrange to not overcommit your compute servers (and to limit people to what they reserved). You can do this with scripts if you want, but a simple implementation doesn't enforce any sort of fairness. To do fairness, you really need some sort of accounting and then a policy about how to assign priority and mediate conflicts. Doing exactly this sort of reservation, accounting, and priority allocation is one of the important jobs of a system like SLURM.
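
A minimal sketch of the bookkeeping side of this (the names and structure here are mine and purely illustrative): a ledger that refuses any reservation that would overcommit a machine's RAM, and nothing more.

    // reserve: up-front memory reservation with no fairness policy at all;
    // a request either fits in the machine's unreserved RAM or is refused.
    package main

    import (
        "errors"
        "fmt"
    )

    type Machine struct {
        TotalRAM int64            // bytes of RAM the machine has
        reserved map[string]int64 // user -> bytes currently reserved
    }

    func NewMachine(total int64) *Machine {
        return &Machine{TotalRAM: total, reserved: make(map[string]int64)}
    }

    // Reserve grants the request only if it doesn't overcommit the machine.
    func (m *Machine) Reserve(user string, bytes int64) error {
        var used int64
        for _, b := range m.reserved {
            used += b
        }
        if used+bytes > m.TotalRAM {
            return errors.New("machine would be overcommitted")
        }
        m.reserved[user] += bytes
        return nil
    }

    // Release returns the memory when the job finishes.
    func (m *Machine) Release(user string, bytes int64) {
        m.reserved[user] -= bytes
    }

    func main() {
        m := NewMachine(256 << 30)               // a 256 GiB machine
        fmt.Println(m.Reserve("alice", 200<<30)) // granted (<nil>)
        fmt.Println(m.Reserve("bob", 100<<30))   // refused: would overcommit
    }

Everything hard lives outside this sketch: actually enforcing the limits on the machines, accounting for usage over time, and deciding whose request wins when there isn't room, which is exactly the territory where SLURM earns its keep.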

My personal view is that by the time you're thinking about how to implement the hacky option, you should give up and install SLURM. SLURM is sort of a pain to configure, but once you have it set up it's not too complicated to operate, it works well, and many people are already used to using it.

Sidebar: Our local solution is a hybrid

We have a few general usage compute servers where we use fair-share CPU scheduling but no memory limits, so a single person can use all of the RAM on the machine and effectively block other people. However, most of our compute servers are in our SLURM cluster, where people have to specifically reserve memory up front, can only have so many running jobs at once, and so on. If you want to do something right now and can take your chances, you can use a general compute server and it will probably work out. Otherwise, if you want more resources or more certainty or both, you need to use the SLURM cluster.

linux/FairShareComputeImpossible written at 23:23:17

Why your Go programs can surprisingly be dynamically linked

Recently I read Julia Evans' Debugging a weird 'file not found' error, where the root problem was that a Linux Go program that Evans expected to be statically linked (because Go famously produces statically linked binaries) was instead dynamically linked and running in an environment without its required ELF interpreter. Although Go defaults to producing static executables when it can, winding up with a dynamically linked Go program on Linux is surprisingly common. I gave one version of the story in a tweet:

The Go standard library can need to call libc functions for a few things that it can't fully emulate, like looking up users/groups and doing hostname resolution (both can maybe require dynamically loaded NSS shared libraries for eg LDAP or mDNS). Disabling CGO turns this off.

Although it's not officially spelled out in cgo's documentation, it's well known that if you use CGO, your Go program will normally be dynamically linked against the C library. People widely assume the inverse of this: if you don't use CGO, you don't get CGO and so your Go program will be statically linked (well, on Linux, where Go directly makes system calls itself instead of going through the C library).

However, there are some functions in the Go standard library that intrinsically have to use the platform C library in order to work fully correctly all of the time. The largest case is anything that looks up some sort of information that goes through NSS, which can require loading and calling arbitrary C shared objects. As of Go 1.17, the two sorts of things that do this are various network related lookups and user (or group) lookups. Both the os/user package and the net package's section on Name Resolution mention this in their documentation, but not prominently or clearly. Each says some variant of 'when cgo is available, the cgo-based version may be used'. To simplify slightly, CGO is available if it hasn't been specifically disabled by setting 'CGO_ENABLED=0' and you're building natively (on Linux itself).

(CGO may also be available if you're cross-compiling and have set up a relatively complex environment. Simple Go-based cross compilation doesn't normally have CGO available.)

If you don't have CGO disabled and you directly or indirectly use either net or os/user, you'll normally wind up with a dynamically linked Go executable. This executable won't necessarily actually call the C library when your program looks up hostnames (cf), but the mere possibility of needing to do it forces the dynamic linking and thus makes the program depend on the ELF interpreter for the C library you're using. Since a lot of Go programs wind up doing some sort of networking, a lot of Go programs wind up dynamically linked on Linux unless people go out of their way to avoid it.
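
You can see this with a trivial program (purely an illustration). Built normally on Linux it will usually come out dynamically linked; built with 'CGO_ENABLED=0 go build' (or with the 'netgo' and 'osusergo' build tags to force the pure Go versions) it comes out static, which you can verify with 'file' or 'ldd'.

    // lookup.go: its only interesting property is that it imports and uses
    // the net package, which is enough to pull in the cgo-based resolver
    // path when CGO is enabled.
    package main

    import (
        "fmt"
        "net"
    )

    func main() {
        addrs, err := net.LookupHost("localhost")
        fmt.Println(addrs, err)
    }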

If you want to see all the various sorts of things in the net package that can wind up making C library calls, see net/cgo_unix.go and possibly net/lookup_unix.go, which calls the stuff from cgo_unix.go under various circumstances.

PS: In the Go 1.17 toolchain (and probably in future ones), merely importing the net package will trigger this dynamic linking, even if you never call anything from it. Evidently the CGO status is a per-package thing that doesn't depend on what code you use from the package.
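
A sketch of that, for the curious: this program blank-imports net and never calls it, and with CGO enabled it still normally builds into a dynamically linked binary.

    // importonly.go: imports net only for side effects; with CGO enabled
    // this is typically still enough to produce a dynamically linked binary.
    package main

    import (
        "fmt"
        _ "net" // imported, never called
    )

    func main() {
        fmt.Println("hello")
    }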

programming/GoWhyNotStaticLinked written at 00:46:02

