2021-11-26
How we use the SLURM job scheduler system on our compute servers
The Slurm Workload Manager, often called just SLURM, is what a lot of supercomputers and big compute clusters use to manage scheduling and executing all of the jobs that people always want to run on them. However, you don't have to use it in such grand environments; we use it in a much more modest one, and most people's experience of it is relatively simple. We've actually had two iterations of our SLURM environment, one that I described in 2019 and the current one.
Our motivation for using SLURM at all is that we have a pool of compute servers of varying capacity, and some GPU servers as well. A few of these compute servers are general login servers, but the problem with these is that they're a free for all; anyone can log in at any time and start using CPU (and perhaps memory, although that can't be fair-share scheduled so it's first come, first served). Traditionally people have wanted to reserve some dedicated amount of resources that are theirs for some amount of time. Well, SLURM does that.
As experienced by our researchers, our SLURM setup lets them reserve some amount of cores and memory on a compute server for a job for up to a maximum time (and also GPUs, if they want those and the machine has them). Sometimes people will be picky about what sort of machine (or what specific machine) they want to use, and other times they'll take whichever of our highly varied set of compute servers has enough resources free at the moment and SLURM decides to use. When SLURM grants your allocation, which may be immediately, you can run an interactive login session on the compute server or just run a program or script. Your program or session gets access to however many resources you asked for and got given, and no more. If the compute server has left over resources, someone else can allocate them (or you can allocate them for another job).
(When a job ends, perhaps because it hit the maximum number of days we allow a single job to run for, SLURM terminates all of that job's processes on the compute server.)
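To make the shape of such a request concrete, here is a minimal sketch in Python that assembles an srun command line. The helper name build_srun_command and the example values are hypothetical. It also assumes that GPUs are exposed as a generic resource named 'gpu', which depends on how a particular cluster is configured and is not something I'm describing about our setup here. The srun options themselves (--cpus-per-task, --mem, --time, --gres, --nodelist, --pty) are standard ones.

```python
import shlex


def build_srun_command(cores, memory_gb, hours, gpus=0, node=None, command=None):
    """Build an srun argument list for one job (hypothetical helper)."""
    args = [
        "srun",
        f"--cpus-per-task={cores}",   # how many cores to reserve
        f"--mem={memory_gb}G",        # how much memory to reserve on the node
        f"--time={hours}:00:00",      # maximum run time as hours:minutes:seconds
    ]
    if gpus:
        # Assumes the cluster defines a generic resource called "gpu".
        args.append(f"--gres=gpu:{gpus}")
    if node:
        # Being picky: ask for one specific machine.
        args.append(f"--nodelist={node}")
    if command is None:
        # No program given, so ask for an interactive shell in the allocation.
        args += ["--pty", "bash", "-i"]
    else:
        args += list(command)
    return args


# An interactive session with 4 cores and 16 GB for up to 8 hours.
print(shlex.join(build_srun_command(4, 16, 8)))
# A script with 8 cores, 32 GB and one GPU for up to a day.
print(shlex.join(build_srun_command(8, 32, 24, gpus=1, command=["python3", "train.py"])))
```

Leaving out the node means SLURM picks whichever machine has enough free resources, which is the 'take whatever is available' case.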
Researchers can have more than one job active at once, up to a relatively large limit. Internally, SLURM keeps track of what resources people have used recently and uses this information to do fair scheduling when all of our compute or GPU servers are busy for long enough, so that everyone who is competing for resources gets roughly the same amount over the long term. This does have some caveats; basically, our SLURM setup always lets you use resources that are currently free, so in order to get fair scheduling, people have to be willing to submit jobs that won't execute immediately.
(In practice our compute servers generally aren't all busy, and when they are all busy it's usually not for very long.)
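The general idea of 'what you used recently counts against you' can be illustrated with a toy model. This is not SLURM's actual priority computation, just a sketch under my own assumptions: past usage is recorded as (time, core-hours) pairs, older usage is discounted with a half-life, and when everything is busy the next free slot goes to the waiting person with the least discounted usage. All of the names and numbers here are made up.

```python
def decayed_usage(records, now, half_life):
    """Sum past usage, halving its weight for every half_life hours of age."""
    return sum(amount * 0.5 ** ((now - when) / half_life) for when, amount in records)


def next_user(pending_users, history, now, half_life=168.0):
    """Pick the waiting user with the least decayed recent usage."""
    return min(
        pending_users,
        key=lambda user: decayed_usage(history.get(user, []), now, half_life),
    )


# (hour the usage happened, core-hours used); hypothetical data.
history = {
    "alice": [(0.0, 100.0), (160.0, 400.0)],
    "bob": [(10.0, 300.0)],
}

# Alice's heavy recent usage outweighs Bob's older usage, so Bob goes first.
print(next_user(["alice", "bob"], history, now=168.0))
```

Note that a model like this only matters for jobs that are actually waiting in the queue, which is why people have to be willing to submit jobs that won't start right away.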
Our researchers can in theory use our SLURM setup for a lot of additional, more sophisticated things, but in practice 'run something on a compute server with some guaranteed resources' is the big usage. I suspect that most researchers never bother to look deeper in our documentation and the SLURM manual pages than how to make requests, see what compute servers are available and what they have, and maybe see what jobs are active or queued up.
Sidebar: Our changing SLURM setup over time
Our first SLURM setup tried to use SLURM to allocate entire compute servers at a time, which was imitating our previous manual system. Even at the time I noted that this wasn't how SLURM wanted to work, and eventually that mismatch became too much. Our current setup lets SLURM allocate and schedule fine-grained resources the way it wants to, which means that people can be allocated only part of a machine and they have to figure out how many resources to request for their jobs.
Why region based memory allocation helps with fragmentation
Recently, I read an advocacy article on Go's garbage collection as compared to Java (which I found via a source that also has important criticisms of its technical content). One of the things it mentioned is that Go's region based memory allocation reduces memory fragmentation. This struck me as both correct and not entirely obvious, so I want to talk about why region allocation helps.
A classical simple memory allocator maintains a pool of blocks of unallocated memory of various sizes. When you ask to allocate some amount of memory, in the ideal case there is a free block of exactly the right size and you get it. If there isn't, the allocator breaks up an existing free block that is larger than you need, handing you the now-allocated part and putting the remaining smaller block of free memory back in the pool. When you free memory again, the allocator adds it to the pool and then attempts to merge it with adjacent blocks to make a larger block of free memory. If you ask the allocator for a larger amount of memory than it has in a free block (for example, you ask for 64 Kb when the largest free blocks are 32 Kb), the allocator gets more memory from the operating system and returns some or all of it to you as your allocation.
(The simple but often slow way to maintain the pool of free memory blocks is as an ordered linked list.)
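Here is a minimal sketch of such a classical allocator in Python, working on plain integer addresses instead of real memory. It makes some assumptions of my own: it's first-fit, it keeps the free blocks as an address-ordered list, and it 'gets memory from the operating system' by extending an imaginary heap in 64 Kb steps. The class and method names are all hypothetical.

```python
class SimpleAllocator:
    """First-fit allocator over an address-ordered list of free blocks."""

    def __init__(self, grow_size=64 * 1024):
        self.free_blocks = []   # (start, size) tuples, sorted by start address
        self.allocated = {}     # start address -> size of live allocation
        self.top = 0            # end of the memory obtained so far
        self.grow_size = grow_size

    def alloc(self, size):
        for i, (start, block_size) in enumerate(self.free_blocks):
            if block_size >= size:
                if block_size == size:
                    # Ideal case: a free block of exactly the right size.
                    del self.free_blocks[i]
                else:
                    # Split: hand out the front, keep the remainder free.
                    self.free_blocks[i] = (start + size, block_size - size)
                self.allocated[start] = size
                return start
        # No free block is big enough, so get more memory and retry.
        amount = max(size, self.grow_size)
        self._insert_free(self.top, amount)
        self.top += amount
        return self.alloc(size)

    def free(self, start):
        size = self.allocated.pop(start)
        self._insert_free(start, size)

    def _insert_free(self, start, size):
        blocks = self.free_blocks
        i = 0
        while i < len(blocks) and blocks[i][0] < start:
            i += 1
        blocks.insert(i, (start, size))
        # Merge with the following block if they are adjacent.
        if i + 1 < len(blocks) and start + size == blocks[i + 1][0]:
            blocks[i] = (start, size + blocks[i + 1][1])
            del blocks[i + 1]
        # Merge with the preceding block if they are adjacent.
        if i > 0 and blocks[i - 1][0] + blocks[i - 1][1] == blocks[i][0]:
            blocks[i - 1] = (blocks[i - 1][0], blocks[i - 1][1] + blocks[i][1])
            del blocks[i]


heap = SimpleAllocator()
a = heap.alloc(1024)
b = heap.alloc(2048)
c = heap.alloc(1024)
heap.free(b)
middle_hole = list(heap.free_blocks)   # [(1024, 2048), (4096, 61440)]
heap.free(a)
heap.free(c)
print(middle_hole, heap.free_blocks)   # the second list is [(0, 65536)]
```

Freeing the middle allocation first leaves a 2 Kb hole that can't merge with anything, and it's only once its neighbours are freed as well that everything merges back into the original 64 Kb block.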
This simple memory allocator is vulnerable to fragmenting memory over time, because it starts out with large blocks, breaks them up as you ask for various amounts of memory, and can only merge freed allocations back into the original large blocks if everything in the block is free. If you start with a 64 Kb block, allocate all of it in various sized chunks, and then free most of it, you might be unlucky and wind up with small remaining allocated pieces that prevent you from having any large contiguous runs of free memory (you could also be lucky). This mixture is quite possible because consecutive chunks of memory are often allocated in roughly time order, and programs often intermix small and large allocations (first they want a small chunk, then a big chunk, then a few small chunks again, and so on).
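The unlucky case is easy to put numbers on. The following sketch tracks a 64 Kb block in 1 Kb units, assumes it was entirely allocated in address order, and then frees everything except one 1 Kb allocation in every 8 Kb. The pattern and the free_runs helper are made up for illustration.

```python
def free_runs(in_use):
    """Return the length of each contiguous run of free units."""
    runs, current = [], 0
    for used in in_use:
        if used:
            if current:
                runs.append(current)
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)
    return runs


# A 64 Kb block in 1 Kb units, fully allocated to start with.
in_use = [True] * 64
# Free everything except the last 1 Kb unit of every 8 Kb.
for unit in range(64):
    if unit % 8 != 7:
        in_use[unit] = False

runs = free_runs(in_use)
print(sum(runs), max(runs))   # prints "56 7"
```

So 56 Kb of the 64 Kb is free, but the largest allocation that can be satisfied from it is 7 Kb.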
Region allocators (such as Go has) break up (free) memory into regions where each region is dedicated to a single (rounded) size of object allocation, and these fixed size objects are arranged so that each object either never crosses a page boundary or always occupies a fixed number of pages (usually people assume 4 Kb pages). This makes allocating and freeing objects easier, automatically groups similar objects together, and allows relatively flexible reuse of free memory pages (a free page can be allocated to any small object size class).
Because region allocation never repeatedly breaks up a large block of memory for different allocations (especially different sized ones), you can't wind up in the original situation of a large amount of free memory that can't be merged together into a large block because there are a few allocated spots in the middle. Those allocated spots would be in their own regions, not in the middle of a valuable large block of a different (large sized) region. Region allocators do especially well at keeping small allocations from interfering with the free memory used for larger ones, even (or especially) when the program intermixed the allocation of the various sizes. The program may intermix its requests for various sizes, but the region allocator separates them out again.
Because programs often allocate a lot of objects that are page sized or less, region allocators can often usefully reuse even relatively modest chunks of free memory. If the region allocator works at the level of single pages (instead of larger blocks of them), a free page can be recycled for any size of small objects that happens to be in demand at the moment.
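To illustrate, here is a toy region allocator in Python that works at the level of single 4 Kb pages. It is not how Go's allocator is actually implemented; the size classes, the names, and the policy of handing a page back as soon as it's empty are all assumptions made to keep the sketch small. It deliberately intermixes small long-lived allocations with larger short-lived ones.

```python
PAGE_SIZE = 4096
SIZE_CLASSES = [64, 256, 1024, 4096]   # hypothetical rounded object sizes


class RegionAllocator:
    """Each page holds objects of exactly one size class."""

    def __init__(self, num_pages):
        # Unassigned pages, used as a stack so page 0 is handed out first.
        self.free_pages = list(range(num_pages - 1, -1, -1))
        self.page_class = {}   # page number -> size class it is dedicated to
        self.page_used = {}    # page number -> set of occupied slot numbers

    @staticmethod
    def size_class(size):
        for cls in SIZE_CLASSES:
            if size <= cls:
                return cls
        raise ValueError("large allocations are out of scope for this sketch")

    def alloc(self, size):
        cls = self.size_class(size)
        slots = PAGE_SIZE // cls
        # Look for a page of this size class that still has a free slot.
        for page, page_cls in self.page_class.items():
            if page_cls == cls and len(self.page_used[page]) < slots:
                break
        else:
            # None found: dedicate any free page to this size class.
            if not self.free_pages:
                raise MemoryError("out of pages")
            page = self.free_pages.pop()
            self.page_class[page] = cls
            self.page_used[page] = set()
        used = self.page_used[page]
        slot = next(s for s in range(slots) if s not in used)
        used.add(slot)
        return (page, slot)

    def free(self, handle):
        page, slot = handle
        self.page_used[page].remove(slot)
        if not self.page_used[page]:
            # The page is empty, so it can be reused for any size class.
            del self.page_used[page]
            del self.page_class[page]
            self.free_pages.append(page)


heap = RegionAllocator(num_pages=16)
small, large = [], []
for _ in range(8):
    small.append(heap.alloc(48))     # long-lived, 64-byte size class
    large.append(heap.alloc(1000))   # short-lived, 1024-byte size class
for handle in large:
    heap.free(handle)

free_after = len(heap.free_pages)
classes_after = dict(heap.page_class)
recycled = heap.alloc(200)           # 256-byte size class
print(free_after, classes_after, recycled, heap.page_class[recycled[0]])
```

Although the small and large allocations were requested alternately, all eight small objects wind up together on one page and the large ones on two pages of their own. Freeing the large objects returns both of those pages whole, and one of them is immediately recycled for a 256-byte object.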
Overall, I think we can say that region allocation reduces fragmentation by making the order of allocating and freeing memory less important. If you intermix allocating a bunch of different sized objects and then don't free all of them (or delay freeing them for a long time), in a simple allocator you wind up with allocated holes in your free ranges. In a region allocator, those different sized allocations go to different regions, and failing to free all of the objects of one size (in one region) doesn't cause problems for other regions of other sizes.
(Region allocators have their own forms of fragmentation, but these are often harder to trigger. A discussion of region allocator fragmentation is beyond the scope of this entry.)