Automating our 'bookable' compute servers with SLURM has created generic 'cattle' machines

October 6, 2019

I'll lead with the thing I realized. Several years ago I wrote about how all of our important machines were 'pets' instead of 'cattle'. One of the reasons for that was that people logged in to specific machines by name in order to use them, and so they cared if a particular machine went down (which is my view of the difference between pets and cattle). Due to recent changes in how we run a bunch of our compute servers, we've more or less transformed these compute servers into cattle machines. So here's the story.

We have some general use compute servers, but one of the traditional problems with them has been exactly that they were general use. You couldn't get one to yourself and worse, your work on the machine could be affected by whatever else other people decided to run on it too (fair share scheduling helps with this somewhat, but not completely). So for years we also had what we called 'bookable' compute servers, where you could reserve a machine for yourself for a while. At first this started small, with only a few machines, but then it started growing (and we also started adding machines with GPUs).

This created a steadily increasing problem for us, because we maintained these bookings mostly manually. There was some automation to send us email when a machine's booking status had to change, but we had to enter all of the bookings by hand and do the updates by hand. At the start of everything, with only a few machines, there were decent reasons for this; we didn't want to put together a complicated system with a bunch of local software, and it's always dangerous to set up a situation where somewhat fuzzy policies about fairness and so on are enforced through software. By the time we had a bunch of machines, both the actual work and dealing with various policy issues were an increasingly significant burden.

Our eventual solution was to adopt SLURM, configured so that it didn't try to share SLURM nodes (ie compute servers) between people. This isn't how SLURM wants to operate (it'd rather be a fine-grained scheduler), but it's the best approach for us. We moved all of our previous bookable compute servers into SLURM, wrote some documentation on how to use SLURM to basically log in to the nodes, and told everyone they had to switch over to using SLURM whether they liked it or not. Once pushed, people did move and they're probably now using our compute servers more than ever before (partly because they can now get a bunch of them at once for a few days, on the spot).
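(As a sketch of the kind of configuration involved, SLURM can be told to allocate whole nodes rather than individual CPUs, and to cap how long an allocation lasts. The node names and specific values here are illustrative assumptions, not our actual configuration:)

```
# slurm.conf fragment (illustrative; node names and limits are assumptions)

# Allocate entire nodes to jobs instead of scheduling individual CPUs.
SelectType=select/linear

# A partition whose nodes are never shared between jobs, with a
# three-day maximum allocation time.
PartitionName=compute Nodes=cpunode[01-16] Default=YES OverSubscribe=EXCLUSIVE MaxTime=3-00:00:00
```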

(We had previously operated a SLURM cluster with a number of nodes and tried to get people to move over from the bookable compute servers to it, without much success. Given a choice, most people would understandably prefer to use the setup they're already familiar with.)

This switch to allocating and managing access to compute servers through SLURM is only part of what has created genuine cattle; automated allocation of our bookable compute servers wouldn't really have had the same effects. Part of it is how SLURM operates: you don't book a machine and then get to log in to it; normally you run a SLURM command and you (or your script) are dumped onto the machine you've been assigned. When you quit or your script exits, your allocation is gone (and you may not be able to get the particular machine back again, if someone else is in the queue). And I feel the final bit of it is that we only let each allocation last for a few days, so no matter what, you're getting interrupted before too long.
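(In concrete terms, this is generic SLURM usage rather than our exact local instructions, but the workflow looks something like this:)

```
# Ask SLURM for one whole node and get an interactive shell on it;
# which node you land on is the scheduler's choice, not yours.
srun --nodes=1 --pty /bin/bash

# Or hold an allocation for up to two days and run things under it;
# when salloc exits, the allocation is released.
salloc --nodes=1 --time=2-00:00:00
```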

You can insist on treating SLURM nodes as pets, picking a specific one out to use and caring about it. But SLURM and our entire setup pushes people towards not caring what they get and using nodes only on a transient basis, which means that if one node goes away it's not a big deal.

(This is a good thing because it turns out that some of the donated compute server hardware we're using is a bit flaky and locks up every so often, especially under load. In the days of explicitly booked servers, this would have been all sorts of problems; now people just have to re-submit jobs or whatever, although it's still not great to have their job abruptly die part-way through.)
