The value of automation having ways to shut it off (a small story)

January 20, 2020

We have some old donated Dell C6220 blades that we use as SLURM-based compute servers. Unfortunately, these machines appear to have some sort of combined hardware and software fault that causes them to lock up periodically under some loads (building Go from source with full tests is especially prone to triggering it). Fortunately these machines support IPMI and so can be remotely power cycled, and a while back we got irritated enough at the lockups that we set up their IPMIs and built a simple cron-based set of scripts to do this for us automatically.

(The scripts take the simple approach of detecting down machines by looking for alerts in our Prometheus system. To avoid getting in our way, they only run outside of working hours; during the working day, if a Dell C6220 blade goes down we have to run the 'power cycle a machine via IPMI' script by hand against the relevant machine. This lets us deliberately shut down machines without having them suddenly restarted on us.)
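As a rough illustration of the 'only outside working hours' rule, here is a minimal Python sketch. It is not our actual scripts; the working hours, the function names, and the idea that the list of down machines arrives as plain host names (however it was extracted from Prometheus alerts) are all assumptions made for the example.

```python
from datetime import datetime, time

# Hypothetical working hours; the real values are not stated anywhere above.
WORK_START = time(9, 0)
WORK_END = time(17, 0)


def in_working_hours(now):
    """True on Monday through Friday between WORK_START and WORK_END."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and WORK_START <= now.time() < WORK_END


def machines_to_power_cycle(down_machines, now):
    """Return the machines an automatic run should restart right now.

    During working hours this is always empty, so a machine that was
    shut down on purpose is not restarted behind anyone's back.
    """
    if in_working_hours(now):
        return []
    return sorted(down_machines)
```

Called from cron with `datetime.now()`, something like this would leave daytime restarts to a human and only act in the evening, at night, and on weekends.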

All of these Dell C6220 blades are located in a secondary machine room that has the special power they need. Unfortunately, this machine room's air conditioner seems to have developed some sort of fault where it just stops working until you turn it off, wait a bit, and turn it back on. Of course this doesn't happen during the working day; instead it has happened in the evening or at night (twice, recently). When this happens and we see the alerts from our monitoring system, we notify the relevant people and then power off all or almost all of the servers in the room, including the Dell C6220 blades.

You can probably see where this is going. Fortunately we thought of the obvious problem here before we started powering down the C6220 blades, so both times we just manually disabled the cron job that auto-restarts them. However, you can probably imagine what sort of problems we might have if we had a more complex and involved system to automatically restart nodes and servers that were 'supposed' to be up; in an unusual emergency situation like this, we could be fighting our own automation if we hadn't thought ahead to build in some sort of shutoff switch.

Or in short, when you automate something, think ahead to how you'll disable the automation if you ever need to. Everything needs an emergency override, even if that's just 'remove the cron job that drives everything'.

It's fine if this emergency stop mechanism is simple and brute force. For example, our simple method of commenting out the cron job is probably good enough for us. We could build a more complex system (possibly with finer-grained controls), but it would require us to remember (or look up) more about how to shut things off.
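If you want something one small step up from editing the crontab, a flag file is about as simple as an emergency stop gets. The following Python sketch is hypothetical and is not what our scripts do; the path and the function name are made up for illustration.

```python
import os

# Hypothetical path; any location that is easy to remember under stress works.
DISABLE_FLAG = "/etc/auto-powercycle.disabled"


def automation_enabled(flag_path=DISABLE_FLAG):
    """The automation may run only while the flag file does not exist."""
    return not os.path.exists(flag_path)
```

The restart script would check `automation_enabled()` first and exit quietly if it returns False. Stopping everything is then a matter of creating the file, and re-enabling it is a matter of removing the file. The tradeoff is the same one as above: it is one more thing to remember the name of.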

We could also give the auto-restart system some safety features. An obvious one would be to get the machine room temperature from Prometheus and refuse to start up any of the blade nodes if it's too hot. This is a pretty specific safety check, but we've already had two AC incidents in close succession so we're probably going to have more. A more general safety check would be to refuse to turn on blades if too many of them are down, on the grounds that a lot of blades being down is almost certainly not because of the problem that the script was designed to deal with.
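Both of these checks are small enough to sketch. The Python below is a hypothetical illustration, not something we have built; the thresholds and names are invented, and it assumes the room temperature and the blade counts have already been fetched (for instance from Prometheus) and are passed in as plain numbers.

```python
# Hypothetical thresholds; sensible values depend on the room and the cluster.
MAX_ROOM_TEMP_C = 30.0
MAX_DOWN_FRACTION = 0.5


def safe_to_restart(room_temp_c, down_count, total_blades):
    """Decide whether automatic power cycling should go ahead.

    Returns (ok, reason). A missing temperature reading is treated as
    unsafe, since not knowing is itself a sign that something is wrong.
    """
    if room_temp_c is None:
        return False, "no temperature reading"
    if room_temp_c > MAX_ROOM_TEMP_C:
        return False, "machine room too hot"
    if total_blades <= 0:
        return False, "no blades known"
    if down_count / total_blades > MAX_DOWN_FRACTION:
        return False, "too many blades down"
    return True, "ok"
```

The important design choice is that every uncertain case fails toward doing nothing. A blade that stays locked up until morning is an annoyance; blades that power themselves back on in an overheating room are a real problem.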

Written on 20 January 2020.

Last modified: Mon Jan 20 23:54:20 2020