We might want to regularly keep track of how important each server is

February 5, 2024

Today we had a significant air conditioning failure in our main machine room, one that certainly couldn't be fixed on the spot ('glycol all over the roof' is not a phrase you really want to hear about your AC's chiller). To keep the machine room's temperature down, we had to power off as many machines as possible without affecting the services we offer to people here too badly, and those services are rather varied. Some choices were obvious; all of our SLURM nodes in the main machine room got turned off right away. But other machines weren't ones we necessarily remembered right away, or we weren't sure whether they were safe to turn off and what effects turning them off would have. In the end we went through several rounds of turning servers off, looking at what was left, spotting more machines we could do without, and turning those off too, and we're probably not done yet.

(We have secondary machine room space and we're probably going to have to evacuate servers into it, too.)

One thing we could do to avoid this flailing in the future is to explicitly (try to) keep track of which machines are important and which ones aren't, to pre-plan which machines we could shut down if we had a limited amount of cooling or power. If we documented this, we could avoid having to wrack our brains at the last minute and worry about dependencies or uses that we'd forgotten. Of course documentation isn't free; there's an ongoing amount of work to write it and keep it up to date. But possibly we could do this work as part of deploying machines or changing their configurations.

(This would also help identify machines that we didn't need any more but hadn't gotten around to taking out of service; we found a couple of those in this iteration.)
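If we did write this down, it wouldn't have to be anything elaborate. As a minimal sketch (with entirely made-up hostnames, priorities, and notes), a small machine-readable inventory plus a few lines of Python to print what we could turn off first might look something like this:

    #!/usr/bin/python3
    # A made-up inventory: (hostname, shutdown priority, note).
    # A lower priority number means it is safer to turn off first.
    MACHINES = [
        ("slurm-node-12",   1, "SLURM compute node, nothing depends on it"),
        ("test-scratch",    1, "scratch test machine; should probably be retired"),
        ("build-server",    2, "can be off for days, people will grumble"),
        ("imap-server",     5, "user-visible email; last resort only"),
        ("main-fileserver", 5, "almost everything breaks without it"),
    ]

    def can_turn_off(up_to_priority):
        """Machines we could shut down if we can only keep the most
        important things running, least important first."""
        victims = [m for m in MACHINES if m[1] <= up_to_priority]
        return sorted(victims, key=lambda m: m[1])

    if __name__ == "__main__":
        for host, prio, note in can_turn_off(up_to_priority=2):
            print(f"{prio}  {host:<16} {note}")

(The script part is trivial; the real work, as always, would be keeping the inventory honest and up to date.)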

Writing all of this down just in case of further AC failures is probably not all that great a choice of where to spend our time. But writing down this sort of thing can often help to clarify how your environment is connected together in general, including things like what will probably break or have problems if a specific machine (or service) is out, and perhaps which people depend on what service. That can be valuable information to have on hand. The machine room archaeology of 'what is this machine, why is it on, and who is using it' can be fun occasionally, but you probably don't want to do it regularly.

(Will we actually do this? I suspect not. When we deploy and start using a machine its purpose and so on feel obvious, because we have all of the context.)


Comments on this page:

By Arnaud Gomes at 2024-02-06 03:25:03:

We had the same kind of issues at a previous workplace; we ended up writing a "reboot the machine room HOWTO". It forced us to identify dependencies, and anything in the bottom half of the list was probably not essential.

If it is decided to try to document hardware, Netbox is pretty good and worth looking at.

   -- A

By goatops at 2024-02-06 07:25:48:

We set up a simple color-coded sticker scheme for our ops team to follow in the event of a power failure where we wanted to maximise available power from the UPS. Each server had a green (can be shut down immediately), orange (can be shut down with notice) or red (try not to shut down at all) sticker affixed to the front. Worked pretty well for us.

By Anonymous at 2024-02-06 09:22:31:

Perhaps (slightly) related: the first documented time people thought about this formally (as far as I can remember) was in this paper at Usenix/LISA '98: "Bootstrapping an Infrastructure" (https://www.usenix.org/legacy/event/lisa98/traugott.html), which might or might not still be relevant today.

By Milo at 2024-02-06 12:45:16:

We also don't want to end up like this person quoted on bash.org (which seems to have gone offline recently):

<erno> hm. I've lost a machine.. literally _lost_. it responds to ping, it works completely, I just can't figure out where in my apartment it is.

By Miksa at 2024-02-07 09:54:03:

We had a similar experience a few years back, but for a sillier reason. On a Sunday morning one of our datacenters started approaching 60C; a bunch of servers had already turned themselves off preemptively, and a few of us showed up to investigate and open the doors for extra cooling. Sitting at the door, we started pondering what the reason could be that 3 out of 4 cooling units were turned off. We surmised that some kind of building automation controls the cooling based on temperature sensors, and that one way or another the data from the sensors needs to be transmitted out. One of us had a recollection, or a hunch, that the cabinet with a door in the datacenter wall might have something to do with it, and we decided to take a look.

Inside we found a small router-looking device and a small Eaton UPS. The device was off, an indication of a dead UPS, so we decided to see what would happen if we unplugged the power cords from the UPS and connected them to each other. The device came alive, and soon after the cooling units started turning back on.

The datacenter has a UPS the size of a large room, and it all came crashing down because of a little UPS no one even remembered existed.

This experience was a big incentive to go through the process you are considering. Our goal was to produce lists with a startup order for the servers in case a datacenter had gone down. The first phase was documenting the role (dev/test/prod/administration) and priority tier (1-5) for all servers. We already had this information, but it was quite spotty. An annoying ordeal, but not too bad in the end. Create the lists, have a couple of ops staff go through them and add their opinions, then as a full group comb through the list and negotiate an educated guess for all of them. It takes a few hours but is worth it. A parallel task was to modify our orchestration tool to ask for this info for all new servers.

The next phase was to create a script that generates the lists based on this information and the rack locations of the servers. A great help for this was our scripted maintenance windows. Many frontend servers had scripts that make them wait until some other server has finished booting and some service, usually a database, is back online. The backend server would automatically get at least a .5 higher priority than the frontend. Then it was just a matter of uploading these lists to a website. The biggest remaining obstacle is regularly printing these lists out for the datacenters.
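For illustration, the core of that kind of list generation can be quite small. The server names, tiers, and fields below are made up, not from our actual tooling:

    #!/usr/bin/python3
    # Made-up inventory: (name, priority tier 1-5 (1 = start first),
    # rack location, and the server it waits for at boot, if any).
    SERVERS = [
        ("db-backend",   2, "rack-a1", None),
        ("web-frontend", 3, "rack-a2", "db-backend"),
        ("report-gen",   4, "rack-a2", "db-backend"),
        ("test-box",     5, "rack-b3", None),
    ]

    def startup_order(servers):
        """Sort servers into boot order: lower effective tier first,
        then by rack.  Anything another server waits for is bumped at
        least .5 ahead of its dependents (a single pass; deep chains
        would need repeating it)."""
        effective = {name: float(tier) for name, tier, _, _ in servers}
        for name, tier, _, waits_for in servers:
            if waits_for is not None:
                effective[waits_for] = min(effective[waits_for], tier - 0.5)
        return sorted(servers, key=lambda s: (effective[s[0]], s[2]))

    if __name__ == "__main__":
        for name, tier, rack, _ in startup_order(SERVERS):
            print(f"{tier}  {rack:<8} {name}")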

By Milo at 2024-02-07 11:26:17:

In aviation, there are checklists for nearly everything. Regular maintenance, pre-boarding inspections, power-up, take-off, and of course the abnormal ones like "engine failure" and "rapid decompression". There have been some efforts to bring this mindset into other fields such as surgery. One of the major difficulties is trying to keep these short and helpful, rather than something that's accumulated "cruft" and is perceived as a clock-gobbling chore.

I wonder if that would be useful for system administrators. Like, a cooling failure checklist, a post-power-outage checklist, a DNS failure checklist, and so on. A big benefit in this field would be that many of the tasks, such as "new server bringup" and most things related to monitoring, could be scripted.

By cks at 2024-02-07 14:10:32:

I'm a big fan of checklists, but at the same time I think there are real issues. In general, documentation isn't free and checklists are a form of documentation. For checklists related to failures, there's the additional issue that documentation needs testing, which can be hard to do if you need an actual failure or a sufficiently accurately simulated one to test your checklist with (and it also takes time). In system administration, checklists generally can't be static things that are created once, because the environment is constantly changing; this means not just updating but re-checking and so on.

System environments are often sufficiently complicated that it's very hard to foresee all effects of a failure or all interactions that your systems have (some would say it's impossible). It's a classic story in the field that 'we thought we understood everything and had mitigated everything, except, surprise, we hadn't'.

(Our checklists work best for routine things like installing machines and for exceptional events that we can consider carefully in advance, like planned power shutdowns.)

By Milo at 2024-02-07 16:04:09:

Chapter 6 ("The Checklist Factory") of The Checklist Manifesto makes similar points: they shouldn't be static, they're definitely not free (to create or use, and thus can't cover everything), and testing is needed. If you're unfamiliar, it might be worth a trip to your excellent university library.

There are good checklists and bad, Boorman explained. Bad checklists are vague and imprecise. They are too long; they are hard to use; they are impractical. They are made by desk jockeys with no awareness of the situations in which they are to be deployed. They treat the people using the tools as dumb and try to spell out every single step. They turn people’s brains off rather than turn them on.

Good checklists, on the other hand, are precise. They are efficient, to the point, and easy to use even in the most difficult situations. They do not try to spell out everything—a checklist cannot fly a plane. Instead, they provide reminders of only the most critical and important steps—the ones that even the highly skilled professionals using them could miss. Good checklists are, above all, practical.

The power of checklists is limited, Boorman emphasized. […]

[Testing] is not easy to do in surgery, I pointed out. Not in aviation, either, he countered. You can’t unlatch a cargo door in mid-flight and observe how a crew handles the consequences. But that’s why they have flight simulators, and he offered to show me one. […]

The three checklists took no time at all—maybe thirty seconds each—plus maybe a minute for the briefing. The brevity was no accident, Boorman said. People had spent hours watching pilots try out early versions in simulators, timing them, refining them, paring them down to their most efficient essentials.

I doubt it's practical for you to spend weeks observing rookies in sysadmin simulators. But when I helped with on-boarding of new employees, I found it helpful to refer them to a "new employee" wiki page and ask them to edit it or ask questions as necessary; when they came to me with a question I thought should have been covered there, but wasn't, I could add it while we were speaking. Same for employees with more tenure: if you spent a bunch of time figuring something out, and it'll take less time to document it, write it on the wiki. (We also had a very bureaucratic "official document" process—so bureaucratic that, quite frankly, most of us didn't know what the process for revising a document was... so we didn't do that, and hence few people who weren't ISO auditors ever looked at them.)

As for the general idea of simulation: simulating a network of servers would take significant time to set up, and would never have 100% fidelity (there's always some dusty old machine in a closet that no remaining employee knows is important), but maybe should be something to aspire to. If done very well, (almost) the whole environment could be deployed into virtual machines for (limited) failure testing; if done extraordinarily well, with budgets unattainable to most sysadmins, testing could involve unplugging production machines randomly (cf. Netflix's "Chaos Monkey").
