Things I will do differently in the next building power shutdown (part 2)

May 9, 2012

Back at the start of last September, we had an overnight building-wide power shutdown in the building with our machine room and I wrote a lessons-learned entry in the aftermath. Well, we just had another one and apparently I didn't learn all of the lessons that I needed to learn the first time around. So here's another set of things that I've now learned.

Next time around I will:

  • explicitly save the previous time's checklist. If nothing else, the 'power up' portion makes a handy guide for what to do if you abruptly lose building power some day.

    (I sort of did this last time, not through active planning but just because I reflexively avoid deleting this sort of stuff. But I should do it deliberately and put it somewhere I can easily find it, instead of just leaving it lying around.)

    Having last time's list isn't the end of the work, because things have undoubtedly changed since then. But it's a starting point and a jog to the memory.

  • start preparing the checklist well in advance, like more than a day beforehand. Things worked out in the end but doing things at the last moment was a bit nerve-wracking.

    (There's always stuff to do around here and somehow it always felt like there was plenty of time right up until it was Friday and we had a Monday night shutdown.)

  • update and correct the checklist immediately afterwards to cover things that we missed. My entry from last time is kind of vague; I'm sure I knew the specifics I was thinking of at the time, but I didn't write them down so they slipped away. I was able to reconstruct a few things from notes and email in the wake of last time, but others I only realized in the aftermath of this one.

  • add explanatory notes about why things are being done in a certain order and what the dependencies are. Especially in the bustle of trying to get everything down or up as fast as possible, it's useful to have something to jog our memories about why something is the way it is and whether or not it's that important.

    (Our checklists for this sort of thing are not fixed; they're more guidelines than requirements. We deviate from them on the fly and thus it's really useful to have some indication of how flexible or rigid things are.)

  • if any machines are being brought down and then deliberately not being brought back up, explicitly mention this so that people aren't confused later by a 'missing' machine.

My entry from last time was very useful in several ways. I reread it when I was preparing our checklist for this time around and it jogged my memory about a number of important issues; as a result, this time's checklist was (I think) significantly better than last time's (and also noticeably longer and more verbose). This time I at least made new mistakes, which is progress that I can live with.

I will also probably try to put more explanation into the checklist the next time around. I'm sure it's possible to put too much of it in, but I don't think that's been our problem so far. In the heat of the moment we're going to skim anyway, so the thing to do is to break the checklist up into skimmable blocks with actions and things to check off, followed by chunks of additional explanation.

(In a sense a checklist like this serves two purposes at once. During the power down or power up it is mostly a catalog of actions and ordering, but beforehand it's a discussion and a rationale for what needs to be done and why. Without the logic behind it being written out explicitly, you can't have that discussion; once you have that logic written out, you might as well leave it in to jog people's memories on the spot.)
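
To make that concrete, a skimmable block in such a checklist might look something like this (the machine names are made up for illustration; they're not our real setup):

    [ ] power on the core fileserver and wait until it is serving NFS.
        (DHCP for several networks depends on it, so very little else
        can usefully come up before it does.)

    [ ] power on the DHCP and DNS servers; these can be done in
        parallel once the fileserver is up.

    [ ] power on the remaining servers in any convenient order.
        ('oldbox' is deliberately staying powered off, so don't go
        looking for it.)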

On a side note, a full power up is an interesting and useful way to find problematic dependencies that have quietly worked their way into your overall network, ones that are not so noticeable when your systems are in their normal steady state. For example, DHCP service for several of our networks now depends on our core fileserver, which means that DHCP can only come up fairly late in the power up process. We're going to be fixing that.

(There is a chain of dependencies that made this make sense in a steady state environment.)


Comments on this page:

From 70.26.89.246 at 2012-05-09 19:37:00:

While not a replacement for a checklist, I find that having a monitoring system (e.g., Nagios) that is as independent as possible from the rest of the network is useful as well. (Also make sure you have one machine monitoring it so you know when it is down. :) Just turn off notifications globally, otherwise you'll be spammed.
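
For example, the simplest global off switch lives in the main nagios.cfg (set it back to 1 and reload Nagios afterwards); there's also a DISABLE_NOTIFICATIONS external command if you'd rather poke a running daemon:

    # in nagios.cfg: turn off all host and service notifications
    # program-wide for the duration of the planned shutdown
    enable_notifications=0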

There will be times when one forgets to update the list, and times when one forgets to add a service to Nagios, but hopefully what slips through both will be less than what slips through either one alone.

Also, while a bit tedious at times, I've found setting up network dependencies in Nagios to be useful for partial outages as well:

http://nagios.sourceforge.net/docs/3_0/networkreachability.html
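
As a rough sketch of what the parent/child piece looks like (the hostnames, addresses, and the generic-host template are placeholders; the template is assumed to supply the usual check and contact settings):

    define host{
            use             generic-host
            host_name       core-switch
            alias           Machine room core switch
            address         192.0.2.1
            }

    define host{
            use             generic-host
            host_name       fileserver
            alias           Core fileserver
            address         192.0.2.10
            parents         core-switch   ; reported as UNREACHABLE, not DOWN, when the switch is off
            }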

For large/important/central machines that may be powered up early in the list (and that everything else may depend on), it may be useful to record how long they take to get going. Back in the day, we had a large-for-the-time Sun E3500 that took seventeen minutes to go from button press to login prompt on the console. The first time we started it up after a full shutdown, we got nervous when it took so long (luckily it wasn't too central to the network, so we could carry on and check on it every once in a while).
