Things I will do differently in the next building power shutdown (part 2)
May 9, 2012
Back at the start of last September, we had an overnight building-wide power shutdown in the building with our machine room, and I wrote a lessons-learned entry in the aftermath. Well, we just had another one, and apparently I didn't learn all of the lessons that I needed to the first time around. So here's another set of things that I've now learned.
Next time around I will:
My entry from last time was very useful in several ways. I reread it when I was preparing our checklist for this time and it jogged my memory about several important issues; as a result our checklist for this time around was (I think) significantly better than for last time (and also noticeably longer and more verbose). This time I at least made new mistakes, which is progress that I can live with.
I will also probably try to put more explanation into the checklist the next time around. I'm sure it's possible to put too much of it in, but I don't think that's been our problem so far. In the heat of the moment we're going to skim anyway, so the thing to do is to break the checklist up into skimmable blocks of actions and things to check off, each followed by chunks of additional explanation.
(In a sense a checklist like this serves two purposes at once. During the power down or power up it is mostly a catalog of actions and ordering, but beforehand it's a discussion and a rationale for what needs to be done and why. Without the logic behind it being written out explicitly, you can't have that discussion; once you have that logic written out, you might as well leave it in to jog people's memories on the spot.)
On a side note, a full power up is an interesting and useful way to find problematic dependencies that have quietly worked their way into your overall network, ones that are not so noticeable when your systems are in their normal steady state. For example, DHCP service for several of our networks now depends on our core fileserver, which means that it can only come up fairly late in the power up process. We're going to be fixing that.
(There is a chain of dependencies that made this make sense in a steady state environment.)
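A minimal sketch of one way to look for this sort of thing ahead of time, assuming you have written down what each service needs before it can start. It uses the standard library's `graphlib` (Python 3.9 or later) to group services into power-up stages. Only the DHCP-on-fileserver dependency comes from the situation described above; the other service names and the `DEPENDS_ON` map are hypothetical and purely for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service maps to the set of services
# that must be up before it can start. Only "dhcp" needing
# "core-fileserver" reflects the text; everything else is assumed.
DEPENDS_ON = {
    "network-switches": set(),
    "core-fileserver": {"network-switches"},
    "dns": {"network-switches"},
    "dhcp": {"core-fileserver"},
}


def power_up_stages(depends_on):
    """Group services into stages. Every service in a stage can be
    started once all of the earlier stages are up."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on a dependency loop
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


def stage_of(stages, service):
    """Return the index of the stage that a service falls into."""
    for index, stage in enumerate(stages):
        if service in stage:
            return index
    raise KeyError(service)


if __name__ == "__main__":
    for index, stage in enumerate(power_up_stages(DEPENDS_ON)):
        print(f"stage {index}: {', '.join(stage)}")
```

With a map like this, a service that lands in a later stage than you expected (here, `dhcp` ending up behind `core-fileserver`) is a hint that a dependency has crept in somewhere.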