Things I will do differently in the next building power shutdown

September 1, 2011

We recently had an overnight, building-wide power shutdown in the building with our machine room. As you can imagine, a total machine room shutdown (and later restart) is an interesting time. We made checklists for both the shutdown and the restart, and for the most part things went fine (although they took longer than expected). But still, there are a few things that I will do differently the next time that this happens:

  • make a list of all of our machines and then go through the checklists making sure that each machine is covered in both, either by name or as part of a generic group like 'all fileservers' or 'all Ubuntu machines'. We failed to specifically cover a few machines, which led to uncertainty about when they were intended to be taken down and brought back.

  • for machines that are part of some generic category in the checklist (eg 'now shut down all fileservers'), print out a list of the machine names (in advance) so that they can be ticked off as you shut them down or bring them back up.

    (I forgot to do this for some generic categories and only noticed the omission after we'd shut down the print server, which led to me having to hand-write the machine names on my sheet.)

  • when preparing shutdown checklists, remember any odd bits of your network topology so that you don't shut down a gateway before the machines behind it. Our firewalls and their hot spares sit on an odd unrouted subnet that is reached through one of our general Ubuntu machines, and of course we shut down all of the Ubuntu machines early. This wasn't fatal, but it did make us feel kind of silly that we'd missed a chance to shut down all of the hot spares from the convenience of our offices.

    (We shut down the active firewalls very late, but the hot spares were pretty much unnecessary once our formal shutdown process started.)

  • ask yourself what important cron jobs won't get run due to the shutdown. For each one, work out whether you need to run it by hand after you bring everything back or whether the next cron run will automatically take care of things. As it turns out, our network traffic accounting system's daily aggregation process didn't get run and now needs to get fixed by hand.

  • explicitly list and then tick off unusual things to check, even if everyone is going to remember them. If nothing else, having them explicitly listed makes it much more likely that only a single person will check them instead of everyone remembering 'oh yeah, we need to check weird service <X>' and doing it separately.

    (When something is a step on the checklist, you're more likely to pause to at least ask co-workers if anyone has already done it.)
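
(Much of the cross-checking above could be done by a small script instead of by eye. What follows is only a rough sketch in Python, not something we actually ran. The machine names, the group names, and the 'reached through' map are all made up for illustration, and it assumes Python 3.9 or later for graphlib. The first function reports machines that a checklist doesn't cover either by name or through a generic group; the second produces a shutdown order in which machines behind a gateway come before the gateway itself.)

```python
from graphlib import TopologicalSorter

# Hypothetical inventory: machine name -> the generic groups it belongs to.
INVENTORY = {
    "fs1": {"fileservers"},
    "fs2": {"fileservers"},
    "apps1": {"ubuntu"},
    "gw1": {"ubuntu"},
    "fw-spare": set(),
    "printsrv": set(),
}


def covered_machines(checklist, inventory):
    """Return the machines a checklist covers, by name or by generic group."""
    covered = set()
    for entry in checklist:
        if entry in inventory:
            # The checklist names this machine directly.
            covered.add(entry)
        else:
            # Otherwise treat the entry as a group name.
            covered.update(m for m, groups in inventory.items() if entry in groups)
    return covered


def uncovered_machines(checklist, inventory):
    """Return a sorted list of machines the checklist never mentions."""
    return sorted(set(inventory) - covered_machines(checklist, inventory))


def shutdown_order(reached_through):
    """Order machines so that each one is shut down before its gateways.

    reached_through maps a machine to the set of machines that must still
    be up in order to reach it (an assumption about how you record this).
    """
    graph = {}
    for machine, gateways in reached_through.items():
        graph.setdefault(machine, set())
        for gateway in gateways:
            # In TopologicalSorter, a node's predecessors come first, so the
            # machine behind the gateway is a predecessor of the gateway.
            graph.setdefault(gateway, set()).add(machine)
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    shutdown_checklist = ["fileservers", "ubuntu", "printsrv"]
    print("not covered:", uncovered_machines(shutdown_checklist, INVENTORY))
    print("shutdown order:", shutdown_order({"fw-spare": {"gw1"}}))
```

(Running the coverage check against both the shutdown and the restart checklists would flag machines that were left out of either one, and the expanded group membership is exactly the list of names you'd want to print out in advance for ticking off.)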

Next time I will also remember to take my printed copy of the checklist with me everywhere, no matter what, so that I can always tick off things on it if I find out that they're done or unnecessary or whatever. It may just be my peculiarity, but I find that I really like having physical paper and a pen so that I can literally tick off or cross out things to keep track of them. After events like this my printed checklists are always a sea of tick marks, crossed out bits, and cryptic notations.

(I don't have any particular consistency in how I mark up my checklists; I just do whatever makes sense for me at the time.)

(Having written this down, hopefully I will remember all of these good intentions when the next major machine room shutdown happens. Hopefully it won't be any time soon, although there is more major power work to be done on our building at some point in the future.)

Written on 01 September 2011.

Last modified: Thu Sep 1 23:35:57 2011