Things I will do differently in the next building power shutdown
We recently had an overnight power shutdown of the entire building that houses our machine room. As you can imagine, a total machine room shutdown (and later restart) is an interesting time. We made checklists for both the shutdown and the restart, and for the most part things went fine (although both took longer than expected). Still, there are a few things that I will do differently the next time this happens:
- make a list of all of our machines and then go through the checklists
making sure that each machine is covered in both, either by name
or as part of a generic group like 'all fileservers' or 'all
Ubuntu machines'. We didn't explicitly cover a few machines,
which led to uncertainty about when they were supposed to be taken
down and brought back up.
- for machines that are part of some generic category in the checklist
(eg 'now shut down all fileservers'), print out a list of the
machine names (in advance) so that they can be ticked off as you
shut them down or bring them back up.
(I forgot to do this for some generic categories and only noticed the omission after we'd shut down the print server, so I had to hand-write the machine names on my sheet.)
- when preparing shutdown checklists, make sure to account for any odd
bits of your network topology so that you don't shut down a gateway
before the machines behind it. Our firewalls and their hot spares
sit on an odd unrouted subnet that
is reached through one of our general Ubuntu machines, and of
course we shut down all of the Ubuntu machines early. This wasn't
fatal, but it did make us feel kind of silly that we'd missed a
chance to shut down all of the hot spares from the convenience
of our offices.
(We shut down the active firewalls very late, but the hot spares were pretty much unnecessary once our formal shutdown process started.)
- ask yourself what important cron jobs won't get run due to the
shutdown, and whether you need to run them by hand after you bring
everything back or whether the next cron run will automatically
take care of things. As it turns out, our network traffic accounting
system's daily aggregation process didn't get run, so it now needs
to be fixed up by hand.
- explicitly list and then tick off unusual things to check, even
if everyone is going to remember them. If nothing else, having
them explicitly listed makes it much more likely that only a
single person will check them instead of everyone remembering 'oh
yeah, we need to check weird service <X>' and doing it separately.
(When something is a step on the checklist, you're more likely to pause to at least ask co-workers if anyone has already done it.)
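The first three points above are mechanical enough that a small script could catch them before the shutdown. What follows is only a rough sketch in Python, not something we actually ran. Every machine name, group name, and the `reached_via` map are made up for illustration, and it assumes the checklist can be written down as an ordered list of steps where each step is either a machine name or a group name.

```python
from typing import Dict, List, Set, Tuple

def expand_steps(steps: List[str], groups: Dict[str, List[str]]) -> List[str]:
    """Expand group names in an ordered checklist into individual machines."""
    expanded: List[str] = []
    for step in steps:
        # A step that is not a known group is treated as a single machine.
        expanded.extend(groups.get(step, [step]))
    return expanded

def uncovered(inventory: Set[str], steps: List[str],
              groups: Dict[str, List[str]]) -> List[str]:
    """Machines in the inventory that no checklist step covers."""
    return sorted(inventory - set(expand_steps(steps, groups)))

def gateway_violations(steps: List[str], groups: Dict[str, List[str]],
                       reached_via: Dict[str, str]) -> List[Tuple[str, str]]:
    """(machine, gateway) pairs where the gateway is shut down first."""
    order = {name: i for i, name in enumerate(expand_steps(steps, groups))}
    bad = []
    for machine, gateway in reached_via.items():
        if machine in order and gateway in order and order[gateway] < order[machine]:
            bad.append((machine, gateway))
    return sorted(bad)

def tick_sheet(steps: List[str], groups: Dict[str, List[str]]) -> str:
    """Printable checklist with every group written out machine by machine."""
    lines: List[str] = []
    for step in steps:
        if step in groups:
            lines.append(f"{step}:")
            lines.extend(f"  [ ] {m}" for m in groups[step])
        else:
            lines.append(f"[ ] {step}")
    return "\n".join(lines)

# Hypothetical data; none of these names are real.
inventory = {"fs1", "fs2", "apps1", "apps2", "fw-spare", "fw-active", "printsrv"}
groups = {
    "all Ubuntu machines": ["apps1", "apps2"],
    "all fileservers": ["fs1", "fs2"],
}
shutdown_steps = ["all Ubuntu machines", "fw-spare", "all fileservers", "fw-active"]
# Machines that can only be reached through some other machine.
reached_via = {"fw-spare": "apps1", "fw-active": "apps1"}

print("not covered:", uncovered(inventory, shutdown_steps, groups))
print("gateway down too early:",
      gateway_violations(shutdown_steps, groups, reached_via))
print(tick_sheet(shutdown_steps, groups))
```

With this made-up data the script reports `printsrv` as uncovered and flags both firewalls as being shut down after the machine they are reached through. A flagged pair is not necessarily a mistake (we shut down our active firewalls very late on purpose), but it is something you want to have decided deliberately. The same functions can be run against the restart checklist, and the `tick_sheet` output is what I'd print out and carry around.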
Next time I will also remember to take my printed copy of the checklist with me everywhere, no matter what, so that I can always tick off things on it if I find out that they're done or unnecessary or whatever. It may just be my peculiarity, but I find that I really like having physical paper and a pen so that I can literally tick off or cross out things to keep track of them. After events like this my printed checklists are always a sea of tick marks, crossed out bits, and cryptic notations.
(I don't have any particular consistency in how I mark up my checklists; I just do whatever makes sense for me at the time.)
(Having written this down, hopefully I will remember all of these good intentions when the next major machine room shutdown happens. Hopefully it won't be any time soon, although they have more major power work to do to our building at some point in the future.)