Emergency procedures checklists need check steps

February 5, 2010

Given my previous entry, here is a thesis about emergency procedure documentation: you shouldn't just have a checklist for what to do; your checklist should include actual check steps, points where you stop to explicitly confirm that you've done something and that it actually worked.

Checklists are a good idea, but the common form of a checklist is just a list of steps to be carried out. Under the stress of an emergency situation, I don't think that this is good enough. First, your checklist implicitly assumes that everything works right. Second, it's too easy to be rushed, distracted by some interruption, sleep-deprived, or whatever while you're going through the checklist and so lose track of where exactly you are, mis-do something, or miss the potentially subtle signs that something is not working the way that your checklist assumes.

Thus, you need spots in your checklist where you do not do things but check things; you take positive steps to make sure that everything is as it should be and that the system is in the state that you and your checklist assume that it is. These checks ensure that if something goes wrong, either in the environment or in you carrying out the checklist, it gets noticed before things go horribly off the rails and explode.

In short: it's not good enough to have a checklist item that says 'throw switch 12'; you need something to confirm that you have in fact thrown switch 12 (and ideally just switch 12) and that the results of throwing switch 12 are what you expect.
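To make the switch 12 example concrete, here is a minimal sketch in Python of a checklist where every step carries its own explicit check. Everything in it is hypothetical and made up for illustration: the `Step` and `run_checklist` names, the `switches` dictionary standing in for a real panel, and the two helper functions. A real checklist would likely be on paper and the checks done by a person; the point is only the shape, where "do" and "confirm" are separate, written-down items.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One checklist item: something to do plus a way to confirm it."""
    description: str
    action: Callable[[], None]   # carries out the step
    check: Callable[[], bool]    # confirms the system is in the expected state


class CheckFailed(Exception):
    """Raised when a check step does not confirm the expected state."""


def run_checklist(steps: List[Step]) -> None:
    """Run each step, stopping at the first check that does not pass."""
    for number, step in enumerate(steps, start=1):
        print(f"{number}. DO:    {step.description}")
        step.action()
        print(f"{number}. CHECK: {step.description}")
        if not step.check():
            raise CheckFailed(f"step {number} not confirmed: {step.description}")


# Hypothetical stand-in for a panel of twenty switches, all initially off.
switches = {n: False for n in range(1, 21)}


def throw_switch_12() -> None:
    switches[12] = True


def only_switch_12_thrown() -> bool:
    # Confirms both that switch 12 is on and that no other switch is.
    return [n for n, on in switches.items() if on] == [12]
```

Running `run_checklist([Step("throw switch 12", throw_switch_12, only_switch_12_thrown)])` passes only if switch 12, and just switch 12, ends up thrown; if some other switch was thrown along the way, the run stops with `CheckFailed` instead of carrying on to the next step.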

You need these checks to be explicit steps in your checklist for the same reason that you have a checklist in the first place; your memory is fallible, especially under stress, and having them written down explicitly maximizes the chances that you will always do this.

(I suspect that one of the lessons that the airline industry can teach system administration is that in this sort of situation it is best to have two people involved, one reading off the checklist and the other one performing the actions and verbally confirming that they've been done. This makes it harder to fool yourself that something has been done or that of course something looks right.)

A corollary to this is that checks should especially be inserted just before you do damaging operations such as formatting a disk, putting a replacement system online under its production IP address, or force-importing a SAN filesystem on a non-default fileserver.
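As a rough illustration of checking before a damaging operation, here is a small Python sketch of a guard that refuses to run an operation unless every named precheck passes. The names (`failed_prechecks`, `run_if_safe`) and the toy `state` describing disks are assumptions invented for this example; they do not correspond to any real tool, and the "format" here only records that it was called.

```python
from typing import Callable, Dict, List


def failed_prechecks(prechecks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Return the names of all prechecks that did not pass."""
    return [name for name, check in prechecks.items() if not check()]


def run_if_safe(operation: Callable[[], None],
                prechecks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run the operation only if every precheck passes.

    Returns the list of failed precheck names; an empty list means the
    operation was actually run.
    """
    failures = failed_prechecks(prechecks)
    if not failures:
        operation()
    return failures


# Hypothetical state: which disk we intend to format and what is in use.
state = {
    "target": "sdb",
    "expected_spare": "sdb",
    "mounted": {"sda"},
    "formatted": [],
}


def format_target() -> None:
    # Stand-in for the damaging operation; it only records the call.
    state["formatted"].append(state["target"])


format_prechecks = {
    "target is the expected spare": lambda: state["target"] == state["expected_spare"],
    "target is not mounted": lambda: state["target"] not in state["mounted"],
}
```

With the state above, `run_if_safe(format_target, format_prechecks)` returns an empty list and the format goes ahead. If the target turns out to be mounted, it returns `["target is not mounted"]` and nothing is formatted, which is the moment to stop and work out why the system is not in the state the checklist assumed.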

(Sadly, testing checks is probably even harder than testing documentation normally is; how do you manufacture failures in checklist steps to make sure that your check steps actually do anything useful?)

Written on 05 February 2010.
