Emergency procedures checklists need check steps

February 5, 2010

Given my previous entry, here is a thesis about emergency procedure documentation: you shouldn't just have a checklist for what to do. Your checklist should include actual check steps, points where you stop to explicitly confirm that you've done something and that it actually worked.

Checklists are a good idea, but the common form of a checklist is just a list of steps to be carried out. Under the stress of an emergency situation, I don't think that this is good enough. First, your checklist implicitly assumes that everything works right. Second, it's too easy to be rushed, distracted by some interruption, sleep-deprived, or whatever while you're going through the checklist, and so to lose track of where exactly you are, misdo something, or miss the potentially subtle signs that something is not working the way that your checklist assumes.

Thus, you need spots in your checklist where you don't do things but check things; you take positive steps to make sure that everything is as it should be and that the system is in the state that you and your checklist assume that it is. These checks ensure that if something goes wrong, either in the environment or in you carrying out the checklist, it gets noticed before things go horribly off the rails and explode.

In short: it's not good enough to have a checklist item that says 'throw switch 12'; you need something to confirm that you have in fact thrown switch 12 (and ideally just switch 12) and that the results of throwing switch 12 are what you expect.
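
As a minimal sketch of this idea in Python (every name here is made up for illustration, and the 'switches' are just a dictionary standing in for real equipment), each checklist step can carry both an action and an explicit check, and the checklist stops the moment a check does not pass:

```python
from dataclasses import dataclass
from typing import Callable, List


class CheckFailed(Exception):
    """Raised when a check step finds the system in an unexpected state."""


@dataclass
class Step:
    description: str
    action: Callable[[], None]  # the 'do something' part of the step
    check: Callable[[], bool]   # the 'confirm it actually worked' part


def run_checklist(steps: List[Step]) -> None:
    """Carry out each step, then verify it before moving on to the next one."""
    for number, step in enumerate(steps, start=1):
        step.action()
        if not step.check():
            raise CheckFailed(f"step {number} ({step.description}): check failed")


# Hypothetical stand-in for real hardware: a small panel of switches.
switches = {11: False, 12: False, 13: False}


def throw_switch_12() -> None:
    switches[12] = True


def only_switch_12_is_on() -> bool:
    # Confirms that switch 12 was thrown and that no other switch was.
    return switches == {11: False, 12: True, 13: False}


run_checklist([Step("throw switch 12", throw_switch_12, only_switch_12_is_on)])
```

The point of the sketch is only that the check is a separate, written-down step that looks at the actual state of things, instead of trusting that the action did what it was supposed to.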

You need these checks to be explicit steps in your checklist for the same reason that you have a checklist in the first place; your memory is fallible, especially under stress, and having them written down explicitly maximizes the chances that you will always do this.

(I suspect that one of the lessons that the airline industry can teach system administration is that in this sort of situation it is best to have two people involved, one reading off the checklist and the other one performing the actions and verbally confirming that they've been done. This makes it harder to fool yourself that something has been done or that of course something looks right.)

The corollary to this is that checks should especially be inserted before you are about to do damaging operations such as formatting a disk, putting a replacement system online under its production IP address, or force-importing a SAN filesystem on a non-default fileserver.
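
To illustrate what such a check might look like before formatting a disk, here is a rough sketch in Python. It assumes you have some way of getting a device's serial number and mount status; the `inventory` dictionary, the device names, and the serial numbers are all hypothetical stand-ins for that, not output from any real tool.

```python
class PreconditionFailed(Exception):
    """Raised when it is not safe to go ahead with a damaging operation."""


def confirm_safe_to_format(device: str, expected_serial: str, inventory: dict) -> None:
    """Positively confirm that 'device' is the disk we think it is and is idle."""
    info = inventory.get(device)
    if info is None:
        raise PreconditionFailed(f"{device}: no such device")
    if info["serial"] != expected_serial:
        raise PreconditionFailed(
            f"{device}: serial is {info['serial']}, expected {expected_serial}"
        )
    if info["mounted"]:
        raise PreconditionFailed(f"{device}: still mounted")


# Hypothetical inventory; in real life this would come from the live system.
inventory = {
    "sdb": {"serial": "OLD-1234", "mounted": False},
    "sdc": {"serial": "DATA-5678", "mounted": True},
}

# Passes silently, so the checklist may move on to the actual format step.
confirm_safe_to_format("sdb", "OLD-1234", inventory)
```

If the check raises, you stop and work out why reality differs from what the checklist assumed before you do anything irreversible.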

(Sadly, testing checks is probably even harder than testing documentation normally is; how do you manufacture failures in checklist steps to make sure that your check steps actually do anything useful?)


Comments on this page:

From 78.35.25.22 at 2010-02-05 04:46:04:

This is pretty much the equivalent of “many scripts do not adequately check for and handle error conditions.”

Aristotle Pagaltzis

By cks at 2010-02-05 12:01:31:

I think of it as more than that; in terms of a shell script, it would be as if you not just checked for commands failing but periodically wrote stuff in the script that explicitly verified that the previous commands had put files where they claimed to or that magic markers were present in files and so on.

(One way to put it is that this isn't checking so much as validating.)

Checklists already have the implicit requirement that you stop on errors. This is adding explicit positive checks every so often that the system (or some part of it) is in the state that you and your checklist assume.

(Routine checklists don't include these extra positive checks and I think that they generally shouldn't; these steps are extra work that is only justified if the odds of something going wrong are sufficiently high or the consequences of it are sufficiently dire.)

From 78.35.25.18 at 2010-02-06 04:41:14:

I see your point. Like sprinkling asserts throughout the code, I suppose?

Aristotle Pagaltzis

By cks at 2010-02-06 14:04:01:

I think assert is a nice analogy; just as assert does with code, you're explicitly checking for preconditions and postconditions that should be true. This will catch both mistakes and times where the conditions are genuinely different than what your code/procedures expect.

Written on 05 February 2010.

Last modified: Fri Feb 5 01:15:16 2010