Wandering Thoughts archives

2010-02-08

A thought on deliberately slow disaster recovery

Given my earlier entry, here is a thesis: some disasters are big enough that you should stop trying to recover rapidly.

The problem with attempting rapid disaster recovery is that significant disasters are high stress, high pressure situations. Unless you have very good checklists, this is exactly the sort of situation where it's easy to have something go catastrophically wrong in various ways: missed steps, miscommunication between people about who was doing what, failing to notice problem indicators under the pressure of driving full speed ahead, interruptions and distractions making people lose their place, and so on.

So in this sort of situation, maybe what you should do is slow down. Back off, reduce the stress level, be methodical. Take the time to be organized. Stop sometimes to take a breather. Yes, this requires accepting that the systems will come back up slower than you might have been able to achieve if you went all out and everything went well. But in return, you are much more likely to avoid making the situation (much) worse.

This is a new way of thinking about crisis handling for me, because I am very much a 'go, now now now!' type of person when trying to fix problems. (And yes, some of the time I have probably made the situation worse by rushing to slap apparent bandaids on things; my instinct is to get the system up now and sort out the situation later and, well, this is not always the right answer.)

There are two things that strike me about this. First, the most dangerous crises and disasters from this perspective are not necessarily the huge ones but the ones that have the highest potential for further damage: the ones that involve your critical infrastructure but have not already done much damage to it.

(To put it one way, if your machine room has burned down you have very little left to lose, no matter what you do.)

Second, this is not necessarily going to be easy. There are going to be a lot of people yelling at you to get things going faster, and a lot of pressure on you in general. I suspect that you're going to want management agreement on this, in advance (because you're unlikely to get it at the time, not with people yelling at your management too).

SlowDisasterRecovery written at 01:16:07

2010-02-05

Emergency procedures checklists need check steps

Given my previous entry, here is a thesis about emergency procedure documentation: you shouldn't just have a checklist for what to do, your checklist should include actual check steps, points where you stop to explicitly confirm that you've done something and it actually works.

Checklists are a good idea, but the common form of a checklist is just a list of steps to be carried out. Under the stress of an emergency situation, I don't think that this is good enough. First, your checklist implicitly assumes that everything works right. Second, it's too easy to be rushed, distracted by some interruption, sleep-deprived, or whatever while you're going through the checklist, and so lose track of exactly where you are, do something wrong, or miss the potentially subtle signs that something is not working the way that your checklist assumes.

Thus, you need spots in your checklist where you do not do things but check things; you take positive steps to make sure that everything is as it should be and that the system is in the state that you and your checklist assume that it is. These checks ensure that if something goes wrong, either in the environment or in you carrying out the checklist, it gets noticed before things go horribly off the rails and explode.

In short: it's not good enough to have a checklist item that says 'throw switch 12'; you need something to confirm that you have in fact thrown switch 12 (and ideally just switch 12) and that the results of throwing switch 12 are what you expect.

You need these checks to be explicit steps in your checklist for the same reason that you have a checklist in the first place; your memory is fallible, especially under stress, and having them written down explicitly maximizes the chances that you will always do this.
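To make the idea concrete, here is a minimal sketch of a checklist where every item carries its own check step. Everything in it is hypothetical (the `Step` and `run_checklist` names, and the panel of switches modelled as a dict); the point is only that the check is written down next to the action and that a failed check stops you right there.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One checklist item: an action plus an explicit check of its result."""
    description: str
    action: Callable[[], None]
    check: Callable[[], bool]


class CheckFailed(Exception):
    """Raised when a check step does not confirm the expected state."""


def run_checklist(steps: List[Step]) -> List[str]:
    """Run steps in order, stopping at the first check that fails.

    Returns the descriptions of the steps that were done and confirmed.
    """
    confirmed = []
    for number, step in enumerate(steps, start=1):
        step.action()
        if not step.check():
            raise CheckFailed(
                f"step {number} ({step.description}) not confirmed; stop here"
            )
        confirmed.append(step.description)
    return confirmed


# Hypothetical example: a panel of switches modelled as a dict.
switches = {11: False, 12: False, 13: False}


def throw_switch_12() -> None:
    switches[12] = True


def only_switch_12_is_on() -> bool:
    # Confirm that switch 12 is thrown and that no other switch changed.
    return switches == {11: False, 12: True, 13: False}


steps = [Step("throw switch 12", throw_switch_12, only_switch_12_is_on)]
```

Calling `run_checklist(steps)` performs the action and then the check. If someone throws switch 11 by mistake, the check fails and the run stops with `CheckFailed` instead of carrying on to the next item.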

(I suspect that one of the lessons that the airline industry can teach system administration is that in this sort of situation it is best to have two people involved, one reading off the checklist and the other one performing the actions and verbally confirming that they've been done. This makes it harder to fool yourself that something has been done or that of course something looks right.)

A corollary to this is that checks should especially be inserted before you are about to do damaging operations such as formatting a disk, putting a replacement system online under its production IP address, or force-importing a SAN filesystem on a non-default fileserver.
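As a rough sketch of what a check before a damaging operation might look like, the following refuses to run the operation unless every named precondition holds. The `guarded` helper, the disk record, and the serial number are all made up for illustration; a real version would have to query the actual system.

```python
from typing import Callable, Dict, List


def guarded(
    prechecks: Dict[str, Callable[[], bool]],
    operation: Callable[[], None],
) -> List[str]:
    """Run operation only if every precheck passes.

    Returns the names of the prechecks that failed (empty if it ran).
    """
    failed = [name for name, check in prechecks.items() if not check()]
    if not failed:
        operation()
    return failed


# Hypothetical state: what we believe about the disk we are about to format.
disk = {"device": "sdc", "serial": "WX123", "mounted": False}
expected_serial = "WX123"
formatted = []  # stands in for the damaging operation's effect

failed = guarded(
    {
        "serial matches the disk we mean": lambda: disk["serial"] == expected_serial,
        "disk is not mounted": lambda: not disk["mounted"],
    },
    lambda: formatted.append(disk["device"]),
)
```

Returning the names of the failed checks, rather than just a yes or no, gives the person at the keyboard something specific to go and look at before trying again.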

(Sadly, testing checks is probably even harder than testing documentation normally is; how do you manufacture failures in checklist steps to make sure that your check steps actually do anything useful?)

ChecklistChecks written at 01:15:16

2010-02-03

Outdated documentation is especially risky for sysadmins

The obvious traditional risk of outdated documentation in all its forms is that you rely on it and go wrong somehow; you trust the comments in the source code and write your new code accordingly, and your changes don't work. I think that this risk is especially acute for sysadmins, for two strongly related reasons.

First, much of our documentation tends to be about procedures, not simple information. Following what is actually a wrong or incomplete procedure is a great way to create spectacular failures on the spot. Worse, sysadmins inevitably wind up dealing directly with live systems and live data.

(Yes, you can test procedures just as you test the code that you write, but at some point you have to use them on your live system and this is always somewhat different from the test environment, unless you have a spectacularly complete test environment.)

Second, some of our least used documentation (and thus our riskiest) is our emergency procedures. When we need to use them, we're in one of the most tense situations possible, under a great deal of pressure to get things fixed now and thus least able to go slowly and carefully and stop if something, anything, seems off. This is the exact sort of situation where incorrect procedure documentation can do the most damage, because people don't stop before they compound a small problem into a huge one.

(Imagine, for example, an off by one error in documentation about how to map disk bay slots to device names. Now add a 'get things back up right away' crisis where you need to replace a disk.)
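Here is a small sketch of how that off by one error plays out, and of the kind of cross-check that would catch it. It assumes a made-up layout where bays are numbered from 1 and map in order to sda, sdb, and so on, and it assumes you can compare the serial number printed on the drive in a bay against the serial number the device reports. Real hardware may follow neither assumption.

```python
import string


def device_for_bay(bay: int, first_bay: int = 1) -> str:
    """Map a bay number to a device name, assuming bays map to sda, sdb, ... in order."""
    index = bay - first_bay
    if not 0 <= index < len(string.ascii_lowercase):
        raise ValueError(f"bay {bay} is outside the documented range")
    return "sd" + string.ascii_lowercase[index]


# Hypothetical inventory: the serial printed on the drive in each bay,
# and the serial that each device reports to the operating system.
label_serial_in_bay = {1: "A1", 2: "B2", 3: "C3", 4: "D4"}
reported_serial = {"sda": "A1", "sdb": "B2", "sdc": "C3", "sdd": "D4"}


def mapping_confirmed(bay: int, device: str) -> bool:
    """Check step: does the drive in this bay really sit behind this device name?"""
    label = label_serial_in_bay.get(bay)
    return label is not None and label == reported_serial.get(device)


correct = device_for_bay(3)                  # bays really start at 1
outdated = device_for_bay(3, first_bay=0)    # documentation that assumes they start at 0
```

With this inventory the outdated documentation points you at the neighbouring disk, and `mapping_confirmed(3, outdated)` is false, which is the cue to stop before you pull or overwrite a healthy drive.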

OutdatedDocumentationRiskII written at 23:20:41

How to destroy people's interest in updating documentation

Here is one of the less obvious perils of outdated documentation:

Suppose that you have some documentation that is out of date, but not in an obvious way; for example, you have an out of date network layout diagram. Since it's not obvious, you don't realize this right away, so you keep on updating the network layout diagram when you make changes to your actual network.

Except that faithfully updating an inaccurate network layout diagram is relatively pointless. When you realize that it is incorrect, you are going to have to re-check most of it anyways, or at least spend a bunch of effort to reconstruct what sections are trustworthy.

This peril of outdated documentation is that updating bad documentation is wasted effort. (Fixing bad documentation is not, but that's a different thing.)

Since updating documentation takes time that you could be using for other things, and it's generally not fun, it does not take too much time to be wasted this way before people stop updating documentation entirely. Why do annoying wasted effort, when you could be doing something that's actually productive and useful? (Especially if you did the work thinking that it wasn't wasted effort, only to find out later that what you thought was productive work, well, wasn't. People really don't like that.)

At first, this effect will probably be limited to documentation that is highly suspect. But I don't think it takes much bad documentation before people more or less give up totally, because it is too heartbreaking to waste time this way and they can't stand the idea of it any more; you will lose the culture of documentation. At that point, you can stop talking about updating documentation and start talking about reconstructing it from scratch.

(This is where local wikis are perhaps less than ideal, because at this stage what you really need to do is pave everything so that there is a clear line between 'done recently, can be trusted' and 'is old, do not trust until it has been redone'.)
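One way to approximate that clear line, sketched under the assumption that each page records the date it was last verified against reality (not merely last edited), is to pick a cutoff date and sort pages onto one side or the other. The page names, dates, and the `split_by_trust` helper are hypothetical.

```python
from datetime import date
from typing import Dict, List, Optional, Tuple


def split_by_trust(
    last_verified: Dict[str, Optional[date]],
    cutoff: date,
) -> Tuple[List[str], List[str]]:
    """Split pages into (trusted, untrusted) by when they were last verified.

    A page with no recorded verification date is never trusted.
    """
    trusted: List[str] = []
    untrusted: List[str] = []
    for page in sorted(last_verified):
        verified = last_verified[page]
        if verified is not None and verified >= cutoff:
            trusted.append(page)
        else:
            untrusted.append(page)
    return trusted, untrusted


# Hypothetical pages and verification dates.
pages = {
    "NetworkLayout": date(2009, 3, 1),
    "DiskBayMapping": date(2010, 1, 20),
    "EmergencyPowerDown": None,
}
trusted, untrusted = split_by_trust(pages, cutoff=date(2010, 1, 1))
```

The important design choice is that an edit does not reset the date; only someone actually re-checking the page does, so faithfully updating a wrong page never moves it across the line.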

OutdatedDocumentationRisk written at 01:58:54


