A thought on deliberately slow disaster recovery

February 8, 2010

Given my earlier entry, here is a thesis: some disasters are big enough that you should stop trying to recover rapidly.

The problem with attempting rapid disaster recovery is that significant disasters are high-stress, high-pressure situations. Unless you have very good checklists, this is exactly the sort of situation where it's easy to have something go catastrophically wrong in any number of ways: missed steps, miscommunication between people about who was doing what, failing to notice problem indicators under the pressure of driving full speed ahead, interruptions and distractions making people lose their place, and so on.

So in this sort of situation, maybe what you should do is slow down. Back off, reduce the stress level, be methodical. Take the time to be organized. Stop sometimes to take a breather. Yes, this requires accepting that the systems will come back up slower than you might have been able to achieve if you went all out and everything went well. But in return, you are much more likely to avoid making the situation (much) worse.

This is a new way of thinking about crisis handling for me, because I am very much a 'go, now now now!' type of person when trying to fix problems. (And yes, some of the time I have probably made the situation worse by rushing to slap apparent bandaids on things; my instinct is to get the system up now and sort out the situation later and, well, this is not always the right answer.)

There are two things that strike me about this. First, the most dangerous crises and disasters from this perspective are not necessarily the huge ones, but the ones that have the highest potential for further damage: the ones that involve your critical infrastructure but have not already done much damage to it.

(To put it one way, if your machine room has burned down you have very little left to lose, no matter what you do.)

Second, this is not necessarily going to be easy. There are going to be a lot of people yelling at you to get things going faster, and a lot of pressure on you in general. I suspect that you're going to want management agreement on this, in advance (because you're unlikely to get it at the time, not with people yelling at your management too).


Comments on this page:

From 143.48.3.13 at 2010-02-08 12:02:44:

I understand the gist of what you're saying, but in any well-run organization, IT doesn't dictate the business requirements -- business dictates the business requirements. If the business's SLAs say that services X, Y, and Z need to, without fail, be back up and running at the DR site within N hours of a disaster event, they're going to be up and running within N hours or you're going to be in a world of hurt.


Last modified: Mon Feb 8 01:16:07 2010