2013-01-22
Disaster recovery preparation is not the same as a DR plan
One of the things that I think happens in the general area of disaster recovery is that the terminology gets confused (partly because a lot of people, myself included, sort of have their toes near the water without actually being specialists). Because some of my views about disaster recovery plans turn on what I consider a DR plan to involve, I want to write down my views.
(I've sort of mentioned them in passing a couple of times recently but I feel like making things explicit.)
So I'll start with my view of the terminology:
- DR procedures are explicit, more or less step by step documentation
that you could follow to bring your systems back up after a disaster.
The archetype of DR procedures is big binders written up by big
companies, where an unfamiliar person is supposed to be able to follow
all of the steps to restore service. Sometimes the companies stage
exercises to test this (often comedy ensues).
- real DR plans are the high level versions of DR procedures, omitting
all of the voluminous details and command lines in favour of
general descriptions. A real DR plan is still a specific plan,
but it needs to be carried out by you or another experienced local
sysadmin who can fill in all of the blanks. As a specific plan,
a real DR plan still details things like what will be restored
or brought up where. You know how many servers (physical or
virtual) you will use, located where, bought or rented with what
money, connected to the network how, and so on.
- I don't have a good name for the next step: call it an abstract or
aspirational disaster recovery plan. In an abstract DR plan, you
consider and document issues like what services are crucial (and
what their dependencies are), options for how you might bring up
partial services, how many machines you might need and where you
might put them, and so on. However you do not have specifics; you
are just thinking ahead to what you'd probably need if a disaster
struck.
This probably shades into (non-detailed) business continuity planning.
- disaster recovery preparations are the general steps that you take to try to make sure you can recover from a disaster. Offsite backups, offsite copies of crucial systems or information, paper documentation of your systems, and writeups of what you would want to restore first in order to bootstrap your environment from the ground up are all disaster recovery preparations.
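(As an illustration of how small the first step can be, here is a minimal sketch of a nightly offsite backup, written in Python around rsync. Everything in it is a made-up placeholder rather than a description of any real setup: the directory list, the offsite host backup.example.com, and the destination path would all be whatever your environment actually uses.)

    #!/usr/bin/env python3
    """Minimal offsite backup sketch: rsync crucial directories to an
    offsite host.  Hosts and paths are placeholders, not recommendations."""

    import subprocess
    import sys

    # What we consider crucial enough to keep offsite copies of.
    CRUCIAL_DIRS = ["/etc", "/home", "/var/lib/backups/db-dumps"]

    # Where the offsite copies live (hypothetical host and path).
    OFFSITE = "backup@backup.example.com:/srv/offsite/ourhost"

    def main() -> int:
        failures = 0
        for src in CRUCIAL_DIRS:
            # -a preserves permissions, ownership, and timestamps;
            # --delete keeps the offsite copy from accumulating files
            # that we've since removed locally.
            cmd = ["rsync", "-a", "--delete", src, OFFSITE + "/"]
            result = subprocess.run(cmd)
            if result.returncode != 0:
                print(f"offsite backup of {src} failed "
                      f"(rsync exit {result.returncode})", file=sys.stderr)
                failures += 1
        return 1 if failures else 0

    if __name__ == "__main__":
        sys.exit(main())

(The specific tool doesn't matter; the point is that this level of preparation needs no DR plan, no preallocated servers, and almost no scenario analysis.)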
My feeling is that some degree of DR preparation is easy and relatively general; you don't have to consider very many scenarios in order to set up things like offsite backups. I have mixed feelings about abstract DR plans, part of which boils down to 'documentation needs to be tested' (which is obviously hard for an abstract plan). Actual real and concrete DR plans have a lot of requirements that I think make them hard for many organizations; among other things, they need preallocated resources in order to be meaningful.
(Note that organizations that take their DR plans and procedures seriously do periodically run tests of them; you really have to, in order to be sure that the plans will work when they're really needed. If your organization 'takes DR seriously' but has never done or budgeted for such a test, you know how seriously it really takes this.)
I have a grumpy reaction to people who go on about how everyone should have DR plans or at least seriously consider DR issues because I think their efforts are driving people away from simple disaster recovery preparations. If you phrase things so that you bundle DR preparation into DR plans and then your description of what's involved in DR planning convinces people that it is too big (and too expensive) for their environment, you are not doing them any favours.
(Of course all of the consultants and DR firms and so on make almost all of their money from actual relatively concrete DR plans, not from simple DR preparation, so they have very little motivation to separate the two things and advise people to start with simple steps before going all the way to expensive DR activities. But I am starting to rant here.)
Sidebar: the cynical reason for DR plans to exist
Put simply, DR plans are a blame deflection method for when disaster strikes and things explode. If you ordered your underlings to prepare a DR plan and the DR plan fails, you can generally deflect the resulting blame onto your underlings. In the meantime you can assure the auditors (and your management, if any) that you've considered the issue and you have a plan, honest.
(As always and as before, an organization's actual priorities are shown by what it does, not by what it says that it wants.)
2013-01-20
Real disaster recovery plans require preallocated resources
Here is one core thing about meaningful disaster recovery plans: they all require preallocation of resources. This may range from actual servers in actual racks in an actual machine room, all humming and ready to go the moment that you need them, all the way down to a pool of money reserved for disaster recovery so that you can immediately start buying new hardware and renting colocation space (or simply getting more cloud computing capacity).
If you do not have these preallocated resources, you do not really have a disaster recovery plan; you don't have something you can immediately start executing in any meaningful way and especially you don't have a plan with a time bound. Without preallocated resources, step zero of your DR plan is 'magically get money and other resources from somewhere' and magic is unpredictable and uncertain.
The problem with the preallocated resources that a meaningful DR plan requires is that they are completely unproductive now, whether they are servers that are basically unused or money that is simply sitting there not being spent. As a result there is always going to be a temptation and pressure to take these unproductive resources and do something with them; to claim servers or machine room space or money for some more urgent need.
This temptation is not stupid. At the extreme bound it's completely wrong to insist on not using the preallocated DR resources if it means that the organization goes out of business in the meantime. The relative priority of allocating resources to DR versus allocating resources to something else is always a tradeoff and a risk assessment. Sometimes DR will lose and thus it will lose resources. How often DR loses is partly a function of the organization's relative priorities and partly a function of how prosperous the organization is (ie, how many surplus resources it has in general).
I will give you a corollary: if your organization is low on resources and it does not prioritize disaster recovery very highly, I feel that there is very little point in creating a meaningful disaster recovery plan. The odds are simply very low that you will be able to hold on to your preallocated resources until a disaster happens, so you will be left with a beautiful plan but no means of carrying it out (or only the ability to execute random portions of it).
(Note that you can still be prepared for disasters even without having a DR plan. To simplify, DR preparation is having offsite backups while a DR plan is knowing what you're going to restore them on to.)
2013-01-18
SLAs, downtime, and planning
Disagreeing with Tom Limoncelli is sort of taking my life in my hands, but sometimes I can't help it. I have large and complex reactions to his All outages are due to a failure to plan, but to start with I want to jump on one bit:
What about the kind of outages that are completely unavoidable? That is a failure to plan to have an SLA that permits a reasonable amount of downtime each year. If your plan includes up to 4 hours of downtime each year, those first 239 minutes are not an outage. [...]
I feel that this is either flat out wrong or misleadingly written and vacuous. At best it is true only for large organizations (like Google) that have decided that they cannot ever be down for more than a certain amount of time, no matter what it takes.
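(As a side note on the arithmetic: the usual 'hours a year' SLA figures are just availability percentages turned into a yearly downtime budget, and a 4-hour budget sits right around the common 99.95% number. A quick back-of-the-envelope conversion, assuming a 365-day year:)

    # Yearly downtime budget implied by an availability target,
    # assuming a 365-day (8760-hour) year.
    HOURS_PER_YEAR = 365 * 24

    for availability in (0.999, 0.9995, 0.9999):
        allowed_hours = (1 - availability) * HOURS_PER_YEAR
        print(f"{availability:.2%} availability -> "
              f"{allowed_hours:.1f} hours of downtime a year")

    # 99.90% availability -> 8.8 hours of downtime a year
    # 99.95% availability -> 4.4 hours of downtime a year
    # 99.99% availability -> 0.9 hours of downtime a year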
Let me use an example that is not as hypothetical as I would like it to be: suppose that our machine room suffered major damage and was a total loss, perhaps from the building burning down, perhaps from a local flood. Depending on exactly what the disaster was, recovery would almost certainly take more than a week. What SLA can we write to cover this?
If Tom Limoncelli means us to take 'SLA' and reasonable allowed downtime hours literally, there is no non-laughable SLA that we can write to cover this. Our SLA would have to say 'allowable downtime: a few weeks (continuous)', at which point it is not an SLA but a joke. But (based on a comment reply he wrote), it seems that Tom doesn't mean this quite literally; instead he means that your 'allowed downtime' should be documented (including circumstances). If this is the case, his article is misleadingly written (since it talks about SLAs only in the usual 'hours a year' terms), unclear, and ultimately essentially vacuous. What he really seems to mean is 'document all of the situations where you will be down for an indeterminate amount of time' (and then get people to agree to them). I don't think that this is useful advice for several reasons.
First, there's very little point to it except as an excuse. It is an exercise in preparing a document that you will hand to management in order to be able to later say that you warned everyone that something could happen. If you have decent management, everyone will look back after the building has burned down and not blame you for the resulting downtime. If you have management that would blame you for not warning them that there would be a major downtime if the building burned down, you need a new job (and it's likely that preparing a document will not stop said management from blaming you anyways).
(If you get explicitly asked 'what could really ruin our week and what can we do about it', then sure, prepare a document that's as comprehensive as you can make it.)
Second, it's very hard to actually foresee all of the possible disaster scenarios that could happen to you in any detail. The universe is a very perverse place, often far more perverse than we can imagine with any degree of specificity. If you are specific you are not likely to be comprehensive, and then you expose yourself to Tom's accusation of 'failure to plan' (because in hindsight it is both easy and tempting to say 'you should have seen that obvious possibility'). If you are general you are in practice uselessly vacuous; it boils down to 'if we suffer a major catastrophe (whatever that is) we will be down for some unknowable amount of time'. There, I just wrote your SLA. Again, if your management demands something like this, find a new job.
Personally and specifically, I'm confident that I can't possibly inventory all of the terrible things that could happen to knock us out of action for at least a week. For example, before last year I doubt I would have thought to include 'AC seizes up, machine room overheats, sprinkler heads pop open in the high temperatures, all machines flooded from above for some time before power is cut off' as a disaster scenario. Or even just the general 'sprinkler heads activated without a fire'.
(If Tom Limoncelli would have me write that as the general 'machine room is lost', well, we once again circle back to vacuous 'plans'. You might as well document the situations you think you can recover from and then write 'for any other disaster, we don't know but we'll probably be down for a while'.)