SLAs, downtime, and planning

January 18, 2013

Disagreeing with Tom Limoncelli is sort of taking my life into my hands, but sometimes I can't help it. I have large and complex reactions to his All outages are due to a failure to plan, but to start with I want to jump on one bit:

What about the kind of outages that are completely unavoidable? That is a failure to plan to have an SLA that permits a reasonable amount of downtime each year. If your plan includes up to 4 hours of downtime each year, those first 239 minutes are not an outage. [...]

I feel that this is either flat out wrong or misleadingly written and vacuous. At best it is true only for large organizations (like Google) that have decided that they cannot be down ever for more than a certain amount of time no matter what it takes.

Let me use an example that is not as hypothetical as I would like it to be: suppose that our machine room suffered major damage and was a total loss, perhaps from the building burning down, perhaps from a local flood. Depending on exactly what the disaster was, recovery would almost certainly take more than a week. What SLA can we write to cover this?

If Tom Limoncelli means us to take 'SLA' and reasonable allowed downtime hours literally, there is no non-laughable SLA that we can write to cover this. Our SLA would have to say 'allowable downtime: a few weeks (continuous)', at which point it is not a SLA but a joke. But (based on a comment reply he wrote), it seems that Tom doesn't mean this quite literally; instead he means that your 'allowed downtime' should be documented (including circumstances). If this is the case, his article is misleadingly written (since it talks about SLAs only in the usual 'hours a year' terms), unclear, and ultimately essentially vacuous. What he really seems to mean is 'document all of the situations where you will be down for an indeterminate amount of time' (and then get people to agree to them). I don't think that this is useful advice for several reasons.

First, there's very little point to it except as an excuse. It is an exercise in preparing a document that you will hand to management in order to be able to later say that you warned everyone that something could happen. If you have decent management, everyone will look back after the building has burned down and not blame you for the resulting downtime. If you have management that would blame you for not warning them that there would be a major downtime if the building burned down, you need a new job (and it's likely that preparing a document will not stop said management from blaming you anyways).

(If you get explicitly asked 'what could really ruin our week and what can we do about it', then sure, prepare a document that's as comprehensive as you can make it.)

Second, it's very hard to actually foresee all of the possible disaster scenarios that could happen to you in any detail. The universe is a very perverse place, often far more perverse than we can imagine to any degree of specificness. If you are specific you are not likely to be comprehensive and then you expose yourself to Tom's accusation of 'failure to plan' (because in hindsight it is both easy and tempting to say 'you should have seen that obvious possibility'). If you are general you are in practice uselessly vacuous; it boils down to 'if we suffer a major catastrophe (whatever that is) we will be down for some unknowable amount of time'. There, I just wrote your SLA. Again, if your management demands something like this, find a new job.

Personally and specifically, I'm confident that I can't possibly inventory all of the terrible things that could happen to knock us out of action for at least a week. For example, before last year I doubt I would have thought to include 'AC seizes up, machine room overheats, sprinkler heads pop open in the high temperatures, all machines flooded from above for some time before power is cut off' as a disaster scenario. Or even just the general 'sprinkler heads activated without a fire'.

(If Tom Limoncelli would have me write that as the general 'machine room is lost', well, we once again circle back to vacuous 'plans'. You might as well document the situations you think you can recover from and then write 'for any other disaster, we don't know but we'll probably be down for a while'.)

Written on 18 January 2013.
« More on my favorite way of marking continued lines
Real disaster recovery plans require preallocated resources »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Fri Jan 18 22:48:47 2013
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.