SLAs, downtime, and planning

January 18, 2013

Disagreeing with Tom Limoncelli is sort of taking my life into my hands, but sometimes I can't help it. I have large and complex reactions to his All outages are due to a failure to plan, but to start with I want to jump on one bit:

What about the kind of outages that are completely unavoidable? That is a failure to plan to have an SLA that permits a reasonable amount of downtime each year. If your plan includes up to 4 hours of downtime each year, those first 239 minutes are not an outage. [...]

I feel that this is either flat-out wrong or misleadingly written and vacuous. At best it is true only for large organizations (like Google) that have decided that they cannot ever be down for more than a certain amount of time, no matter what it takes.
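To make the 'hours a year' framing concrete (my own illustration, not something from Tom's article): an availability percentage converts directly into a yearly downtime budget, which is the only sense in which 'the first 239 minutes are not an outage' works.

```python
# Convert an availability target into a yearly downtime budget.
# Illustrative only; the percentages are common round numbers, not anyone's actual SLA.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525600, ignoring leap years

def downtime_budget_minutes(availability_pct):
    """Minutes of downtime per year permitted by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.95):
    print(f"{pct}% availability -> {downtime_budget_minutes(pct):.0f} minutes/year")

# A '4 hours a year' SLA is roughly 99.95% availability; an incident only
# becomes an SLA violation once cumulative downtime exceeds the 240-minute budget.
```

Note that no percentage you can write down this way covers 'the building burned down and we will be out for weeks', which is the problem with taking the framing literally.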

Let me use an example that is not as hypothetical as I would like it to be: suppose that our machine room suffered major damage and was a total loss, perhaps from the building burning down, perhaps from a local flood. Depending on exactly what the disaster was, recovery would almost certainly take more than a week. What SLA can we write to cover this?

If Tom Limoncelli means us to take 'SLA' and reasonable allowed downtime hours literally, there is no non-laughable SLA that we can write to cover this. Our SLA would have to say 'allowable downtime: a few weeks (continuous)', at which point it is not an SLA but a joke. But based on a comment reply he wrote, it seems that Tom doesn't mean this quite literally; instead he means that your 'allowed downtime' should be documented (including the circumstances that produce it). If that is the case, his article is misleadingly written (since it talks about SLAs only in the usual 'hours a year' terms), unclear, and ultimately essentially vacuous. What he really seems to mean is 'document all of the situations where you will be down for an indeterminate amount of time' (and then get people to agree to them). I don't think that this is useful advice, for several reasons.

First, there's very little point to it except as an excuse. It is an exercise in preparing a document that you will hand to management in order to be able to say later that you warned everyone that something could happen. If you have decent management, everyone will look back after the building has burned down and not blame you for the resulting downtime. If you have management that would blame you for not warning them that there would be a major downtime if the building burned down, you need a new job (and it's likely that preparing a document will not stop said management from blaming you anyway).

(If you get explicitly asked 'what could really ruin our week and what can we do about it', then sure, prepare a document that's as comprehensive as you can make it.)

Second, it's very hard to actually foresee all of the possible disaster scenarios that could happen to you in any detail. The universe is a very perverse place, often far more perverse than we can imagine with any degree of specificity. If you are specific, you are not likely to be comprehensive, and then you expose yourself to Tom's accusation of 'failure to plan' (because in hindsight it is both easy and tempting to say 'you should have seen that obvious possibility'). If you are general, you are in practice uselessly vacuous; it boils down to 'if we suffer a major catastrophe (whatever that is) we will be down for some unknowable amount of time'. There, I just wrote your SLA. Again, if your management demands something like this, find a new job.

Personally and specifically, I'm confident that I can't possibly inventory all of the terrible things that could happen to knock us out of action for at least a week. For example, before last year I doubt I would have thought to include 'AC seizes up, machine room overheats, sprinkler heads pop open in the high temperatures, all machines flooded from above for some time before power is cut off' as a disaster scenario. Or even just the general 'sprinkler heads activated without a fire'.

(If Tom Limoncelli would have me write that as the general 'machine room is lost', well, we once again circle back to vacuous 'plans'. You might as well document the situations you think you can recover from and then write 'for any other disaster, we don't know but we'll probably be down for a while'.)

Comments on this page:

From at 2013-01-19 05:26:06:

I fail to see how some planning for failures couldn't have helped here. Doesn't disaster recovery planning for particular servers and whole machine rooms cover this? Or perhaps site fail-over planning? You certainly don't need individual plans for the different ways a particular server can be destroyed, other than making sure that it is really switched off and that something housed elsewhere can take over.

By cks at 2013-01-19 14:45:40:

The quick answer (which I didn't make clear in my original entry) is that for reasons well outside the scope of this entry we effectively can't create a meaningful and detailed disaster recovery plan (although we have offsite backups and so on). The short version of why is that there is no way for us to know in advance what resources we would have during a recovery from a disaster.

From at 2013-01-19 15:34:13:

My boss and I sat through an hour long phone conference with a Disaster Recovery firm that our headquarters insisted we use. From the conference it seemed like a great company, great opportunity and a good way to handle worst-case scenarios.

So we got to the end of the call and they finally asked "So where are you guys based? We'd like to arrange an on-site visit if possible?". "Hawaii" we replied. "Oh, we only provide service to the 48 contiguous states". facepalm

There are so many possible scenarios for failure that it's just impossible to even begin to document them, yet I'm increasingly being required to do so by our 3rd party security auditors, not by my management staff. That's causing no end of irritation for all parties involved. We've gone from fairly straightforward and standard Business Continuity Plans to them wanting almost absurd detail that frequently has no basis in technical reality. E.g. MySQL replication is absolutely nothing like RAID replication; the modes of failure are different, the recovery process is different, etc. etc. :)

From at 2013-01-21 11:28:03:

Well, Google can (and does) build datacenters over geographically distant points to prevent total outages in individual datacenters, so they're covered for "if we lose a datacenter" type scenarios.

But for shops too small for that kind of thing, this can also take the form of "we have offsite backups and in a pinch, we can use Amazon until we get our own hardware back in operation".

Or, sometimes you tell the boss "Well, what happens if there's a fire or flood in the datacenter?" and the boss just says "We'll have to accept that risk". In which case, you've already done all you can do. :)

From at 2013-01-25 14:47:58:

I think that Limoncelli was perhaps terse (which I realize is surprising coming from such a voluble guy). For example, my current gig exists in a single data center, while $CORP_HQ has a DR mandate. We accept that, in case of meteor strike, we have no effective contingency plan. So it can be argued that having no plan is our plan, due to cost-benefit analysis. We can get philosophical and argue whether an SLA where the service level is 0 is actually an SLA. So, in summary, I agree with --rone
