Simple availability doesn't capture timing and the amount of warning

August 31, 2013

Here is a mistake that I have actually kind of made: a simple availability or 'amount of downtime' number does not fully capture your availability situation. In real life it matters a lot both when you go down and whether or not you have advance warning. To put it simply, an hour of planned downtime at 6pm is qualitatively different from an hour of unplanned downtime at 6pm (or at 11am on your busiest morning) even if they have exactly the same effect on your overall availability numbers.
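As a rough illustration, here is a minimal Python sketch of two months that produce exactly the same availability number despite having very different outages. Everything in it is hypothetical and made up for the example: the `Outage` record, the `availability` helper, the 30-day month, and the outage figures themselves.

```python
from dataclasses import dataclass

HOURS_IN_MONTH = 30 * 24  # assumption: a 30-day month, purely for illustration


@dataclass
class Outage:
    hours: float      # how long the outage lasted
    planned: bool     # did users get advance warning?
    start_hour: int   # hour of day it started (0-23)


def availability(outages, period_hours=HOURS_IN_MONTH, include_planned=True):
    """Fraction of the period during which the service was up."""
    down = sum(o.hours for o in outages if include_planned or not o.planned)
    return 1 - down / period_hours


# One hour of announced downtime at 6pm ...
month_a = [Outage(hours=1, planned=True, start_hour=18)]
# ... versus one hour of surprise downtime at 11am.
month_b = [Outage(hours=1, planned=False, start_hour=11)]

# The headline number is identical for both months.
print(f"{availability(month_a):.4%}")
print(f"{availability(month_b):.4%}")

# Excluding planned downtime makes month_a look perfect.
print(f"{availability(month_a, include_planned=False):.4%}")
```

The single number throws away both the `planned` flag and the `start_hour`, which are exactly the things that determine how much the outage actually hurt.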

(I've sometimes seen availability numbers cited as excluding planned downtimes. That strikes me as disingenuous unless it comes with very careful disclaimers and a bunch of additional information.)

Of course it's better to not have the downtime at all, but if you're going to have it, it's generally quite worthwhile to transform an unplanned downtime into a planned one (often even if the planned downtime is longer). There is a surprising amount of technology that effectively exists to do this conversion; for example, any non-hotswappable form of redundancy.

(If you have some form of redundancy that you can't hotswap and one half of it breaks (so now you have no redundancy), you're going to have to eventually take things down to restore the redundancy. This shifts the unplanned downtime of losing your only whatever-it-is to the planned downtime of replacing one.)

Sidebar: UPSes in this view

If you have a perfect UPS and no source of alternate or additional power (a redundant power supply, a transfer switch, etc), you're likely converting unplanned power failures into planned UPS battery replacements. In real life UPSes have been known to cause problems and it's usually not that difficult to have power redundancy. Overall a good setup probably simply decreases the chances of unplanned downtimes.

(Our UPSes exist not to prevent unplanned downtimes from power loss but to hopefully prevent unplanned downtimes from ZFS pool corruption due to power loss. This gives me an odd perspective on UPS issues.)


Comments on this page:

By Perry Lorier at 2013-09-01 05:13:28:

I've seen this quoted as "Availability", which is what percentage of the time you're actually up, and "Reliability", which is what percentage of the time you're up excluding planned maintenances (usually when there's a rigid planned maintenance procedure that involves signoffs by more than just the guy doing the work).

The difference between the numbers tells you how well you're doing.

Written on 31 August 2013.

Last modified: Sat Aug 31 23:02:42 2013