Really understanding availability numbers
People like to talk about reliability and availability, and throw around terms like '5-nines availability' and so on. You may have heard this from server salescritters (alongside their attempts to sell you redundant power supplies).
At the same time, what the terms really imply is not intuitive and is often surprising, especially at the high end.
The following chart comes from my co-worker John Calvin:
| "4-Nines" | 99.99% | ~5 minutes/month (52 min/year) |
| "3-Nines" | 99.9% | ~90 seconds/day (8.8 hours/year or ~10 min/week) |
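The figures in a chart like this are plain arithmetic: the allowed downtime is the unavailable fraction times the length of the period. Here is a minimal sketch of that calculation. The function name `downtime_budget` and the choice of a 365-day year (with a month taken as one twelfth of it) are my assumptions, not part of the original chart.

```python
# Seconds in each period. Assumption: a 365-day year, and a month
# treated as exactly 1/12 of that year.
SECONDS_PER = {
    "day": 24 * 3600,
    "week": 7 * 24 * 3600,
    "month": 365 * 24 * 3600 / 12,
    "year": 365 * 24 * 3600,
}


def downtime_budget(availability_percent):
    """Return the allowed downtime, in seconds, for each period."""
    unavailable = 1 - availability_percent / 100
    return {period: unavailable * secs for period, secs in SECONDS_PER.items()}


for label, pct in [("3-Nines", 99.9), ("4-Nines", 99.99)]:
    budget = downtime_budget(pct)
    print(
        f"{label}: {budget['day']:.1f} s/day, "
        f"{budget['week'] / 60:.1f} min/week, "
        f"{budget['month'] / 60:.1f} min/month, "
        f"{budget['year'] / 3600:.2f} h/year"
    )
```

Passing in other percentages shows how quickly the budget shrinks with each extra nine.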
Things to note:
- a typical system takes longer than 30 seconds to boot.
- a five minute distributed denial of service attack can happen relatively routinely. (It certainly does to us.)
- the August 14th 2003 blackout in the eastern US and Canada lasted over half a day for most people.
- very few service contracts will get you useful service in under two hours.
- just swapping power supplies, hard drives, or CPU fans can easily take more than ten minutes.
- there are many possible 4-Nines and 3-Nines availabilities, based on how much downtime you can accept in a single incident. 3-Nines where you can only be down for 10 minutes in any single incident is very different from 3-Nines where you can be down for a couple of hours twice a year.
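To make the last point concrete, here is a rough sketch that checks a year's worth of outages against both a total downtime budget and a per-incident cap. The function `meets_target` and the sample outage lists are hypothetical, made up purely for illustration, and a 365-day year is assumed.

```python
def meets_target(outage_minutes, availability_percent, max_incident_minutes):
    """Check one year's outages against a total budget and a per-incident cap."""
    minutes_per_year = 365 * 24 * 60  # assumption: 365-day year
    budget = (1 - availability_percent / 100) * minutes_per_year
    within_total = sum(outage_minutes) <= budget
    within_cap = all(m <= max_incident_minutes for m in outage_minutes)
    return within_total and within_cap


# Hypothetical outage histories with the same total downtime (240 minutes).
two_long_outages = [120, 120]    # two 2-hour outages
many_short_outages = [10] * 24   # twenty-four 10-minute outages

# Both fit the 3-Nines yearly total, but only one fits a 10-minute cap.
print(meets_target(two_long_outages, 99.9, max_incident_minutes=10))    # False
print(meets_target(two_long_outages, 99.9, max_incident_minutes=180))   # True
print(meets_target(many_short_outages, 99.9, max_incident_minutes=10))  # True
```

The same total downtime passes or fails depending on the cap, which is why a bare "3-Nines" figure says less than it seems to.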
Even with a good service contract, a single commodity server is exposed to multi-hour outages and so is at best 3-Nines available. In fact, anything that can ever require service is at most 3-Nines available.
(In practice you are cruising on the edge of even 3-Nines; you are betting on only a few service calls a year.)
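The arithmetic behind that bet can be sketched as follows, assuming each service call costs about two hours of downtime (the two-hour figure from the notes above, used here as an illustrative assumption) and a 365-day year. The helper names are hypothetical.

```python
import math

HOURS_PER_YEAR = 365 * 24  # assumption: 365-day year


def availability_percent(calls_per_year, hours_per_call):
    """Availability achieved if every service call costs hours_per_call of downtime."""
    downtime = calls_per_year * hours_per_call
    return 100 * (1 - downtime / HOURS_PER_YEAR)


def max_calls(availability_target_percent, hours_per_call):
    """Largest whole number of service calls per year that fits the target."""
    budget_hours = (1 - availability_target_percent / 100) * HOURS_PER_YEAR
    return math.floor(budget_hours / hours_per_call)


# A single two-hour call already misses 4-Nines (99.99%).
print(f"{availability_percent(1, 2):.3f}%")

# How many two-hour calls per year fit each target.
print(max_calls(99.9, 2))   # 3-Nines
print(max_calls(99.99, 2))  # 4-Nines
```

Under these assumptions 3-Nines leaves room for only four such calls a year, and 4-Nines leaves room for none.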