October 6, 2012

To follow up on my illustrated example of this, I wanted to talk about how averages mislead people. They do it in at least two different ways.

The first way that averages mislead is that they smooth out exceptions. The longer the amount of time you average across and the more activity you see, the more that an average will hide exceptional activity (well, burry it under a mass of normal activity). You generally can't do very much about the amount of activity, so if you want to spot exceptions using an average you need to look at your 'average' over very short time intervals. Our recent issue was a great example of this. Exceptionally slow disk activity that wasn't really visible in a 60-second average did sometimes jump out in a one-second average. Of course the problem with fast averages is that then you generate a lot of results to go through (and also it's noisy).

It's worth understanding that this is not a problem with averages as such. Since the purpose of averages is to smooth things out, using an average should mean that you don't care about exceptions. If you do care about exceptions you need a different metric. Unfortunately people don't always provide one, which is a problem. The corollary is that if you're designing the statistics that your system will report and you plan to only report averages, you should be really confidant that exceptions either won't happen or won't matter. And you're probably wrong about both parts of that.

(Exceptional activity does affect even a long-term average, but it often doesn't affect it enough for things to be obviously wrong. Instead of saying 'this is crazy', you say 'hmm, things are slower than I was expecting'.)

The second way that averages mislead is that they hide the actual distribution of values. The usual assumption with averages is that you have a nice bell-shaped distribution centered around the average, but this is not necessarily the case. All sorts of distributions will give you exactly the same average and they have very different implications for how your system works. A disk IO system with a normal distribution centered on the average value is likely to feel very different from a disk IO system that has, say, two normal distributions superimposed on top of each other, one significantly faster than the average and one significantly slower.

(This is where my ignorance of most of statistics kicks in, because I don't know if there's some simple metrics that will give you a sense of the actual distribution is or if you really need to plot the distribution somehow and take a look at it.)

My illustrated example involved both ways. The so-so looking average was hiding significant exceptions and the exceptions were not random outliers; instead they were part of a distinct distribution. In the end it turned out that what looked like one distribution was in fact two distinct distributions stacked on top of each other, but that's another entry.

From 173.164.235.197 at 2012-10-06 05:48:53:

The usual way to get a feel for a distribution is quantiles: minimum/median/maximum, maybe quartiles (25th and 75th percentile), and probably 95th and 99th percentile (or 84.13th, 97.72nd, 99.87th pecentiles if you're used to thinking in sigmas). However, quantiles aren't the most convenient metric to compute on-the-fly: the naïve algorithm is "take the complete list of all values, sort it, then pick the n-th value in the sorted list such that `n/len(list) ≈ quantile`". (E.g. the item at `n=0.5*len(list)` for the 50th percentile a.k.a. the median. If n is not an integer, interpolate linearly.) There may be some tricks to get constant-memory, constant-time behavior for a fixed list of desired quantiles, although I don't know of any off the top of my head. There's definitely an O(1)-space algorithm for the n-th largest item given fixed n, but I don't think it can be adapted for when n changes depending on the list length.

Instead of using real quantiles, the easy way is to turn the problem inside-out using a histogram: create buckets of 0ms..1ms, 1ms..2ms, 2ms..4ms, and so on, doubling exponentially until you hit sufficiently ridiculous values that you can end with an x..infinity bucket and be satisfied with it. Updating is simple: each time you see a value, simply increment the corresponding bucket counter. The ratio between any one counter and the sum of all counters equals the fraction of values that fell in that bucket's range; ASCII asterisks or hash marks growing left-to-right are dead simple with printf and make the data easy to interpret visually. For some computations, it makes sense to convert 0ms..1ms, 1ms..2ms, 2ms..4ms buckets into 0ms..1ms, 0ms..2ms, 0ms..4ms buckets; the latter form is called a "reverse cumulative distribution function", apparently, with "cumulative" the key word. The bucket counter ratios in the reverse-CDF scheme directly translate to quantiles for each bucket cutoff, making it a particularly useful format for creating monitoring rules around quantile thresholds (e.g. "[measured value: the fraction of disk I/O operations that take less than 64ms] should be 99.9% or higher").

Written on 06 October 2012.