How averages mislead you
To follow up on my illustrated example of this, I wanted to talk about how averages mislead people. They do it in at least two different ways.
The first way that averages mislead is that they smooth out exceptions. The longer the amount of time you average across and the more activity you see, the more that an average will hide exceptional activity (well, burry it under a mass of normal activity). You generally can't do very much about the amount of activity, so if you want to spot exceptions using an average you need to look at your 'average' over very short time intervals. Our recent issue was a great example of this. Exceptionally slow disk activity that wasn't really visible in a 60-second average did sometimes jump out in a one-second average. Of course the problem with fast averages is that then you generate a lot of results to go through (and also it's noisy).
It's worth understanding that this is not a problem with averages as such. Since the purpose of averages is to smooth things out, using an average should mean that you don't care about exceptions. If you do care about exceptions you need a different metric. Unfortunately people don't always provide one, which is a problem. The corollary is that if you're designing the statistics that your system will report and you plan to only report averages, you should be really confidant that exceptions either won't happen or won't matter. And you're probably wrong about both parts of that.
(Exceptional activity does affect even a long-term average, but it often doesn't affect it enough for things to be obviously wrong. Instead of saying 'this is crazy', you say 'hmm, things are slower than I was expecting'.)
The second way that averages mislead is that they hide the actual distribution of values. The usual assumption with averages is that you have a nice bell-shaped distribution centered around the average, but this is not necessarily the case. All sorts of distributions will give you exactly the same average and they have very different implications for how your system works. A disk IO system with a normal distribution centered on the average value is likely to feel very different from a disk IO system that has, say, two normal distributions superimposed on top of each other, one significantly faster than the average and one significantly slower.
(This is where my ignorance of most of statistics kicks in, because I don't know if there's some simple metrics that will give you a sense of the actual distribution is or if you really need to plot the distribution somehow and take a look at it.)
My illustrated example involved both ways. The so-so looking average was hiding significant exceptions and the exceptions were not random outliers; instead they were part of a distinct distribution. In the end it turned out that what looked like one distribution was in fact two distinct distributions stacked on top of each other, but that's another entry.