Disk IO latency is often what matters
After recent experiences, I've become convinced that my current methods of testing and monitoring disk performance are what I'd call inadequate. Most of my testing and monitoring focuses on disk bandwidth, often streaming IO bandwidth because it's easy to be consistent with that. One problem with this is that random IO matters too, but the bigger problem is that I've come to believe that latency is what really affects your perceived performance.
Yes, you need good disk bandwidth in order to deliver decent performance. Given the things that can happen with modern disks, bandwidth is worth testing (and necessary to test), but it's not sufficient. In many cases, good bandwidth with bad latency will give you a terrible user experience, because low latency is what lets the system react promptly to user actions. Very little of what normal people do on systems is latency-insensitive. When you save a file, click on a mail message in your IMAP reader, or even run ls, how fast IO starts responding makes a huge difference to how the system feels.
I've further come to feel that what matters is not average latency but something more like the 99th or 99.9th percentile. Modern environments do a lot of disk IO, so a 'one in a hundred operations' delay happens quite frequently, probably every thirty seconds or less if you're actively using the system (after all, 100 operations in 30 seconds is less than four a second). And the worse these long latencies are, the larger and more annoying the stall is from the user's perspective. It doesn't take much before the system is stuttering and painful to use.
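(As a rough illustration of what measuring tail latency could look like, here's a small Python sketch; the function names and the choice of write-plus-fsync as the probe operation are my own assumptions, not a recommendation of any particular tool. It times individual synchronous writes and reports percentiles rather than an average.)

```python
import os
import tempfile
import time


def percentile(samples, pct):
    """Return the pct-th percentile (0-100) of samples, nearest-rank style."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[idx]


def timed_write_fsync(path, ops=200, size=4096):
    """Time individual write+fsync operations; return latencies in milliseconds.

    fsync forces each write out toward the disk, so we measure the
    device's response time rather than the speed of the page cache.
    """
    buf = os.urandom(size)
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(ops):
            start = time.perf_counter()
            os.pwrite(fd, buf, 0)
            os.fsync(fd)
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
    return latencies


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        lats = timed_write_fsync(os.path.join(d, "probe"))
    # The gap between the median and the tail is the interesting number.
    for pct in (50, 99, 99.9):
        print(f"p{pct}: {percentile(lats, pct):.3f} ms")
```

Even a toy like this makes the point: a disk can show a perfectly pleasant median while the p99.9 figure is tens or hundreds of milliseconds, and it's the latter that you feel as stutter.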
As before, I'm certain that I've read all of this in various places, but getting smacked in the nose with it has made the whole thing that much more real.
(I'm going to have to think hard about how to test latencies in a useful, repeatable way. A good starting point will be the question of just what I want to measure.)