Disk IO latency is often what matters

October 10, 2012

After recent experiences, I've become convinced that my current methods of testing and monitoring disk performance are inadequate. Most of my testing and monitoring focuses on disk bandwidth, often streaming IO bandwidth because that's easy to measure consistently. One problem with this is that random IO matters too, but the bigger problem is that I've come to believe that latency is what really affects your perceived performance.

Yes, you need good disk bandwidth in order to deliver decent performance. Given the things that can happen with modern disks, it's worth testing (and necessary to test), but it's not sufficient. In many cases, good bandwidth with bad latency will give you a terrible user experience, because latency is what you need to react promptly to user actions. Very little of what normal people do on systems is latency-insensitive. When you save a file, click on a mail message in your IMAP reader, or even run ls, how fast IO starts responding makes a huge difference to how the system feels.

I've further come to feel that what matters is not average latency but something more like the 99th or 99.9th percentile. Modern environments do a lot of disk IO, so a 'one in a hundred operations' delay happens quite frequently, probably every thirty seconds or less if you're actively using the system (after all, 100 operations in 30 seconds is less than four a second). And the worse these long latencies are, the larger and more annoying the stall is from the user's perspective. It doesn't take much before the system is stuttering and painful to use.
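To make the percentile argument concrete, here is a minimal Python sketch. The latency numbers are invented purely for illustration (they are not measurements), and the helper names percentile and seconds_between_slow_ops are hypothetical. It uses the simple nearest-rank definition of a percentile, which is only one of several definitions in use.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    # round() guards against floating point noise before taking the ceiling.
    rank = math.ceil(round(pct * len(ordered) / 100, 9))
    return ordered[max(rank, 1) - 1]


def seconds_between_slow_ops(ops_per_second, slow_fraction):
    """Average time between slow operations if slow_fraction of all
    operations are slow and the system does ops_per_second operations."""
    return 1.0 / (ops_per_second * slow_fraction)


# Made-up data: 980 fast operations at 5 ms and 20 slow ones at 500 ms.
latencies_ms = [5.0] * 980 + [500.0] * 20

print("mean  :", mean(latencies_ms))
print("median:", percentile(latencies_ms, 50))
print("p99   :", percentile(latencies_ms, 99))
print("p99.9 :", percentile(latencies_ms, 99.9))

# The arithmetic from the text: 100 operations in 30 seconds, 1% of them slow.
print("seconds between slow ops:", seconds_between_slow_ops(100 / 30, 0.01))
```

With this invented data the mean is under 15 ms and the median is 5 ms, which both look healthy, while the 99th percentile is the full 500 ms stall. At a busier (assumed) rate of 50 operations a second, a one-in-a-hundred delay would come around every two seconds on average.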

As before, I'm certain that I've read this in various places, but getting smacked in the nose with it has made the whole thing that much more real.

(I'm going to have to think hard about how to test latencies in a useful, repeatable way. A good starting point will be the question of just what I want to measure.)
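As one possible starting point (a rough sketch, not a settled method), the Python below times individual random reads from an existing file and reports tail latencies instead of an average. The function names, the 4 KiB block size and the fixed seed are all assumptions made for illustration. It also makes no attempt to defeat the operating system's page cache, so unless the file is much larger than memory it will mostly measure cached reads rather than the disk; a real test would have to deal with that, and with what to measure in the first place.

```python
import os
import random
import time


def time_random_reads(path, count=1000, block_size=4096, seed=0):
    """Return a list of per-read latencies in seconds for `count`
    block-aligned reads at random offsets in the file at `path`."""
    rng = random.Random(seed)  # fixed seed so runs hit the same offsets
    blocks = max(1, os.path.getsize(path) // block_size)
    latencies = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for _ in range(count):
            offset = rng.randrange(blocks) * block_size
            start = time.perf_counter()
            os.pread(fd, block_size, offset)  # Unix-only positional read
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return latencies


def summarize(latencies):
    """Report the median and the tail of the latencies, not the average."""
    ordered = sorted(latencies)

    def at(pct):
        index = min(len(ordered) - 1, int(len(ordered) * pct / 100))
        return ordered[index]

    return {
        "median": at(50),
        "p99": at(99),
        "p99.9": at(99.9),
        "max": ordered[-1],
    }
```

Usage would be something like summarize(time_random_reads("/path/to/big/testfile")), where the path is a placeholder for a large file on the disk being tested. Keeping every individual sample, rather than a running average, is what makes the percentile view possible.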

Comments on this page:

By trs80 at 2012-10-14 07:47:20:

You may be interested in the Tech Report's attempt at graphing frame latencies.

Last modified: Wed Oct 10 02:12:43 2012