Unix's iowait% is a narrow and limited measure that can be misleading

March 4, 2020

For a long time, iowait% has been one of my standard Unix system performance indicators to check, both for having visible iowait% and for not having it. As I interpreted it, a machine with high or appreciable iowait% clearly had potential IO issues, while a machine with low or no iowait% was fine as far as IO went, including NFS (this is on display in, for example, my entry on the elevated load average of our main IMAP server). Unfortunately, I've recently realized that the second half of this is not actually correct. To put it simply, programs waiting for IO may only create iowait% when the machine is otherwise idle.

Suppose, as a non-hypothetical example, that you have a busy IMAP server with lots of sessions from people who are spread all over your fleet of NFS fileservers, some of which are lightly loaded and fast and some of which are not, along with a constant background noise of random attackers on the Internet trying password guessing through SMTP authentication attacks and so on. With a lot of uncorrelated processes, it's quite possible that something will be runnable most of the time when (some) IMAP sessions are stalled waiting for NFS IO from your most heavily loaded fileserver. Since there are running processes, your waiting processes may well not show up as a visible iowait%, fooling you into thinking that everything is fine as far as IO goes.
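(To make the accounting concrete, here is a toy model in Python. It assumes a single CPU and the simplified rule that a tick counts as iowait only when nothing is runnable and at least one process is waiting on IO. The function names and the tick counts are made up for illustration; real kernels differ in the details.)

```python
def account_ticks(timeline):
    """Classify each tick of a single CPU as busy, iowait, or idle.

    timeline is a list of (runnable, waiting_on_io) process counts,
    one pair per tick. This is a simplified model, not kernel code.
    """
    counts = {"busy": 0, "iowait": 0, "idle": 0}
    for runnable, waiting_on_io in timeline:
        if runnable > 0:
            # Something can run, so the tick is charged as CPU time,
            # no matter how many other processes are stalled on IO.
            counts["busy"] += 1
        elif waiting_on_io > 0:
            # Nothing can run and something is waiting on IO.
            counts["iowait"] += 1
        else:
            counts["idle"] += 1
    return counts


def iowait_percent(counts):
    """Return iowait ticks as a percentage of all ticks."""
    total = sum(counts.values())
    return 100.0 * counts["iowait"] / total if total else 0.0


# Three IMAP sessions are stalled on NFS for all ten ticks, but some
# other process is runnable on eight of those ticks.
mixed = [(1, 3)] * 8 + [(0, 3)] * 2
# The same three stalled sessions on an otherwise idle machine.
quiet = [(0, 3)] * 10

print(iowait_percent(account_ticks(mixed)))  # 20.0
print(iowait_percent(account_ticks(quiet)))  # 100.0
```

In both timelines the same three sessions are stalled the whole time; only the presence of other runnable work changes the reported number.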

In general, a high iowait% is a sign that your entire system is stalled on IO, but a low iowait% isn't necessarily a sign that no important processes are stalled on IO. The situation isn't symmetrical. In an ironic twist given what I wrote recently about it, I now think that an inexplicably high load average is probably a good signal that you have some processes stalling on IO while others are running fine (so that these stalls don't show up as iowait%), at least on Unixes where waiting on IO is reflected in the load average.

(The usual vmstat output reports a 'blocked' count, but that's an instantaneous number and may not fully capture things. The load average is gathered continuously and so will reflect more of the overall situation.)
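(One way to make the blocked count less instantaneous is to sample it repeatedly yourself. The following is a Linux-specific sketch in Python that assumes the usual /proc/stat layout, with an aggregate 'cpu' line whose fifth value is iowait ticks and a 'procs_blocked' line. The function names, sample count, and interval are my own arbitrary choices, not anything standard.)

```python
import time


def parse_proc_stat(text):
    """Extract aggregate CPU ticks and the blocked count from /proc/stat text."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "cpu":
            # Field order: user nice system idle iowait irq softirq steal ...
            # Only the first eight are summed, because guest time is
            # already included in user time.
            ticks = [int(f) for f in fields[1:9]]
            result["total_ticks"] = sum(ticks)
            result["iowait_ticks"] = ticks[4]
        elif fields[0] == "procs_blocked":
            result["procs_blocked"] = int(fields[1])
    return result


def iowait_percent(before, after):
    """iowait% between two parsed samples of /proc/stat."""
    total = after["total_ticks"] - before["total_ticks"]
    if total <= 0:
        return 0.0
    return 100.0 * (after["iowait_ticks"] - before["iowait_ticks"]) / total


def sample_blocked(samples=10, interval=1.0, path="/proc/stat"):
    """Sample procs_blocked several times; return (average, maximum)."""
    counts = []
    for _ in range(samples):
        with open(path) as f:
            counts.append(parse_proc_stat(f.read())["procs_blocked"])
        time.sleep(interval)
    return sum(counts) / len(counts), max(counts)
```

On a Linux machine, calling sample_blocked() and comparing its result against iowait_percent() over the same period would show whether processes are regularly blocked even while iowait% stays low. This is still sampling, so short stalls between samples can be missed.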

Now that I've realized this, I'm going to have to be much more careful about seeing a low iowait% and concluding that the system is fine as far as IO goes. Unfortunately I'm not sure if there are any good metrics for this that are widely available and easily worked out, especially for NFS (where you don't generally have a 'utilization' percentage in the way you usually do for local disks).

(There's a practical problem with iowait% on modern Linux systems, but that needs another entry.)
