Linux's iowait statistic and multi-CPU machines
Yesterday I wrote about how multi-CPU machines quietly complicate
the standard definition of iowait,
because you can have some but not all CPUs idle while you have
processes waiting on IO. The system is not totally idle, which is
what the normal Linux definition of iowait is about,
but some CPUs are idle and implicitly waiting for IO to finish.
Linux complicates its life because iowait is considered to be a
per-CPU statistic, like user, nice, system, idle, irq, softirq,
and the other per-CPU times reported in /proc/stat.
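To make the per-CPU framing concrete, here is a minimal sketch of pulling those per-CPU times out of /proc/stat. The function name and the dictionary layout are my own inventions for illustration. The field order is the one documented in proc(5), and I only take the first seven fields.

```python
from typing import Dict

# Order of the leading fields on each "cpu" line of /proc/stat, per proc(5).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")


def parse_per_cpu(stat_text: str) -> Dict[str, Dict[str, int]]:
    """Return {cpu_name: {field: ticks}} for the per-CPU lines only."""
    result: Dict[str, Dict[str, int]] = {}
    for line in stat_text.splitlines():
        parts = line.split()
        # "cpu0", "cpu1", ... are per-CPU lines; a bare "cpu" is the total.
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        values = [int(v) for v in parts[1:1 + len(FIELDS)]]
        result[parts[0]] = dict(zip(FIELDS, values))
    return result


def read_per_cpu(path: str = "/proc/stat") -> Dict[str, Dict[str, int]]:
    """Read and parse the live file (Linux only)."""
    with open(path) as f:
        return parse_per_cpu(f.read())
```

On a Linux machine, `read_per_cpu()["cpu0"]["iowait"]` would give the cumulative iowait ticks charged to the first CPU since boot. These are running totals, so a rate needs two readings and a subtraction.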
As it turns out, this per-CPU iowait figure is genuine, in one
sense; it is computed separately for each CPU and CPUs may report
significantly different numbers for it. How modern versions of the
Linux kernel keep track of iowait involves something between brute
force and hand-waving. Each task (a process or thread) is associated
with a CPU while it is running. When a task goes to sleep to wait
for IO, it increases a count of how many tasks are waiting for IO
'on' that CPU, called nr_iowait. Then if nr_iowait is greater
than zero and the CPU is idle, the idle time is charged to iowait
for that CPU instead of to 'idle'.
(You can see this in the code in
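The accounting rule described above can be sketched as a toy model. This is not kernel code. The `Cpu` class and `account_tick()` are hypothetical names of mine, and I lump all busy time into 'user' to keep it short. Only the nr_iowait counter and the "idle plus waiters means iowait" rule come from the description.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    """Toy stand-in for one CPU's accounting state (not a kernel structure)."""
    nr_iowait: int = 0     # tasks that went to sleep for IO while on this CPU
    running: bool = False  # is some task running here right now?
    user: int = 0
    idle: int = 0
    iowait: int = 0


def account_tick(cpu: Cpu) -> None:
    """Charge one tick of time to exactly one bucket for this CPU."""
    if cpu.running:
        cpu.user += 1      # a busy CPU never accrues idle or iowait
    elif cpu.nr_iowait > 0:
        cpu.iowait += 1    # idle, but something is waiting for IO 'on' it
    else:
        cpu.idle += 1      # idle with no waiters


cpu = Cpu()
account_tick(cpu)          # nothing running, nothing waiting: idle
cpu.nr_iowait += 1         # a task on this CPU sleeps waiting for IO
account_tick(cpu)          # idle CPU with a waiter: iowait
cpu.running = True         # another task starts running here
account_tick(cpu)          # busy CPU: the waiter becomes invisible
print(cpu.idle, cpu.iowait, cpu.user)
```

The last tick is the interesting one: the task waiting for IO is still there, but because the CPU is busy, no iowait is recorded for it.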
The problem with this is that a task waiting on IO is not really attached to any particular CPU. When it wakes up, the kernel will try to run it on its 'current' CPU (ie the last CPU it ran on, the CPU whose run queue it's in), but if that CPU is busy and another CPU is free, the now-awake task will be scheduled on the free CPU instead. There is nothing that particularly guarantees that tasks waiting for IO are evenly distributed across all CPUs, or are parked on idle CPUs; as far as I know, you might have five tasks all waiting for IO on one CPU that's also busy running a sixth task, while five other CPUs are all idle. In this situation, the Linux kernel will happily say that one CPU is 100% user and five CPUs are 100% idle and there's no iowait going on at all.
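That six-CPU scenario is easy to play out in a few lines. This is only an illustration under the simplified rule described earlier (busy means 'user', idle with waiters means 'iowait', otherwise 'idle'). The `classify()` helper and the dictionaries are made up for the example.

```python
from typing import Dict, List, Union

CpuState = Dict[str, Union[bool, int]]


def classify(running: bool, nr_iowait: int) -> str:
    """Which bucket a CPU's time lands in under the simplified rule."""
    if running:
        return "user"
    return "iowait" if nr_iowait > 0 else "idle"


# CPU 0 runs one task and has five tasks waiting for IO 'on' it;
# CPUs 1 through 5 are idle and have no waiters attributed to them.
cpus: List[CpuState] = [{"running": True, "nr_iowait": 5}]
cpus += [{"running": False, "nr_iowait": 0} for _ in range(5)]

report = [classify(bool(c["running"]), int(c["nr_iowait"])) for c in cpus]
total_waiting = sum(int(c["nr_iowait"]) for c in cpus)

print(report)          # per-CPU view: no CPU shows iowait
print(total_waiting)   # yet this many tasks are blocked on IO
```

Five tasks are blocked on IO, and not one tick of iowait shows up anywhere.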
(As far as I can see, the per-CPU number of tasks waiting for IO
is not reported at all. A global number of tasks in iowait is
reported in /proc/stat, but that doesn't
tell you how they're distributed across your CPUs. Also, it's
an instantaneous number instead of some sort of accounting of
this over time.)
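If you want something closer to accounting over time, one crude option is to sample that global number yourself. The sketch below assumes the relevant /proc/stat line is the one named procs_blocked, which is what proc(5) documents as the count of processes blocked waiting for IO. The function names and the sampling defaults are arbitrary choices of mine.

```python
import time
from typing import List, Optional


def parse_procs_blocked(stat_text: str) -> Optional[int]:
    """Return the procs_blocked value from /proc/stat text, if present."""
    for line in stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "procs_blocked":
            return int(parts[1])
    return None


def sample_blocked(samples: int = 5, interval: float = 1.0,
                   path: str = "/proc/stat") -> List[int]:
    """Take several instantaneous readings, `interval` seconds apart."""
    readings: List[int] = []
    for i in range(samples):
        with open(path) as f:
            value = parse_procs_blocked(f.read())
        if value is not None:
            readings.append(value)
        if i + 1 < samples:
            time.sleep(interval)
    return readings
```

Averaging the result of `sample_blocked()` gives a rough feel for how many tasks are typically blocked, but it can easily miss short bursts between samples, and it still says nothing about which CPUs they are attributed to.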
There's a nice big comment about this in kernel/sched/core.c
(look for nr_iowait(), if you have to find it because the
source has shifted). The comment summarizes the situation this way:
This means, that when looking globally, the current IO-wait accounting on SMP is a lower bound, by reason of under accounting.
(It also says in somewhat more words that looking at the iowait for individual CPUs is nonsensical.)
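Given that, if you want an iowait number at all, the system-wide one is the less bad choice, read as a lower bound. Here is a minimal sketch that computes the iowait share of total CPU time between two snapshots of /proc/stat text, using only the aggregate 'cpu' line. The function names are hypothetical. I sum only the first eight fields because, as I understand it, guest time is already counted inside user and nice.

```python
from typing import List


def cpu_totals(stat_text: str) -> List[int]:
    """Return the numeric fields of the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return [int(v) for v in parts[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_share(before: str, after: str) -> float:
    """Fraction of all CPU time charged to iowait between two snapshots.

    Field order per proc(5): user nice system idle iowait irq softirq
    steal guest guest_nice.
    """
    deltas = [b - a for a, b in zip(cpu_totals(before), cpu_totals(after))]
    total = sum(deltas[:8])  # skip guest fields to avoid double counting
    return deltas[4] / total if total > 0 else 0.0
```

You would call this with the contents of /proc/stat read at two different times. Whatever it returns, the real amount of time that CPUs spent idle because of pending IO may be higher.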
Programs that report per-CPU iowait numbers on Linux are in some sense not incorrect; they're faithfully reporting what the kernel is telling them. The information they present is misleading, though, and in an ideal world their documentation would tell you that per-CPU iowait is not meaningful and should be ignored unless you know what you're doing.
PS: It's possible that
provide useful information here, if you have a sufficiently modern
kernel. Unfortunately the normal Ubuntu 18.04 server kernel is not