Linux's iowait statistic and multi-CPU machines

March 7, 2020

Yesterday I wrote about how multi-CPU machines quietly complicate the standard definition of iowait, because you can have some but not all CPUs idle while you have processes waiting on IO. The system is not totally idle, which is what the normal Linux definition of iowait is about, but some CPUs are idle and implicitly waiting for IO to finish. Linux complicates its life because iowait is considered to be a per-CPU statistic, like user, nice, system, idle, irq, softirq, and the other per-CPU times reported in /proc/stat (see proc(5)).

As it turns out, this per-CPU iowait figure is genuine, in one sense; it is computed separately for each CPU and CPUs may report significantly different numbers for it. How modern versions of the Linux kernel keep track of iowait involves something between brute force and hand-waving. Each task (a process or thread) is associated with a CPU while it is running. When a task goes to sleep to wait for IO, it increases a count of how many tasks are waiting for IO 'on' that CPU, called nr_iowait. Then if nr_iowait is greater than zero and the CPU is idle, the idle time is charged to iowait for that CPU instead of to 'idle'.

(You can see this in the code in account_idle_time() in kernel/sched/cputime.c.)

The problem with this is that a task waiting on IO is not really attached to any particular CPU. When it wakes up, the kernel will try to run it on its 'current' CPU (ie the last CPU it ran on, the CPU whose run queue it's in), but if that CPU is busy and another CPU is free, the now-awake task will be scheduled on the free CPU instead. There is nothing that particularly guarantees that tasks waiting for IO are evenly distributed across all CPUs, or are parked on idle CPUs; as far as I know, you might have five tasks all waiting for IO on one CPU that's also busy running a sixth task, while five other CPUs are all idle. In this situation, the Linux kernel will happily say that one CPU is 100% user and five CPUs are 100% idle and there's no iowait going on at all.
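To make the accounting rule and this scenario concrete, here is a toy model in Python. It is only a sketch of the logic described above, not the kernel's code; ToyCPU and simulate() are names made up for illustration, and time is counted in abstract ticks with all busy time lumped into 'user'.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCPU:
    """Hypothetical stand-in for one CPU's run queue and time counters."""
    running: bool = False  # is some task currently running on this CPU?
    nr_iowait: int = 0     # tasks that went to sleep for IO 'on' this CPU
    times: dict = field(
        default_factory=lambda: {"user": 0, "idle": 0, "iowait": 0}
    )

    def tick(self):
        # A busy CPU never accounts idle or iowait time, no matter how
        # many tasks are waiting for IO on it.
        if self.running:
            self.times["user"] += 1
        # An idle CPU charges the tick to iowait only if its *own*
        # nr_iowait count is above zero.
        elif self.nr_iowait > 0:
            self.times["iowait"] += 1
        else:
            self.times["idle"] += 1


def simulate(cpus, ticks):
    """Run every CPU for the given number of ticks; return the counters."""
    for _ in range(ticks):
        for cpu in cpus:
            cpu.tick()
    return [dict(cpu.times) for cpu in cpus]


# The scenario from the text: one busy CPU with five tasks waiting for
# IO on it, plus five completely idle CPUs.
cpus = [ToyCPU(running=True, nr_iowait=5)] + [ToyCPU() for _ in range(5)]
result = simulate(cpus, 100)
total_iowait = sum(t["iowait"] for t in result)

# For contrast: the same five waiters parked on an otherwise idle CPU.
contrast = simulate([ToyCPU(running=False, nr_iowait=5)], 100)
```

In this model total_iowait comes out as zero even though five tasks are blocked on IO the whole time, while the contrast case charges every tick to iowait. The only difference between the two is which CPU the waiting tasks happen to be counted against.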

(As far as I can see, the per-CPU number of tasks waiting for IO is not reported at all. A global number of tasks in iowait is reported as procs_blocked in /proc/stat, but that doesn't tell you how they're distributed across your CPUs. Also, it's an instantaneous number instead of some sort of accounting of this over time.)
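If you want to look at what the kernel does report, a minimal Python sketch for pulling the per-CPU iowait counters and procs_blocked out of /proc/stat might look like the following. The field positions follow proc(5); parse_proc_stat() and the SAMPLE text are made up for illustration, and the sample numbers are invented rather than taken from a real machine.

```python
# Invented sample in the /proc/stat format described in proc(5).
# On a real machine you would use open("/proc/stat").read() instead.
SAMPLE = """\
cpu  400 0 200 9000 30 0 10 0 0 0
cpu0 300 0 100 4000 30 0 5 0 0 0
cpu1 100 0 100 5000 0 0 5 0 0 0
intr 12345 0 0
ctxt 67890
btime 1583539872
processes 4321
procs_running 2
procs_blocked 5
"""


def parse_proc_stat(text):
    """Return ({cpu name: iowait ticks}, procs_blocked) from /proc/stat text."""
    iowait = {}
    procs_blocked = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        name = fields[0]
        if name.startswith("cpu") and len(fields) > 5:
            # After the name: user nice system idle iowait ...
            # so iowait is the fifth number, at index 5.
            iowait[name] = int(fields[5])
        elif name == "procs_blocked":
            procs_blocked = int(fields[1])
    return iowait, procs_blocked


iowait, procs_blocked = parse_proc_stat(SAMPLE)
```

The iowait values are cumulative counters since boot (in USER_HZ units), so you need the difference between two readings to get a rate. procs_blocked is only the count at the instant you read the file.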

There's a nice big comment about this in kernel/sched/core.c (just above nr_iowait(), if you have to find it because the source has shifted). The comment summarizes the situation this way, emphasis mine:

This means, that when looking globally, the current IO-wait accounting on SMP is a lower bound, by reason of under accounting.

(It also says in somewhat more words that looking at the iowait for individual CPUs is nonsensical.)

Programs that report per-CPU iowait numbers on Linux are in some sense not incorrect; they're faithfully reporting what the kernel is telling them. The information they present is misleading, though, and in an ideal world their documentation would tell you that per-CPU iowait is not meaningful and should be ignored unless you know what you're doing.

PS: It's possible that /proc/pressure/io can provide useful information here, if you have a sufficiently modern kernel. Unfortunately the normal Ubuntu 18.04 server kernel is not sufficiently modern.
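As a rough sketch of what reading this looks like, /proc/pressure/io has a 'some' line and a 'full' line, each with avg10, avg60 and avg300 percentages and a cumulative total in microseconds. The Python below parses that format; parse_pressure() and the sample text are made up for illustration, and the numbers are invented.

```python
# Invented sample in the /proc/pressure/io format. On a kernel with
# pressure stall information you would read the real file instead.
SAMPLE = """\
some avg10=12.50 avg60=4.20 avg300=1.05 total=8123456
full avg10=9.75 avg60=3.10 avg300=0.80 total=6012345
"""


def parse_pressure(text):
    """Parse pressure-file text into {'some': {...}, 'full': {...}}."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, pairs = parts[0], parts[1:]
        values = {}
        for pair in pairs:
            key, _, value = pair.partition("=")
            # 'total' is a cumulative stall time in microseconds; the
            # avg* fields are percentages over 10, 60 and 300 seconds.
            values[key] = int(value) if key == "total" else float(value)
        result[kind] = values
    return result


pressure = parse_pressure(SAMPLE)
```

These figures are system-wide rather than per-CPU, so they don't depend on which CPU a waiting task happens to be counted against.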


Comments on this page:

For anyone else who was wondering, a quick search says /proc/pressure appeared in kernel 4.20, and the user-space monitoring interface was added in 5.2. Ubuntu 18.04 LTS (and also the linux-aws kernel on EC2) is on 4.15.

By Anon at 2020-03-09 15:29:50:

I thought the 18.04 HWE kernels were up to 5.3 (https://packages.ubuntu.com/bionic/linux-generic-hwe-18.04 )?

Written on 07 March 2020.

