Considering "iowait" CPU time and CPU utilization

November 13, 2021

Recently, the Prometheus host agent landed a commit (from a pull request) that excluded the CPU iowait% from an example that showed 'CPU utilization'. Seeing this commit fly by got me thinking about our dashboards, where our overall 'system utilization' dashboard shows a 'non idle time' graph that includes iowait%. My current conclusion is that our overview graph is correct for what we're interested in, but what we're interested in is not quite CPU utilization.

Generally iowait% is a narrow measurement, and on multi-CPU Linux machines it's a lower bound, so we can be pretty sure that if a machine reports iowait%, it had one or more CPUs that were totally idle. In this sense, iowait% time should not count toward a "CPU utilization" measurement, because it is time where we could have had the CPU active if we had more work. A system with 50% CPU busy and 50% iowait is a system that could be doing twice as much computation with the right job load.
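The arithmetic in that last example can be sketched in a few lines of Python. This is only an illustration: the function name and the dict of percentages are assumptions made up for this sketch, not anything taken from the Prometheus host agent. It treats iowait as a form of idle time, which is the view of "CPU utilization" described above.

```python
def cpu_utilization(pct):
    """CPU utilization in the strict sense: everything except idle and iowait.

    `pct` is a hypothetical dict of CPU time percentages that sum to 100.
    """
    return 100.0 - pct.get("idle", 0.0) - pct.get("iowait", 0.0)


# The 50% busy / 50% iowait system from the text (the user/system split is
# an arbitrary choice for illustration).
sample = {"user": 40.0, "system": 10.0, "iowait": 50.0, "idle": 0.0}
util = cpu_utilization(sample)
headroom = 100.0 / util  # how many times the current computation could fit
print(util, headroom)
```

Under this definition the sample machine is 50% utilized and could in principle do twice the computation, even though it reports no idle time at all.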

(As a digression, Linux iostat's '%steal' does count as CPU utilization in that sense. Such "steal" time is CPU time that was not available to the machine's virtual CPUs because the hypervisor took it, so you could not have put it to use by scheduling more work.)

However, iowait% is an indication that the machine is doing something instead of sitting idle. If we want to see how active our machines are, we do want to include iowait%. A machine with 90% iowait, 5% user, and 5% system is a very active machine despite having lots of CPU cycles left. This is the purpose of our overview graph; it's a quick way of seeing how genuinely idle our machines are (because, in our environment, our non-compute servers mostly are idle). For this purpose, including iowait% is okay and even desirable.

(In theory including 'steal%' is wrong; in practice, we have no virtual machines, only bare metal servers, so 'steal%' is universally zero and we can stay with the simple version of '100 - %idle'. Well, with a more complicated PromQL expression that amounts to that.)
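For readers who want to see what '100 - %idle' amounts to outside of PromQL, here is a rough Python sketch that computes it from two samples of the aggregate 'cpu' line in Linux's /proc/stat. The function names and the counter values are made up for illustration, and the sketch assumes a kernel that reports at least the first eight counters (user, nice, system, idle, iowait, irq, softirq, steal). It is not the expression our dashboards actually use.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counts."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Use only the first eight counters; guest time is already counted in user.
    return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))


def non_idle_percent(before, after):
    """'100 - %idle' over the interval between two samples."""
    delta = {k: after[k] - before[k] for k in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        return 0.0
    return 100.0 * (total - delta["idle"]) / total


# Made-up counter values for illustration, not real measurements.
# Over the interval: 5% user, 5% system, 90% iowait, no idle time.
t0 = parse_cpu_line("cpu 1000 0 500 8000 400 0 0 0 0 0")
t1 = parse_cpu_line("cpu 1050 0 550 8000 1300 0 0 0 0 0")
print(non_idle_percent(t0, t1))
```

Because iowait is simply one more non-idle counter here, the 90% iowait machine shows up as fully active, which is what we want from an overview of how idle our machines are. Any steal time would be counted as activity too, which is harmless only because it is always zero for us.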

