2006-03-30
What disk IO stats you get from the Linux kernel
To follow up my previous entry on iostat problems, here's a rundown of the information you actually get from the Linux kernel.
First off, you only get this from 2.6 kernels, or 2.4 kernels with the Red Hat disk stats patch (such as Red Hat Enterprise 3). In 2.6 this information appears in /proc/diskstats; in Red Hat's 2.4, it appears in /proc/partitions with slightly more fields.
/proc/diskstats fields for devices (as opposed to partitions) are, in order (and using the names Red Hat labeled them with):
major minor name rio rmerge rsect ruse wio wmerge wsect wuse running use aveq
In /proc/diskstats, partitions show only the major, minor, name, rio, rsect, wio, and wsect fields. In the Red Hat 2.4 code, /proc/partitions shows all fields for partitions, although you're still probably better off using the device.
These mean:
rio         | number of read IO requests completed
rmerge      | number of submitted read requests that were merged into existing requests
rsect       | number of read IO sectors submitted
ruse        | total length of time all completed read requests have taken to date, in milliseconds
w* versions | the same as the r* versions, but for writes
running     | instantaneous count of IOs currently in flight
use         | number of milliseconds during which there has been at least one IO in flight
aveq        | the sum of how long all requests have spent in flight, in milliseconds
Just to confuse everyone, the sector and merge counts are for submitted IO requests, but rio/wio and ruse/wuse are for completed IO requests. If IO is slow, bursty, or both, this difference can be important when trying to compute accurate numbers for things like the average sectors per request. (I've usually seen this for large writes during high IO load.)
The aveq number is almost but not quite the sum of ruse and wuse, because it also counts incomplete requests. All of ruse, wuse, use, and aveq can occasionally run backwards.
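As an illustration, here's a minimal Python sketch that parses device lines from /proc/diskstats into these named fields (the function name read_diskstats is my own invention, not anything standard). It assumes the fourteen-field 2.6-era device format described above; partition lines, which have fewer fields, are skipped.

    # Sketch: parse device lines from /proc/diskstats into the fields above.
    # Assumes the 2.6-era layout: major minor name + eleven stat fields.
    STAT_FIELDS = ("rio", "rmerge", "rsect", "ruse",
                   "wio", "wmerge", "wsect", "wuse",
                   "running", "use", "aveq")

    def read_diskstats():
        stats = {}
        with open("/proc/diskstats") as f:
            for line in f:
                parts = line.split()
                if len(parts) < 14:
                    continue  # 2.6 partition lines have only seven fields
                stats[parts[2]] = dict(zip(STAT_FIELDS, map(int, parts[3:14])))
        return stats

With this, read_diskstats()['sda'] (or whatever your device is called) gives you a dictionary keyed by the field names above.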
We can now see how iostat computes several fields:
iostat field | computed as                   | what
avgrq-sz     | (rsect + wsect) / (rio + wio) | the average sectors per request
avgqu-sz     | aveq / use                    | the average queue size
await        | (ruse + wuse) / (rio + wio)   | the average time to completion for IO
While it would be useful to show 'rgrp-sz', 'wgrp-sz', 'rwait', and 'wwait' figures, iostat does not do so. This is unfortunate, as read and write IOs usually have very different characteristics (eg, typical write IO requests usually take significantly longer to complete than reads).
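Although iostat won't show them, nothing stops you from computing these per-direction figures yourself. Here's a sketch (building on the read_diskstats() function above, with a sampling interval I picked arbitrarily) that takes two samples and prints the missing numbers:

    import time

    def sample(dev, interval=5.0):
        before = read_diskstats()[dev]
        time.sleep(interval)
        after = read_diskstats()[dev]
        # The counters can occasionally run backwards, so real code
        # should guard against negative deltas.
        d = {k: after[k] - before[k] for k in before}
        rwait = d["ruse"] / d["rio"] if d["rio"] else 0.0  # ms per completed read
        wwait = d["wuse"] / d["wio"] if d["wio"] else 0.0  # ms per completed write
        # Note the submitted-versus-completed skew discussed above:
        # rsect/wsect count submitted IO, while rio/wio count completed IO.
        rgrp = d["rsect"] / d["rio"] if d["rio"] else 0.0  # sectors per read
        wgrp = d["wsect"] / d["wio"] if d["wio"] else 0.0  # sectors per write
        print(f"{dev}: rwait {rwait:.1f}ms wwait {wwait:.1f}ms "
              f"rgrp-sz {rgrp:.1f} wgrp-sz {wgrp:.1f}")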
We can also see how the iostat svctm field, the average IO service time, is bogus: there is simply no information on that provided by the kernel. The kernel would need a 'rduse' / 'wduse' set of fields that reported the total time taken once the requests had been picked up by the device driver (and it'd need to record that information).
(If you care, iostat computes svctm as 'use / (rio + wio)'. This is less than obvious in the source code, because you have to cancel out a number of other terms. Also, it shows why svctm drops as your IO load rises (once you've hit 100% utilization).)
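To see why, remember that at 100% utilization 'use' advances at the same rate as wall-clock time no matter how deep the queue gets, while completions keep climbing. A quick numeric sketch (the interval and IO counts here are made up for illustration):

    # At saturation, 'use' grows by the full interval regardless of load.
    interval_use = 1000                 # ms with at least one IO in flight
    for completed in (50, 100, 200):    # IOs finished in that interval
        print(f"{completed} IOs -> svctm {interval_use / completed:.1f} ms")
    # More completions in the same saturated interval mean a lower
    # svctm, regardless of how long each IO actually took.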
If you want to check the kernel code that does the work, it's in drivers/block in ll_rw_blk.c and genhd.c, in both 2.6 and Red Hat 2.4. ll_rw_blk.c maintains the numbers; genhd.c displays them.