Some Linux disk IO stats you can calculate from kernel information

November 23, 2018

I've written in the (distant) past about what disk IO stats you get from the Linux kernel and what per-partition stats you get. Now I'm interested in what additional stats you can calculate from these, especially stats that aren't entirely obvious.

These days, the kernel's Documentation/iostats.txt more or less fully documents all of the raw stats that the kernel makes available (it's even recently been updated to add some very recent new stats that are only available in 4.18+), and it's worth reading in general. However, for the rest of this entry I'm mostly going to use my names for the fields from my original entry for lack of better ones (iostats.txt describes the fields but doesn't give them short names).

The total number of reads or writes submitted to the kernel block layer is rio plus rmerge (or wio plus wmerge), since rio includes only reads that go to the disk. As it was more than a decade ago, the count of merged IOs (rmerge and wmerge) is increased when the IO is submitted, while the count of completed IOs only increases at the end of things. However, these days the count of sectors read and written is for completed IO, not submitted IO. If you ignore merged IOs (and I usually do), you can basically ignore the difference between completed IOs and submitted but not yet completed IOs.

(A discussion of the issues here is beyond the scope of this entry.)
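As a rough sketch of how the raw numbers map onto my field names, here is one way to split a /proc/diskstats line and compute the submitted IO counts. The sample line is made up, and 'running' is just a placeholder name I'm using here for the in-flight IO count, which this entry doesn't otherwise name.

```python
from typing import Dict, Tuple

# Short names for the first eleven per-device fields, in kernel order.
# 'running' (IOs currently in flight) is a placeholder name, an assumption.
FIELDS = ("rio", "rmerge", "rsect", "ruse",
          "wio", "wmerge", "wsect", "wuse",
          "running", "use", "aveq")


def parse_diskstats_line(line: str) -> Dict[str, int]:
    """Parse one /proc/diskstats line into a dict of named counters."""
    parts = line.split()
    # parts[0:3] are major, minor and device name; newer kernels append
    # extra fields after the first eleven, which are ignored here.
    values = parts[3:3 + len(FIELDS)]
    if len(values) < len(FIELDS):
        raise ValueError("not enough fields in diskstats line")
    return {name: int(value) for name, value in zip(FIELDS, values)}


def submitted_ios(stats: Dict[str, int]) -> Tuple[int, int]:
    """Reads and writes submitted to the block layer (completed plus merged)."""
    return stats["rio"] + stats["rmerge"], stats["wio"] + stats["wmerge"]


# A made-up sample line, not real measurements.
sample = "   8       0 sda 1000 200 80000 1500 500 100 40000 2500 0 900 4000"
stats = parse_diskstats_line(sample)
reads_submitted, writes_submitted = submitted_ios(stats)
```

All of these are cumulative counters, so in practice you would take two snapshots and work with the differences.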

The per-second rate of rio and wio over time is the average number of IOs per second. The per-second rate of rsect and wsect is similarly the average bandwidth (well, once you convert it from 512-byte 'sectors' to bytes or KB or whatever unit you like). As usual, averages can conceal wild variations and wild outliers. Bytes over time divided by requests over time gives you the average request size, either for a particular type of IO or in general.

(When using these stats, always remember that averages mislead (in several different ways), and the larger the time interval the more they mislead. Unfortunately the Linux kernel doesn't directly provide better stats.)
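A minimal sketch of these rate calculations, assuming you have two snapshots of the counters as dicts keyed by my field names and know how many seconds apart they were taken (the function name and the dict layout are illustrative, not from any existing tool):

```python
SECTOR_BYTES = 512  # the kernel's 'sectors' here are always 512 bytes


def io_rates(prev, cur, seconds):
    """Average IOPS, bandwidth and request size between two snapshots."""
    delta = {k: cur[k] - prev[k] for k in ("rio", "wio", "rsect", "wsect")}
    ios = delta["rio"] + delta["wio"]
    nbytes = (delta["rsect"] + delta["wsect"]) * SECTOR_BYTES
    return {
        "iops": ios / seconds,
        "bytes_per_sec": nbytes / seconds,
        # Average request size over both reads and writes.
        "avg_request_bytes": nbytes / ios if ios else 0.0,
    }


# Made-up counters taken two seconds apart.
prev = {"rio": 0, "wio": 0, "rsect": 0, "wsect": 0}
cur = {"rio": 100, "wio": 100, "rsect": 800, "wsect": 800}
rates = io_rates(prev, cur, 2.0)
```

The same arithmetic on just the read or just the write counters gives the per-type figures.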

The amount of time spent reading or writing (ruse or wuse) divided by the number of read or write IOs gives you the average IO completion time. This is not the device service time; it's the total time from initial submission, including waiting in kernel queues to be dispatched to the device. If you want the average wait time across all IOs, you should compute this as '(ruse + wuse) / (rio + wio)', because you want to weight the average wait time for each type of IO by how many of them there were.

(In other words, a lot of very fast reads and a few slow writes should give you a still very low average wait time.)
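To make the weighting concrete, here is a small sketch with invented numbers: 1000 reads that took 1 ms each and 10 writes that took 100 ms each. The function names are mine, and the 'naive' version is there only to show what goes wrong if you average the two per-type averages.

```python
def average_wait_ms(ruse, wuse, rio, wio):
    """Average wait across all IOs: (ruse + wuse) / (rio + wio)."""
    ios = rio + wio
    return (ruse + wuse) / ios if ios else 0.0


def naive_average_wait_ms(ruse, wuse, rio, wio):
    """Unweighted mean of the read and write averages (misleading)."""
    read_avg = ruse / rio if rio else 0.0
    write_avg = wuse / wio if wio else 0.0
    return (read_avg + write_avg) / 2


# 1000 reads at 1 ms each, 10 writes at 100 ms each (invented figures).
weighted = average_wait_ms(ruse=1000, wuse=1000, rio=1000, wio=10)
naive = naive_average_wait_ms(ruse=1000, wuse=1000, rio=1000, wio=10)
```

The weighted figure comes out just under 2 ms, while the naive one is 50.5 ms, which describes almost none of the IOs that actually happened.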

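The amount of time that there has been at least one IO in flight (my use) gives you the device utilization over your time period; if you're generating per-second figures, it should never exceed 1, but it may be less. The average queue size is 'weighted milliseconds spent doing IOs' (my aveq) divided by use. (This is the average queue size while the device was busy; dividing aveq by the elapsed time instead gives you the average over the whole interval, idle time included.)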
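A sketch of both calculations, assuming use and aveq are deltas in milliseconds over an interval of known length. The numbers are invented, and the third return value (queue size averaged over the whole interval rather than just the busy time) is a variant I've included for comparison.

```python
def utilization_and_queue(use_ms, aveq_ms, seconds):
    """Return (utilization, queue size while busy, queue size over interval)."""
    interval_ms = seconds * 1000.0
    utilization = use_ms / interval_ms
    # aveq / use: average IOs in flight during the time the device was busy.
    queue_busy = aveq_ms / use_ms if use_ms else 0.0
    # aveq / elapsed time: average IOs in flight over the whole interval.
    queue_interval = aveq_ms / interval_ms
    return utilization, queue_busy, queue_interval


# Device busy for 500 ms of a one second interval, with 2000 weighted ms.
util, q_busy, q_interval = utilization_and_queue(500, 2000, 1.0)
```

With these numbers the device was 50% utilized and had an average of four IOs in flight while it was busy, or two averaged over the full second.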

Total bandwidth (rsect + wsect) divided by use gives you what I will call a 'burst bandwidth' figure. Writing 100 MBytes to an SSD and to a HD over the course of a second gives you the same per-second IO rate for both; however, if the HD took the full second to complete the write (with a use of the full second) while the SSD took a quarter of a second (with a use of .25 second), the SSD was actually writing at 400 MBytes/sec instead of the HD's 100 MBytes/sec. Under most situations this is going to be close to the actual device bandwidth, because you won't have IOs in the queue without at least one being dispatched to the drive itself.
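The HD versus SSD example works out like this in a small sketch. I'm assuming use is a delta in milliseconds and using MiB (1024 * 1024 bytes) so that 100 of them is a whole number of 512-byte sectors; the function name is just for illustration.

```python
SECTOR_BYTES = 512
MIB = 1024 * 1024


def burst_bandwidth(rsect, wsect, use_ms):
    """Bytes per second of busy time: (rsect + wsect) / use."""
    if use_ms == 0:
        return 0.0
    return (rsect + wsect) * SECTOR_BYTES / (use_ms / 1000.0)


# 100 MiB written in both cases; only the busy time differs.
sectors = 100 * MIB // SECTOR_BYTES
hd_mib_per_sec = burst_bandwidth(0, sectors, 1000) / MIB   # busy the full second
ssd_mib_per_sec = burst_bandwidth(0, sectors, 250) / MIB   # busy a quarter second
```

Both devices show the same 100 MiB/sec when you divide by elapsed time; dividing by use is what separates them.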

There are probably additional stats you can calculate that haven't occurred to me yet, partly because I don't think about this very often (it certainly feels like there should be additional clever derived stats). I only clued in to the potential importance of my 'burst bandwidth' figure today, in fact, as I was thinking about what else you could do with use.

If you want latency histograms for your IO and other more advanced things, you can turn to eBPF, for instance through the BCC tools. See also eg Brendan Gregg on eBPF. But all of that is beyond the scope of this entry, partly because I've done basically nothing with these eBPF tools yet for various reasons.

Sidebar: Disk IO stats for software RAID devices and LVM

The only non-zero stats that software RAID devices provide (at least for mirrors and stripes) are the number of read and write IOs completed and the number of sectors read and written. Unfortunately we don't get any sort of time or utilization information for software RAID devices.

LVM devices (which on my Fedora systems show up in /proc/diskstats as 'dm-N') do appear to provide time information; I see non-zero values for my ruse, wuse, use, and aveq fields. I don't know how accurate they are and I haven't attempted to use any of my tools to see.

Written on 23 November 2018.

Last modified: Fri Nov 23 23:30:56 2018