Wandering Thoughts archives


Some Linux disk IO stats you can calculate from kernel information

I've written in the (distant) past about what disk IO stats you get from the Linux kernel and what per-partition stats you get. Now I'm interested in what additional stats you can calculate from these, especially stats that aren't entirely obvious.

These days, the kernel's Documentation/iostats.txt more or less fully documents all of the raw stats that the kernel makes available (it's even recently been updated to add some very recent new stats that are only available in 4.18+), and it's worth reading in general. However, for the rest of this entry I'm mostly going to use my names for the fields from my original entry for lack of better ones (iostats.txt describes the fields but doesn't give them short names).
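To make the later calculations concrete, here is a minimal sketch of parsing one /proc/diskstats line into the field names used in this entry. The field order is the pre-4.18 format (eleven stat fields after the device name, per iostats.txt); the sample line and its numbers are invented for illustration.

```python
# Field names from the original entry, in /proc/diskstats order
# (pre-4.18: eleven stat fields after major, minor, device name).
FIELDS = ("rio", "rmerge", "rsect", "ruse",
          "wio", "wmerge", "wsect", "wuse",
          "inflight", "use", "aveq")

def parse_diskstats_line(line):
    parts = line.split()
    # parts[0:3] are major, minor, and the device name; stats follow.
    name = parts[2]
    stats = dict(zip(FIELDS, (int(x) for x in parts[3:14])))
    return name, stats

# An invented sample line:
sample = "8 0 sda 5437 1721 317642 4803 9103 8250 579102 14990 0 9156 19761"
name, st = parse_diskstats_line(sample)
```

On a real system you would read /proc/diskstats and run each line through this; the parsed dict is what the snippets below work from.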

The total number of reads or writes submitted to the kernel block layer is rio plus rmerge (or wio plus wmerge), since rio includes only reads that go to the disk. As it was more than a decade ago, the count of merged IOs (rmerge and wmerge) is increased when the IO is submitted, while the count of completed IOs only increases at the end of things. However, these days the count of sectors read and written is for completed IO, not submitted IO. If you ignore merged IOs (and I usually do), you can basically ignore the difference between completed IOs and submitted but not yet completed IOs.

(A discussion of the issues here is beyond the scope of this entry.)

The per-second rate of rio and wio over time is the average number of IOs per second. The per-second rate of rsect and wsect is similarly the average bandwidth per second (well, once you convert it from 512-byte 'sectors' to bytes or KB or whatever unit you like). As usual, averages can conceal wild variations and wild outliers. Bytes over time divided by requests over time gives you the average request size, either for a particular type of IO or in general.
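As a sketch of those calculations, here is turning two samples of the counters (taken some seconds apart) into per-second rates and an average request size. The sample numbers are invented; rio/wio are completed IOs and rsect/wsect are 512-byte sectors.

```python
def io_rates(prev, cur, interval):
    # Per-second IO rates from the deltas between two samples.
    riops = (cur["rio"] - prev["rio"]) / interval
    wiops = (cur["wio"] - prev["wio"]) / interval
    # Sectors are 512 bytes, so convert to bytes per second.
    rbps = (cur["rsect"] - prev["rsect"]) * 512 / interval
    wbps = (cur["wsect"] - prev["wsect"]) * 512 / interval
    ios = riops + wiops
    # Bytes over time divided by requests over time: average request size.
    avgreq = (rbps + wbps) / ios if ios else 0.0
    return {"riops": riops, "wiops": wiops,
            "rbps": rbps, "wbps": wbps, "avgreq": avgreq}

prev = {"rio": 1000, "wio": 2000, "rsect": 80000, "wsect": 160000}
cur  = {"rio": 1100, "wio": 2200, "rsect": 88000, "wsect": 176000}
rates = io_rates(prev, cur, 10.0)
```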

(When using these stats, always remember that averages mislead (in several different ways), and the larger the time interval the more they mislead. Unfortunately the Linux kernel doesn't directly provide better stats.)

The amount of time spent reading or writing (ruse or wuse) divided by the number of read or write IOs gives you the average IO completion time. This is not the device service time; it's the total time from initial submission, including waiting in kernel queues to be dispatched to the device. If you want the average wait time across all IOs, you should compute this as '(ruse + wuse) / (rio + wio)', because you want to weight the average wait time for each type of IO by how many of them there were.

(In other words, a lot of very fast reads and a few slow writes should give you a still very low average wait time.)
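A small sketch of that weighted average, with invented numbers (ruse and wuse are cumulative milliseconds): many fast reads and a few slow writes give an overall average close to the read latency.

```python
def avg_wait_ms(ruse, wuse, rio, wio):
    # Weight each type of IO by how many of them there were:
    # (ruse + wuse) / (rio + wio), not the average of the two averages.
    total_ios = rio + wio
    return (ruse + wuse) / total_ios if total_ios else 0.0

# 10000 reads totalling 2000 ms (~0.2 ms each) plus
# 100 writes totalling 5000 ms (~50 ms each):
wait = avg_wait_ms(ruse=2000, wuse=5000, rio=10000, wio=100)
# The result stays well under a millisecond because reads dominate.
```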

The amount of time that there has been at least one IO in flight (my use) gives you the device utilization over your time period; if you're generating per-second figures, it should never exceed 1, but it may be less. The average queue size is 'weighted milliseconds spent doing IOs' (my aveq) divided by use.
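Sketched in code, with invented numbers: use and aveq are cumulative milliseconds, so both calculations work on deltas between two samples.

```python
def util_and_queue(prev, cur, interval):
    d_use = cur["use"] - prev["use"]      # ms with at least one IO in flight
    d_aveq = cur["aveq"] - prev["aveq"]   # weighted ms spent doing IOs
    # Fraction of the interval the device was busy (interval in seconds).
    util = d_use / (interval * 1000.0)
    # Average queue size: aveq delta divided by use delta.
    avg_queue = d_aveq / d_use if d_use else 0.0
    return util, avg_queue

util, qsz = util_and_queue({"use": 0, "aveq": 0},
                           {"use": 5000, "aveq": 20000}, 10.0)
```

Here the device was busy half the time and averaged four IOs in the queue while it was busy.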

Total bandwidth (rsect + wsect) divided by use gives you what I will call a 'burst bandwidth' figure. Writing 100 MBytes to an SSD and an HD over the course of a second gives you the same per-second IO rate for both; however, if the HD took the full second to complete the write (with a use of the full second) while the SSD took a quarter of a second (with a use of .25 second), the SSD was actually writing at 400 MBytes/sec instead of the HD's 100 MBytes/sec. Under most situations this is going to be close to the actual device bandwidth, because you won't have IOs in the queue without at least one being dispatched to the drive itself.
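The SSD versus HD example can be sketched directly (using decimal megabytes, and with the same invented numbers as above): divide sectors transferred by the time the device was actually busy, not by wall-clock time.

```python
def burst_bw_mb(rsect, wsect, use_ms):
    # Sectors transferred over the time the device was busy, in MB/sec.
    return (rsect + wsect) * 512 / (use_ms / 1000.0) / (1000 * 1000)

sectors = 100 * 1000 * 1000 // 512       # roughly 100 MB of writes
hd_bw = burst_bw_mb(0, sectors, use_ms=1000)  # busy the whole second
ssd_bw = burst_bw_mb(0, sectors, use_ms=250)  # busy a quarter of it
```

Both devices show the same 100 MB/sec wall-clock write rate, but the burst figure recovers the SSD's roughly 400 MB/sec.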

There are probably additional stats you can calculate that haven't occurred to me yet, partly because I don't think about this very often (it certainly feels like there should be additional clever derived stats). I only clued in to the potential importance of my 'burst bandwidth' figure today, in fact, as I was thinking about what else you could do with use.

If you want latency histograms for your IO and other more advanced things, you can turn to eBPF, for instance through the BCC tools. See also eg Brendan Gregg on eBPF. But all of that is beyond the scope of this entry, partly because I've done basically nothing with these eBPF tools yet for various reasons.

Sidebar: Disk IO stats for software RAID devices and LVM

The only non-zero stats that software RAID devices provide (at least for mirrors and stripes) are read and write IOs completed and the number of sectors read and written. Unfortunately we don't get any sort of time or utilization information for software RAID IO devices.

LVM devices (which on my Fedora systems show up in /proc/diskstats as 'dm-N') do appear to provide time information; I see non-zero values for my ruse, wuse, use, and aveq fields. I don't know how accurate they are and I haven't attempted to use any of my tools to see.

linux/DiskIOStatsIII written at 23:30:56

Qualified praise for the Linux ss program

For a long time now, I've reached for a combination of netstat and 'lsof -n -i' whenever I wanted to know things like who was talking to what on a machine. Mostly I've tended to use lsof, even though it's slower, because I find netstat to be vaguely annoying (and I can never remember the exact options I want without checking the manpage yet again). Recently I've started to use another program for this, ss, which is part of the iproute2 suite (also Wikipedia).

The advantage of ss is that it will give you a bunch of useful information, quite compactly, and it will do this very fast and without fuss and bother. Do you want to know every listening TCP socket and what program or programs are behind it? Then you want 'ss -tlp'. The output is pretty parseable, which makes it easy to feed to programs, and a fair bit of information is available without root privileges. You can also have ss filter the output so that you don't have to, or at least so that you don't have to do as much.

In addition, some of the information that ss will give you is relatively hard to get anywhere else (or at least easily) and can be crucial to understanding network issues. For example, 'ss -i' will show you the PMTU and MSS of TCP connections, which can be very useful for some sorts of network issues.

One recent case where I reached for ss was when I wanted to get a list of connections to the local machine's port 25 and port 587, so I could generate metrics information for how many SMTP connections our mail servers were seeing. In ss, the basic command for this is:

ss -t state established '( sport = :25 or sport = :587 )'

(Tracking this information was useful to establish that we really were seeing a blizzard of would-be spammers connecting to our external MX gateway and clogging up its available SMTP connections.)
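Turning that listing into per-port counts is a short parsing job. This sketch works on sample lines imitating the output of the ss command above (run with -H to suppress the header); the addresses are invented and the exact columns can vary by ss version, so treat the column index as an assumption.

```python
from collections import Counter

# Invented sample lines in the shape of
# `ss -t -H state established '( sport = :25 or sport = :587 )'` output:
# Recv-Q, Send-Q, local address:port, peer address:port.
sample_output = """\
0  0  192.0.2.10:25   198.51.100.7:51234
0  0  192.0.2.10:25   203.0.113.9:40022
0  0  192.0.2.10:587  198.51.100.8:55100
"""

counts = Counter()
for line in sample_output.splitlines():
    local = line.split()[2]              # local address:port column
    port = local.rsplit(":", 1)[1]       # rsplit copes with IPv6 addresses
    counts[port] += 1
```

In a metrics script you would feed the real ss output through this instead of the sample text.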

Unfortunately, this is where the qualifications come in. As you can see here, ss has a filtering language, and a reasonably capable one at that. Unfortunately, this filtering language is rather underdocumented (much like many things in iproute2). Using ss without any real documentation on its filtering language is kind of frustrating, even when I'm not trying to write a filter expression. There is probably a bunch of power that I could use, except it's on the other side of a glass wall and I can't touch it. In theory there's documentation somewhere; in practice I'm left reading other people's articles like this and this copy of the original documentation.

(This is my big lament about ss.)

As you'll see if you play around with it, ss also has a weird output format for all of its extended information. I'm sure it makes sense to its authors, and you can extract it with determination ('egrep -o' will help), but it isn't the easiest thing in the world to deal with. It's also not the most readable thing in the world if you're using ss interactively. It helps a bit to have a very wide terminal window.
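In the spirit of the 'egrep -o' approach, here is a sketch of pulling specific fields out of that extended output with a regular expression. The sample line imitates 'ss -i' detail output; the exact set and order of fields varies by kernel and ss version, so the line itself is an invented approximation.

```python
import re

# An invented line in the shape of `ss -i` per-connection detail output:
detail = ("cubic wscale:7,7 rto:204 rtt:1.5/0.75 mss:1448 "
          "pmtu:1500 cwnd:10 bytes_acked:4321")

# Extract just the key:value fields we care about.
fields = dict(re.findall(r"\b(mss|pmtu|cwnd):(\d+)", detail))
```

The same pattern (anchored key names, one capture for the value) extends to whatever other fields you need.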

Despite my gripes about it, I've wound up finding ss an increasingly important tool that I reach for more and more. Partly this is for all of the information it can tell me, partly it's for the filtering capabilities, and partly it's for its speed and low impact on the system.

(Also, unlike lsof, it doesn't complain about random things every so often.)

(ss was mentioned in passing back when I wrote about how there's real reasons for Linux to replace ifconfig and netstat. I don't think of ss as a replacement for netstat so much as something that effectively obsoletes it; ss is just better, even in its relatively scantily documented and awkward state. With that said, modern Linux netstat actually shows more information than I was expecting, and in some ways it's in a more convenient and readable form than ss provides. I'm probably still going to stick with ss for various reasons.)

linux/SsQualifiedPraise written at 00:30:49
