2018-11-23
Some Linux disk IO stats you can calculate from kernel information
I've written in the (distant) past about what disk IO stats you get from the Linux kernel and what per-partition stats you get. Now I'm interested in what additional stats you can calculate from these, especially stats that aren't entirely obvious.
These days, the kernel's Documentation/iostats.txt more or less fully documents all of the raw stats that the kernel makes available (it's even recently been updated to add some very recent new stats that are only available in 4.18+), and it's worth reading in general. However, for the rest of this entry I'm mostly going to use my names for the fields from my original entry for lack of better ones (iostats.txt describes the fields but doesn't give them short names).
The total number of reads or writes submitted to the kernel block layer is rio plus rmerge (or wio plus wmerge), since rio includes only reads that go to the disk. As it was more than a decade ago, the count of merged IOs (rmerge and wmerge) is increased when the IO is submitted, while the count of completed IOs only increases at the end of things. However, these days the count of sectors read and written is for completed IO, not submitted IO. If you ignore merged IOs (and I usually do), you can basically ignore the difference between completed IOs and submitted but not yet completed IOs.
(A discussion of the issues here is beyond the scope of this entry.)
The per-second rate of rio and wio over time is the average number of IOs per second. The per-second rate of rsect and wsect is similarly the average bandwidth per second (well, once you convert it from 512-byte 'sectors' to bytes or KB or whatever unit you like). As usual, averages can conceal wild variations and wild outliers. Bytes over time divided by requests over time gives you the average request size, either for a particular type of IO or in general.
(When using these stats, always remember that averages mislead (in several different ways), and the larger the time interval the more they mislead. Unfortunately the Linux kernel doesn't directly provide better stats.)
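To make the arithmetic concrete, here is a little Python sketch of these calculations; it samples /proc/diskstats twice and reports per-second rates and average request sizes for a single device. The field positions follow Documentation/iostats.txt, and the device name and the one-second interval are just placeholders, not anything special.

import time

DEV = "sda"          # placeholder device name; substitute your own
INTERVAL = 1.0       # seconds between samples

def read_stats(dev):
    # Field order per Documentation/iostats.txt: after the major and minor
    # numbers and the device name come rio, rmerge, rsect, ruse, wio,
    # wmerge, wsect, wuse, IOs-in-flight, use, aveq (ignoring any newer
    # 4.18+ fields that may follow).
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == dev:
                return [int(x) for x in fields[3:14]]
    raise ValueError("device not found: " + dev)

before = read_stats(DEV)
time.sleep(INTERVAL)
after = read_stats(DEV)
# IOs-in-flight is a gauge, not a counter, so its delta is meaningless here.
rio, rmerge, rsect, ruse, wio, wmerge, wsect, wuse, _, use, aveq = \
    [a - b for a, b in zip(after, before)]

print("read IOs/sec:   %.1f" % (rio / INTERVAL))
print("write IOs/sec:  %.1f" % (wio / INTERVAL))
# 'sectors' in these stats are always 512 bytes
print("read KB/sec:    %.1f" % (rsect * 512 / 1024 / INTERVAL))
print("write KB/sec:   %.1f" % (wsect * 512 / 1024 / INTERVAL))
if rio:
    print("avg read size (KB):  %.1f" % (rsect * 512 / 1024 / rio))
if wio:
    print("avg write size (KB): %.1f" % (wsect * 512 / 1024 / wio))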
The amount of time spent reading or writing (ruse or wuse) divided by the number of read or write IOs gives you the average IO completion time. This is not the device service time; it's the total time from initial submission, including waiting in kernel queues to be dispatched to the device. If you want the average wait time across all IOs, you should compute this as '(ruse + wuse) / (rio + wio)', because you want to weight the average wait time for each type of IO by how many of them there were.
(In other words, a lot of very fast reads and a few slow writes should give you a still very low average wait time.)
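To make that concrete with made-up numbers: 1,000 reads taking 0.2 ms each (ruse goes up by 200) plus 10 writes taking 50 ms each (wuse goes up by 500) gives an overall average wait of (200 + 500) / (1000 + 10), which is about 0.7 ms, even though the average write took 50 ms.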
The amount of time that there has been at least one IO in flight (my use), divided by the length of your time period, gives you the device utilization; if you're generating per-second figures, it should never exceed 1 (once you convert use from milliseconds to seconds), but it may be less. The average queue size is 'weighted milliseconds spent doing IOs' (my aveq) divided by use.
Total bandwidth (rsect + wsect) divided by use gives you what I will call a 'burst bandwidth' figure. Writing 100 MBytes to an SSD and an HD over the course of a second gives you the same per-second bandwidth figure for both; however, if the HD took the full second to complete the write (with a use of the full second) while the SSD took a quarter of a second (with a use of .25 second), the SSD was actually writing at 400 MBytes/sec instead of the HD's 100 MBytes/sec. Under most situations this is going to be close to the actual device bandwidth, because you won't have IOs in the queue without at least one being dispatched to the drive itself.
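Putting the last two calculations together with burst bandwidth, here is a small sketch with made-up deltas (the aveq delta in particular is invented purely for illustration); in real use the deltas would come from sampling /proc/diskstats as in the earlier sketch.

interval = 1.0                 # seconds between samples
d_use = 250                    # ms with at least one IO in flight (the SSD case)
d_aveq = 500                   # weighted ms spent doing IOs (invented number)
d_rsect, d_wsect = 0, 204800   # sectors read/written; 204800 sectors = 100 MBytes

utilization = (d_use / 1000.0) / interval       # 0.25, i.e. 25% busy
avg_queue = d_aveq / d_use                      # 2.0 requests on average
burst_mb = (d_rsect + d_wsect) * 512 / (1024 * 1024) / (d_use / 1000.0)
print(utilization, avg_queue, burst_mb)         # 0.25 2.0 400.0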
There are probably additional stats you can calculate that haven't occurred to me yet, partly because I don't think about this very often (it certainly feels like there should be additional clever derived stats). I only clued in to the potential importance of my 'burst bandwidth' figure today, in fact, as I was thinking about what else you could do with use.
If you want latency histograms for your IO and other more advanced things, you can turn to eBPF, for instance through the BCC tools. See also eg Brendan Gregg on eBPF. But all of that is beyond the scope of this entry, partly because I've done basically nothing with these eBPF tools yet for various reasons.
Sidebar: Disk IO stats for software RAID devices and LVM
The only non-zero stats that software RAID devices provide (at least for mirrors and stripes) are read and write IOs completed and the number of sectors read and written. Unfortunately we don't get any sort of time or utilization information for software RAID devices.
LVM devices (which on my Fedora systems show up in /proc/diskstats as 'dm-N') do appear to provide time information; I see non-zero values for my ruse, wuse, use, and aveq fields. I don't know how accurate they are and I haven't attempted to use any of my tools to see.
Qualified praise for the Linux ss program
For a long time now, I've reached for a combination of netstat and 'lsof -n -i' whenever I wanted to know things like who was talking to what on a machine. Mostly I've tended to use lsof, even though it's slower, because I find netstat to be vaguely annoying (and I can never remember the exact options I want without checking the manpage yet again). Recently I've started to use another program for this, ss, which is part of the iproute2 suite (also Wikipedia).
The advantage of ss is that it will give you a bunch of useful information, quite compactly, and it will do this very fast and without fuss and bother. Do you want to know every listening TCP socket and what program or programs are behind it? Then you want 'ss -tlp'. The output is pretty parseable, which makes it easy to feed to programs, and a fair bit of information is available without root privileges. You can also have ss filter the output so that you don't have to, or at least so that you don't have to do as much.
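As an illustration of the 'pretty parseable' part, here is a rough sketch of feeding 'ss -tlpn' to a little Python program to list listening TCP ports and whatever process information ss reports for them. The column positions it assumes match the ss versions I've used but aren't guaranteed, and the process column generally needs root to be filled in for other people's processes.

import subprocess

# '-n' is added to the entry's 'ss -tlp' to keep ports numeric.
out = subprocess.run(["ss", "-tlpn"], capture_output=True, text=True).stdout
for line in out.splitlines()[1:]:      # skip the header line
    fields = line.split()
    if len(fields) < 5:
        continue
    local = fields[3]                  # e.g. '0.0.0.0:22' or '[::]:25'
    procinfo = " ".join(fields[5:])    # e.g. 'users:(("sshd",pid=...,fd=...))'
    print(local, procinfo)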
In addition, some of the information that ss will give you is relatively hard to get anywhere else (or at least easily) and can be crucial to understanding network issues. For example, 'ss -i' will show you the PMTU and MSS of TCP connections, which can be very useful for some sorts of network issues.
One recent case where I reached for ss was when I wanted to get a list of connections to the local machine's port 25 and port 587, so I could generate metrics information for how many SMTP connections our mail servers were seeing. In ss, the basic command for this is:
ss -t state established '( sport = :25 or sport = :587 )'
(Tracking this information was useful to establish that we really were seeing a blizzard of would-be spammers connecting to our external MX gateway and clogging up its available SMTP connections.)
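If you want to turn this into a metric, a minimal sketch is to run that exact command from a program and count the lines of output; everything after ss's header line is one connection. This is just my quick approach, not anything blessed by ss itself.

import subprocess

CMD = ["ss", "-t", "state", "established",
       "( sport = :25 or sport = :587 )"]

out = subprocess.run(CMD, capture_output=True, text=True).stdout
lines = out.splitlines()
# the first line is the column header; everything after it is a connection
count = max(0, len(lines) - 1)
print("smtp_connections", count)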
Unfortunately, this is where the qualifications come in. As you can see here, ss has a filtering language, and a reasonably capable one at that. Unfortunately, this filtering language is rather underdocumented (much like many things in iproute2). Using ss without any real documentation on its filtering language is kind of frustrating, even when I'm not trying to write a filter expression. There is probably a bunch of power that I could use, except it's on the other side of a glass wall and I can't touch it. In theory there's documentation somewhere; in practice I'm left reading other people's articles like this and this copy of the original documentation.
(This is my big lament about ss.)
As you'll see if you play around with it, ss also has a weird output format for all of its extended information. I'm sure it makes sense to its authors, and you can extract it with determination ('egrep -o' will help), but it isn't the easiest thing in the world to deal with. It's also not the most readable thing in the world if you're using ss interactively. It helps a bit to have a very wide terminal window.
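To give a concrete example of the kind of extraction I mean, here is a sketch that pulls the rtt figures out of 'ss -ti' output with a Python regular expression instead of egrep -o. The 'rtt:' label is what the ss versions I've used print and may not be universal.

import re
import subprocess

# The moral equivalent of: ss -ti | egrep -o 'rtt:[^ ]*'
out = subprocess.run(["ss", "-ti"], capture_output=True, text=True).stdout
for rtt in re.findall(r"rtt:[0-9.]+/[0-9.]+", out):
    print(rtt)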
Despite my gripes about it, I've wound up finding ss an increasingly important tool that I reach for more and more. Partly this is for all of the information it can tell me, partly it's for the filtering capabilities, and partly it's for its speed and low impact on the system.
(Also, unlike lsof, it doesn't complain about random things every so often.)
(ss was mentioned in passing back when I wrote about how there's real reasons for Linux to replace ifconfig and netstat. I don't think of ss as a replacement for netstat so much as something that effectively obsoletes it; ss is just better, even in its relatively scantily documented and awkward state. With that said, modern Linux netstat actually shows more information than I was expecting, and in some ways it's in a more convenient and readable form than ss provides. I'm probably still going to stick with ss for various reasons.)