Notes on Linux's blktrace
Blktrace is a tool that captures detailed traces of what's happening
in the kernel's block layer. In other words, think of it as a
tcpdump for disk IO. This is just the kind of thing you need to
deal with the fact that averages are misleading; since blktrace can tell you about the
timing of every request, you can use it as the starting point for
all sorts of detailed analysis. Or you can simply use it to verify
that you don't have any timing outliers in your disk IO. Blktrace
requires that your kernel be vaguely modern and have support for
block tracing enabled, but both of these are routine. I think that
most distributions have prepackaged versions of it that are just a
package manager command away; if not, see here.
Blktrace needs no special support or configuration apart from having
debugfs mounted on
/sys/kernel/debug, and if you don't have that
blktrace will tell you so you can fix it with a '
mount -t debugfs
My brief recent
blktrace have all been positive. It just works
and it doesn't seem to demand anything much from the system. I felt
comfortable applying it to a production machine and I didn't observe
any effects of doing so.
Here are some scattered notes on it:
- the manpage you really want to read for information on what you can
get out of blktrace is the blkparse manpage. The blktrace manpage is less interesting.
Unfortunately the blkparse manpage doesn't completely document
the output you'll see. When in doubt, you'll have to read the
blktrace -k' apparently doesn't always work and sometimes leaves you unable to do more block tracing until you reboot. Avoid it; just ^C your
- if you have a multi-CPU machine and blktrace has written multiple
output files per disk, you should only specify one CPU's file for
each disk in '
blkparse -i <file> ...'. blkparse will derive the other CPU's files from your single one and load them all. If you innocently give all of the files in
-iarguments, blkparse will process each of them repeatedly and print duplicate records in the trace data. Speaking from personal experience, this can lead to a fair degree of head-scratching.
(The easy way to do this is to just run it as '
blkparse [options] *.0'.)
blkparsealso produces a summary report at the end of the trace records. This is easy to miss if, like me, you make no attempt to read the entire trace output but instead start applying filtering via grep and awk and so on.
- The 'read depth' and 'write depth' numbers are the maximum queue
depth ever observed. They say nothing about the average (or the
distribution of values).
- the IO completion time deltas produced by
blkparse -tare relative to when the request was issued to the hardware driver. If you care enough about this, it's easy enough to change the blkparse code to do whatever you need. (In an ideal world there would be a command line option for this.)
- if you just use '
blkparse -t' without specifying any formats, the time deltas are in nanoseconds, not the microseconds that you'll get with a custom format and the
- in general, the default output for the various sorts of request
records contains stuff that you currently can't duplicate with
- the 'sequence numbers' are not unique, and I've seen the clock
at least seem to go backwards (I was using a custom format, so
maybe that had something to do with it). However, the order that
blkparseprints things in seems to be the right one.
I may have more notes and commentary later, when the dust around here has settled and I've explored blktrace more. But on the positive side, it was very easy to use blktrace and blkparse to verify that our problem was not with the disks and being masked by an average.
(Never let it be said that I don't take requests. In fact I love requests since they mean I don't have to come up with entry ideas myself.)