Notes on Linux's blktrace

October 5, 2012

Blktrace is a tool that captures detailed traces of what's happening in the kernel's block layer. In other words, think of it as a tcpdump for disk IO. This is just the kind of thing you need to deal with the fact that averages are misleading; since blktrace can tell you about the timing of every request, you can use it as the starting point for all sorts of detailed analysis. Or you can simply use it to verify that you don't have any timing outliers in your disk IO. Blktrace requires that your kernel be vaguely modern and have support for block tracing enabled, but both of these are routine. I think that most distributions have prepackaged versions of it that are just a package manager command away; if not, see here. Blktrace needs no special support or configuration apart from having debugfs mounted on /sys/kernel/debug, and if you don't have that blktrace will tell you so you can fix it with a 'mount -t debugfs debugfs /sys/kernel/debug'.

My brief recent experiences with blktrace have all been positive. It just works and it doesn't seem to demand anything much from the system. I felt comfortable applying it to a production machine and I didn't observe any effects of doing so.

Here are some scattered notes on it:

  • the manpage you really want to read for information on what you can get out of blktrace is the blkparse manpage. The blktrace manpage is less interesting. Unfortunately the blkparse manpage doesn't completely document the output you'll see. When in doubt, you'll have to read the source.

  • 'blktrace -k' apparently doesn't always work and sometimes leaves you unable to do more block tracing until you reboot. Avoid it; just ^C your blktrace commands.

  • if you have a multi-CPU machine and blktrace has written multiple output files per disk, you should only specify one CPU's file for each disk in 'blkparse -i <file> ...'. blkparse will derive the other CPU's files from your single one and load them all. If you innocently give all of the files in -i arguments, blkparse will process each of them repeatedly and print duplicate records in the trace data. Speaking from personal experience, this can lead to a fair degree of head-scratching.

    (The easy way to do this is to just run it as 'blkparse [options] *.0'.)

  • blkparse also produces a summary report at the end of the trace records. This is easy to miss if, like me, you make no attempt to read the entire trace output but instead start applying filtering via grep and awk and so on.
  • The 'read depth' and 'write depth' numbers are the maximum queue depth ever observed. They say nothing about the average (or the distribution of values).

  • the IO completion time deltas produced by blkparse -t are relative to when the request was issued to the hardware driver. If you care enough about this, it's easy enough to change the blkparse code to do whatever you need. (In an ideal world there would be a command line option for this.)

  • if you just use 'blkparse -t' without specifying any formats, the time deltas are in nanoseconds, not the microseconds that you'll get with a custom format and the %u formatting option.

  • in general, the default output for the various sorts of request records contains stuff that you currently can't duplicate with format strings.

  • the 'sequence numbers' are not unique, and I've seen the clock at least seem to go backwards (I was using a custom format, so maybe that had something to do with it). However, the order that blkparse prints things in seems to be the right one.

See also this blog entry on it from 2009 (via @spyced).

I may have more notes and commentary later, when the dust around here has settled and I've explored blktrace more. But on the positive side, it was very easy to use blktrace and blkparse to verify that our problem was not with the disks and being masked by an average.

(Never let it be said that I don't take requests. In fact I love requests since they mean I don't have to come up with entry ideas myself.)

Written on 05 October 2012.
« Averages mislead: an illustrated example
How averages mislead you »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Fri Oct 5 02:07:45 2012
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.