DTrace: counting, aggregates, and a nice undocumented printa() feature

October 13, 2012

Suppose that you want to keep a running count of something in a DTrace script, for example the number of outstanding disk IO requests. The obvious and most flexible way to do this is to use an ordinary DTrace global variable or array; you can even find examples of this in various DTrace scripts. Unfortunately, this only works in restricted situations. If you use this generally, in a number of situations you'll sooner or later experience lost counter events. My guess is that DTrace deliberately doesn't attempt to do atomic updates of its global variables (at least not with '+='), so DTrace on two CPUs can race each other and overwrite values.

(Guess what I spent part of a day discovering recently.)

What you want to use instead is an aggregate that you increment and decrement with sum(1) and sum(-1):

fbt::zio_create:return
{ @zactive = sum(1);  tracked[args[1]] = 1; }

fbt::zio_done:entry
/ tracked[args[0]] /
{ @zactive = sum(-1); tracked[args[0]] = 0; }

This is fully reliable and will not miscount events. You need the funny condition on zio_done because there are likely to be in-flight ZIOs when your DTrace script starts; if you do not guard the counter decrement this way, you will decrement your counter when they complete without ever having incremented it, and your counter will go negative.

(I made this mistake too. I've had lots of learning experiences with DTrace recently.)

You can use a keyed aggregate to track per-thing events, for example tracking the in-flight ZIOs per pool. The unfortunate drawback to this is that as far as I know, there's no way to get or print just a single key from an aggregate. My example is that I would like to track in-flight IOs on a per-pool basis and then when a ZFS IO takes a long time dump a bunch of information about it including the count of active IOs for that pool only. But as far as I know there's no way to do this; I can only dump the active counts for all pools. This is one of the situations where a global array would be much better, because then I could pull out specific keys.

Another limitation of printing aggregates comes up here; there's no way to not print aggregate keys where the value is zero or to drop only keys with a zero value from an aggregate. At the most I'm only interested in pools with currently in-flight IOs, but I have to dump out a long list of pools that have none because DTrace gives me no choice.

(The sad bit is that I believe that all of the aggregate formatting and output is all in the user side of DTrace instead of in the kernel, so in theory you could build a new version that did all of this. But I suspect the official DTrace answer is 'use a program to postprocess the raw output of your DTrace script'. My reaction to that does not fit in the margins of this entry.)

In general, my opinion is that you want to use aggregates for any global variable that you'll do arithmetic on as your DTrace script is running unless you're completely sure that you have no multi-thread races. There are situations where you can be sure of this (for example, you are tracking something that is itself intrinsically single-threaded), but be careful.

Now we get to the very useful but completely undocumented feature of printa(), which is handy when you're using aggregates as basically thread-safe global variables. Namely, printa() can print multiple aggregates at once. This looks like:

tick-10sec
{ 
  [...]
  /* aggregate used as simple global variable */
  printa("Total ZIO: %@5d MB read %@5d MB write\n",
         @zread, @zwrite);
  [...]
  /* keyed aggregate, by pool name */
  printf("Active ZIO in pools:\n");
  printa("%@6d in %-16s  %@3d read / %@3d write\n",
         @zactive, @zractive, @zwactive);
  [...]
}

(For basic format string usage of printa(), see the end of here.)

There are two special limitations of multi-aggregate printa()s for keyed aggregates. First, the order is determined by the first aggregate to be printed (and as usual, you only get smallest to largest). Second, you can't usefully truncate aggregates to show only the top N (or the bottom N) entries.

The problem with truncation is that a keyed multi-aggregate printa() prints a record for each key that's in any aggregate (filling in zeros if some of the other aggregates don't have data for that key). If you truncate one aggregate to the top N entries, all the other aggregates still have their full keys; if you truncate all aggregates to the top N entries, they're highly likely to all have a different top N so you'll still wind up with more than N keys (and a lot of zero fields being printed).

(In other words, keyed multi-aggregate printa() is like a SQL full outer join.)

PS: dealing with this made me realize why printa() with a format doesn't force newlines; it allows you to intermix printf() for formatting regular variables and printa() to get your aggregate global variables out, all on a single printed line.

Written on 13 October 2012.
« SSL CAs have an impossible job (if you want them to be thorough)
DTrace: two mistakes you can make in DTrace scripts »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Sat Oct 13 02:39:29 2012
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.