2012-10-31
Our DTrace scripts for NFS server, ZFS, and iSCSI initiator monitoring
As a result of recent events, I've built up a collection of DTrace scripts for monitoring and reporting on our fileserver environment, where we use NFS v3 on top of ZFS on top of iSCSI. Since I grumbled earlier about the lack of easily findable DTrace scripts for this, I've made our scripts public on Github as siebenmann/cks-dtrace (where you can read more details about what's there). They're written for Solaris 10 update 8 (plus some patches) and do play around with kernel data structures.
These scripts are somewhat specific to our environment and contain various local assumptions (some of them commented). They're also not the best DTrace code possible, and in fact they contain several generations of my DTrace code as I steadily learned more about what I was doing (if I was very keen, I would go back to rewrite the older code in the current best style I know).
In addition to their straightforward use, these scripts may serve as a useful example of both how to do various things with DTrace and how to extract various bits of information from the Solaris kernel. In an ideal world there would be a DTrace wiki with information of the form 'given an X, here's how you get a Y' (such as 'given a vnode, here's how you get its ZFS pool'), but as far as I know right now you have to find various little tricks in various people's DTrace scripts.
(I'd be overjoyed to be wrong about this.)
In the 'giving proper credit' department: I didn't come up with these scripts in a void, using only my ingenuity; instead, I stand on the shoulders of many giants. I would not have gotten anywhere near as far as I have without taking all sorts of DTrace tricks and clever ways of extracting various bits of kernel information from other people's DTrace scripts, often ZFS-related scripts. Useful references that I have in my notes include Brendan Gregg (and various scripts in the DTrace book's chapter 5) and Richard Elling's zilstat.
(And as always, all of these scripts would have been basically impossible without the OpenSolaris kernel code. That a lack of kernel source code cripples DTrace is one reason I remain quite angry with Oracle's decision to close the Solaris 11 source. The more I use DTrace, the more convinced I am that we'll never move to Solaris 11 (unless Oracle has a total change of heart).)
2012-10-29
How ZFS file prefetching seems to work
Since I was just digging into this, I want to write down what I've learned before I forget it. I will give you the punchline right up front: if you want to do IO that avoids ZFS prefetching as much as possible, you need to use randomized IO. It also turns out that ZFS prefetching has what I consider some flaws that can cause IO heartburn.
(You should use a fixed seed for your randomization function and watch ZFS statistics to make sure that you really are avoiding the prefetcher. Change the seed as necessary.)
So here is what I've deduced from reading the ZFS code in OpenSolaris
and from some experimentation with DTrace instrumentation. First off,
ZFS read prefetching operates on ZFS blocks, which are generally 128
KB (the recordsize ZFS property), and uses what it calls 'streams'.
Each stream can recognize one pattern of sequential IO: linear forward
reads, linear backwards reads, and forward or backwards reads with a
'stride' (where you skip forward or backwards N blocks every time you do
a read). Each file can have up to eight separate streams associated with
it. How streams are created and deleted is complex and I don't entirely
understand it, but I believe the basic approach is that ZFS attempts to
match your read IO with an existing stream and if it can't it tries
to create a new stream (there is some tricky code to discover strided
reads).
When a stream matches an IO pattern, it triggers a prefetch read; this fetches some amount of reads ahead of what you're reading now. The more times the stream matches, the more it reads ahead; how much it reads ahead starts at one read and more or less doubles every time the stream is used, up to a maximum size. The prefetch is for the size of read you're normally doing (so if you read one block, the prefetch is for one block), but it may stride forward to cover multiple expected future reads. For example, if you are reading one block every 10 blocks, after a while it will be fetching one block 10 blocks ahead, one block 20 blocks ahead, one block 30 blocks ahead, and so on.
What makes it basically impossible to avoid this prefetching with a sufficiently cleverly chosen and unpredictable pattern is how streams get recycled. Normally you might think that new IOs and new (attempted) streams would push out old streams, so if you just flood ZFS with a sufficiently long sequence of unpatterned IO you would be fine. It turns out that this is not how ZFS does it; streams only get recycled based on time since last use. A stream must be unused for at least 2 seconds before it will be tossed in favour of a new attempt to create a stream. So if you are doing a bunch of unpredictable IO on a file, your first eight or so IOs will create eight initial streams, which will then sit there for two seconds attempting to match themselves up with some IO you're doing. Only after those 2 seconds will they start to be tossed in favour of new ones (and then this cycle repeats for another two seconds or so, assuming that you can do completely unpatterned IO).
Given that two seconds covers both a significant amount of IO and worse, an unpredictable amount, this is why I say that ZFS prefetching can only really be defeated by randomized IO.
This prefetching approach has what I consider to be several flaws. The largest single flaw is that ZFS prefetching does not check to see that its work was useful. What it cares about is that it matched your IO pattern; it doesn't notice if the data it prefetched for you expired from the ARC unread. Under memory pressure this combines explosively with two other ZFS prefetching features. First, prefetch streams seem to be valid forever as long as the znode for their file stays in memory; they are more or less never declared too old to be useful. Second, as previously mentioned, a successful match against your IO pattern triggers more and more prefetching, even if this prefetched data will just expire unread from the ARC. Now that I understand the ZFS prefetching mechanisms, I can see why our prefetching problems happened; sequential reads from a lot of files at once is a worst case for ZFS prefetching.
This shows how vital it is for any prefetch mechanism to have end-to-end feedback. Prefetch should not be considered successful until the user-level code itself has read the prefetched data from cache. Merely predicting the user-level IO pattern is not good enough except in favorable situations.
(For example, as far as I can see ZFS will happily prefetch the maximum amount for a file that you are reading through sequentially at a rate of one read every five minutes even in the presence of a huge IO load that evicts the prefetched data from the ARC long before you issue the next read.)
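If you want to check whether your IO really is dodging the prefetcher, a rough DTrace sketch like the following can help; it simply counts calls into the zfetch code. The dmu_zfetch* wildcard and the @calls name are mine, and the assumption is that the prefetch functions keep the names they have in the OpenSolaris dmu_zfetch.c (dmu_zfetch(), dmu_zfetch_dofetch(), and so on); check your own kernel source if the probes don't match anything.
/* Count calls into the ZFS prefetch ('zfetch') code every 10 seconds.
   If these counts stay essentially flat while your randomized IO runs,
   you are probably not triggering prefetching. */
fbt::dmu_zfetch*:entry
{ @calls[probefunc] = count(); }

tick-10sec
{ printa("%-28s %@8d\n", @calls);
  printf("\n");
  trunc(@calls);
}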
Sidebar: the tunables and code involved in this
All of the prefetching code is in uts/common/fs/zfs/dmu_zfetch.c in your handy OpenSolaris or Illumos source repository. In OpenSolaris, the available tunables are:
zfs_prefetch_disable: disable prefetching entirely.
zfetch_max_streams: the maximum number of streams per file; defaults to 8.
zfetch_block_cap: the maximum number of blocks that we can ever prefetch when a stream triggers prefetch (these blocks may be split up between multiple IO positions); defaults to 256 blocks.
zfetch_array_rd_sz: do not do prefetching for reads larger than this size (in bytes); defaults to 1 MB.
zfetch_min_sec_reap: the minimum seconds before an inactive stream can be reclaimed; defaults to 2 seconds.
All of these tunables are global ones; they affect all ZFS filesystems and all ZFS pools.
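For what it's worth, you can inspect the current values of these tunables from DTrace itself using its backquote syntax for kernel variables. A minimal sketch (the casts are only there to keep printf() happy, since the tunables are not all the same integer type):
BEGIN
{ printf("zfs_prefetch_disable = %d\n", (int)`zfs_prefetch_disable);
  printf("zfetch_max_streams   = %d\n", (int)`zfetch_max_streams);
  printf("zfetch_block_cap     = %d\n", (int)`zfetch_block_cap);
  printf("zfetch_array_rd_sz   = %d\n", (int)`zfetch_array_rd_sz);
  printf("zfetch_min_sec_reap  = %d\n", (int)`zfetch_min_sec_reap);
  exit(0);
}
(Setting them persistently is normally done with /etc/system lines such as 'set zfs:zfs_prefetch_disable = 1' and a reboot.)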
2012-10-14
DTrace: two mistakes you can make in DTrace scripts
There are certain mistakes that you can make when writing DTrace scripts that are worth pointing out, partly because I've made them myself and partly because I've seen them in other people's scripts on the net. To serve as an example of this, suppose that you want to track ZFS read and write volume, so you write the following:
fbt::zfs_read:entry
{ @zreads = sum(args[1]->uio_resid); }
fbt::zfs_write:entry
{ @zwrites = sum(args[1]->uio_resid); }
This is perfectly good DTrace code and it works fine most of the time, but it has two significant issues.
The first is what I'll call the initiated versus completed problem.
Because we are tracing on entry to zfs_read and zfs_write,
we're counting IO activity when it's initiated, not when it's performed
(whatever that means) or when it's completed. If IO completes rapidly
and we're using a long time base, this doesn't matter; but if either of
those isn't true, this may give you misleading or confusing numbers,
ones that do not match up with the reality experienced by applications.
When to account for activity is in general a hard problem with no clear generic answers; how you want to do it can depend very much on what questions you want answers to. But my default assumption is to account for things on completion, especially for IO, because that's when a user-level program can go on; it's also clearly counting the past instead of anticipating the future. You may have different answers, but you definitely want to think about the issue. And if you pick 'counted when initiated', you probably should write some sort of comment about why this makes sense in your particular case.
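For the volume-counting example above, moving the accounting to completion is a small change; here's a minimal sketch for the read side (the write side is the same with zfs_write; the self->resid name is just illustrative):
fbt::zfs_read:entry
{ self->resid = args[1]->uio_resid; }

fbt::zfs_read:return
/ self->resid /
{ @zreads = sum(self->resid);
  self->resid = 0;
}
This still counts the amount of IO that was asked for rather than the amount that actually happened, which brings us to the second problem.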
The second problem is that this blithely ignores the possibility of
errors of all sorts. Both zfs_read and zfs_write can be asked to
do impossible things and can run into problems; in both cases, some or
all of the IO will not get done. Accounting for things when you enter a
routine necessarily means ignoring errors, but you can make this mistake
even when you trace returns (I speak from personal experience here;
fortunately it gave me crazy numbers and some other problems until I
realized what was going on).
(Note that you don't need to run into actual errors to get incorrect results with reading, because short reads can easily happen; consider issuing a large read when you're almost at the end of a file.)
Unfortunately, knowing what sort of errors are possible and how to compensate for them generally takes reading (Open)Solaris kernel source and probably some experimentation. I have to admit that right now I don't know how to do fully correct IO size accounting for these routines, for example, because I'm not sure how to get the actual amount read or written for short reads and writes. It may be that these are uncommon enough that you can ignore them and just handle clear errors.
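One approach that might work, although I haven't verified it, is to rely on uio_resid being decremented as data is actually copied; if that holds for zfs_read, the difference between the entry and return values of uio_resid is the amount actually read, short reads and errors included. A sketch (the self->uiop and self->startresid names are mine):
fbt::zfs_read:entry
{ self->uiop = args[1];
  self->startresid = args[1]->uio_resid;
}

fbt::zfs_read:return
/ self->uiop /
{ /* assumes uio_resid now holds only what was *not* transferred */
  @zreads = sum(self->startresid - self->uiop->uio_resid);
  self->uiop = 0;
  self->startresid = 0;
}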
(One of the pernicious issues with both of these mistakes is that much of the time they won't matter. If your IO almost never results in errors or short IO, completes rapidly, and you're only looking at IO volume over a long time, you're not going to be affected. I'm lucky (in a sense) that I've been dealing with a number of situations where all of this was not true so I've been smacked in the nose by this.)
2012-10-13
DTrace: counting, aggregates, and a nice undocumented printa() feature
Suppose that you want to keep a running count of something in a DTrace
script, for example the number of outstanding disk IO requests. The
obvious and most flexible way to do this is to use an ordinary DTrace
global variable or array; you can even find examples of this in
various DTrace scripts. Unfortunately, this only works in restricted
situations. If you use this generally, in a number of situations
you'll sooner or later experience lost counter events. My guess is that
DTrace deliberately doesn't attempt to do atomic updates of its global
variables (at least not with '+='), so DTrace on two CPUs can race
each other and overwrite values.
(Guess what I spent part of a day discovering recently.)
What you want to use instead is an aggregate that you increment and
decrement with sum(1) and sum(-1):
fbt::zio_create:return
{ @zactive = sum(1); tracked[args[1]] = 1; }
fbt::zio_done:entry
/ tracked[args[0]] /
{ @zactive = sum(-1); tracked[args[0]] = 0; }
This is fully reliable and will not miscount events. You need the funny
condition on zio_done because there are likely to be in-flight
ZIOs when your DTrace script starts; if you do not guard the counter
decrement this way, you will decrement your counter when they complete
without ever having incremented it, and your counter will go negative.
(I made this mistake too. I've had lots of learning experiences with DTrace recently.)
You can use a keyed aggregate to track per-thing events, for example tracking the in-flight ZIOs per pool. The unfortunate drawback to this is that as far as I know, there's no way to get or print just a single key from an aggregate. My example is that I would like to track in-flight IOs on a per-pool basis and then when a ZFS IO takes a long time dump a bunch of information about it including the count of active IOs for that pool only. But as far as I know there's no way to do this; I can only dump the active counts for all pools. This is one of the situations where a global array would be much better, because then I could pull out specific keys.
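For concreteness, a keyed version of the earlier counter might look like the sketch below. Getting the pool name via the zio's io_spa->spa_name chain is an assumption borrowed from other people's ZFS DTrace scripts rather than anything official, so treat it with appropriate suspicion.
fbt::zio_create:return
{ /* assumption: io_spa->spa_name is the pool name for this zio */
  @zactive[stringof(args[1]->io_spa->spa_name)] = sum(1);
  tracked[args[1]] = 1;
}

fbt::zio_done:entry
/ tracked[args[0]] /
{ @zactive[stringof(args[0]->io_spa->spa_name)] = sum(-1);
  tracked[args[0]] = 0;
}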
Another limitation of printing aggregates comes up here; there's no way to not print aggregate keys where the value is zero or to drop only keys with a zero value from an aggregate. At the most I'm only interested in pools with currently in-flight IOs, but I have to dump out a long list of pools that have none because DTrace gives me no choice.
(The sad bit is that I believe that all of the aggregate formatting and output is all in the user side of DTrace instead of in the kernel, so in theory you could build a new version that did all of this. But I suspect the official DTrace answer is 'use a program to postprocess the raw output of your DTrace script'. My reaction to that does not fit in the margins of this entry.)
In general, my opinion is that you want to use aggregates for any global variable that you'll do arithmetic on as your DTrace script is running unless you're completely sure that you have no multi-thread races. There are situations where you can be sure of this (for example, you are tracking something that is itself intrinsically single-threaded), but be careful.
Now we get to the very useful but completely undocumented feature of
printa(), which is handy when you're using aggregates as basically
thread-safe global variables. Namely, printa() can print multiple
aggregates at once. This looks like:
tick-10sec
{
[...]
/* aggregate used as simple global variable */
printa("Total ZIO: %@5d MB read %@5d MB write\n",
@zread, @zwrite);
[...]
/* keyed aggregate, by pool name */
printf("Active ZIO in pools:\n");
printa("%@6d in %-16s %@3d read / %@3d write\n",
@zactive, @zractive, @zwactive);
[...]
}
(For basic format string usage of printa(), see the sidebar at the end of the 2012-10-03 entry below.)
There are two special limitations of multi-aggregate printa()s for
keyed aggregates. First, the order is determined by the first aggregate
to be printed (and as usual, you only get smallest to largest). Second,
you can't usefully truncate aggregates to show only the top N (or the
bottom N) entries.
The problem with truncation is that a keyed multi-aggregate printa()
prints a record for each key that's in any aggregate (filling in zeros
if some of the other aggregates don't have data for that key). If you
truncate one aggregate to the top N entries, all the other aggregates
still have their full keys; if you truncate all aggregates to the top N
entries, they're highly likely to all have a different top N so you'll
still wind up with more than N keys (and a lot of zero fields being
printed).
(In other words, keyed multi-aggregate printa() is like a SQL full
outer join.)
PS: dealing with this made me realize why printa() with a format
doesn't force newlines; it allows you to intermix printf() for
formatting regular variables and printa() to get your aggregate global
variables out, all on a single printed line.
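As a small sketch of that, something like this builds one output line out of a printf() and two printa()s (reusing the @zread and @zwrite aggregates from the earlier example):
tick-10sec
{ printf("Total ZIO:");
  printa(" %@5d MB read", @zread);
  printa(" %@5d MB write\n", @zwrite);
}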
2012-10-03
DTrace: notes on quantization and aggregates
Following on the same theme as yesterday, here's
some things about DTrace's aggregates and quantization that I want to
remember. As a bonus I'm going to throw in random bits and pieces about
using printa().
(I should mention that all of this is on Solaris 10 update 8. Yes, this is almost fossilized by now. No, we're not interested in upgrading.)
As I mentioned yesterday, I've come around to the view that you should
never use the anonymous aggregate @[...]. Using a named aggregate is
just as easy and it avoids all sorts of really annoying errors when you
later try to reuse @[...] for a second quick aggregation in another
section of the script with different keys or quantization. (I've done
that.)
The standard D style is to use an aggregate in only one place or for
only one thing. However, this is not actually required. As far as I can
tell, you can stuff anything in an aggregate provided that it uses the
same number and type of keys and the same aggregation function; for
lquantize(), the sizing parameters must also match. What this turns
aggregates into is a way of grouping things together (or by deliberately
changing aggregate names, they become a way of splitting things apart).
(It's useful to know that things like probefunc are considered
strings, so literal strings as keys are compatible with them.)
In particular, if you're aggregating and quantizing times (durations, latencies, whatever) it's perfectly fine to have different things in different time units. You might measure some things in milliseconds, some things in microseconds, and some things in quarter milliseconds for an intermediate resolution. They can all go in the same aggregate, even. You'll want to make sure that the key includes some indirection of the (time) unit scale or you'll probably get very confused.
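As an illustrative sketch of this (the probe and the @times name are just examples; fop_read is convenient because it's synchronous), the same aggregate can hold both microsecond and millisecond views with the unit embedded in the key:
fbt::fop_read:entry
{ self->ts = timestamp; }

fbt::fop_read:return
/ self->ts > 0 /
{ /* same keys types and function, so these can share one aggregate */
  @times[probefunc, "us"] = quantize((timestamp - self->ts) / 1000);
  @times[probefunc, "ms"] = quantize((timestamp - self->ts) / 1000000);
  self->ts = 0;
}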
Similarly it's perfectly fine to repeatedly quantize the same data in
different ways in different aggregates, so you're taking both a general
quantize() overview and then using lquantize() to examine specific
sections in detail. The one drawback to doing this the straightforward
way is that all of the aggregations will include all of the data. This
can make DTrace's ASCII bar graphs kind of useless because your
actual data will be crushed by the out-of-range values on either
side. Getting around this means recording only selected data in the
aggregate, which means using a separate clause with a predicate so that
it only applies to values inside your range of interest.
To give you an example:
fbt::rfs3_*:entry
{ self->ts = timestamp;
@counts[probefunc] = count();
}
fbt::rfs3_*:return
/ self->ts > 0 /
{ this->delta = (timestamp-self->ts)/1000000; }
fbt::rfs3_*:return
/ self->ts > 0 && this->delta > 100 /
{ @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50); }
fbt::rfs3_*:return
/ self->ts > 0 /
{ @dist[probefunc, "ms"] = quantize(this->delta);
self->ts = 0;
}
(I'm using what I suspect is an obscure feature here, namely that
clause-local this-> variables keep their value across clauses for the
same probe firing, to save some space and verbosity in this example.
It's not necessarily good style.
Also notice how I've embedded the time unit in the aggregation key so
that it'll normally show up when the aggregate is printed.)
Here we're both getting a general distribution over all NFS v3 server
operations (more or less) and also getting a look at roughly how those
between 100 msec and 500 msec break down. If we included all data in the
lquantize() version it would be very hard to see any pattern in the
100-500 msec breakdown because almost all of the data points would be
under 100 msec (well, we hope).
(PS: that this actually reports results on our systems is one of the reasons that I am now digging into this stuff.)
Finally, even with the command line switch or pragma to quiet dtrace
down, every aggregation with data will be printed, either with a
printa() by you or by DTrace itself if you don't print it explicitly.
You can probably get out of this with an END clause that clear()'s
aggregations that you don't want to be displayed. I believe that
individual entries in an aggregate are printed in order of increasing
numbers. This gets a little odd for quantizations; there I suspect but
don't know for sure that the order is determined by something like the
sum of all values submitted to the quantization. It's not particularly
useful in any case.
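One sketch of this, using trunc() instead (with no count argument it discards an aggregation's contents entirely, so an emptied aggregation shouldn't get printed at exit); here I'm reusing the @counts and @dist aggregates from the example above:
END
{ trunc(@counts);
  trunc(@dist);
}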
Sidebar: a bit on printa()
The basic version of printa() with a format string is:
END
{ printf("\n");
printa("function %-20s %@10d\n", @counts);
printf("\n");
printa("function %s(), time in %s:%@d\n", @dist);
printf("\n");
printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
}
The %@... bit is the format of the aggregate, and can occur anywhere
in the format string; in practice the only useful thing to do to it
is to give a field width (as this does for @counts). The other %
formatting is for each element of the key, which are used in order but
don't have to be completely used up.
The printf("\n")s are normally necessary because printa() without a
formatting string adds an extra newline before and after printing the
entire aggregate. With a format string, it doesn't. The newline in the
format string simply ends each per-key line of the aggregate's output.
(This behavior of printa() is inconsistent, because if you just
let DTrace dump the aggregates out it doesn't add the newline.)
2012-10-02
Some notes on getting durations in DTrace
One of the things you commonly wind up doing in DTrace is computing durations, how long things take; this gives you latencies, among other useful things. Since we have a significant performance problem with our mail spool right now, durations and latencies have very much been my focus recently. In the process of trying to apply DTrace to the problem I've wound up with some things that I want to remember about this.
What I've seen as the standard or classic DTrace pattern for getting durations is to timestamp the start of an event and then calculate the duration at the end of it:
fbt::fop_read:entry
{ self->ts = timestamp; }
fbt::fop_read:return
{ @read[probefunc] = quantize((timestamp - self->ts) / 1000000); }
(The timestamp DTrace variable is in nanoseconds, so here we scale
it to milliseconds. Also I've become convinced that you never want to
use the anonymous '@[...]' aggregate if you're writing an actual D
script, even if you have only one aggregate.)
This works (in this case), but there's some things that can be better.
First, you generally want to make the clause that calculates the
duration conditional on the initial timestamp being set, especially if
the start and the end points are different functions or probe spots (the
usual symptom of failing to do this is a few extremely large values that
result from the calculation being 'timestamp - 0'). You also want to
explicitly clear the timestamp after calculating the duration; this can
help DTrace use less memory and avoids accidents (again, especially if
the start and the end point are different). So:
fbt::fop_read:return
/ self->ts > 0 /
{ @[probefunc] = quantize(....);
self->ts = 0; }
Sometimes you only want to track some events, not all of them. For example, you might only be interested in low-level ZFS IO stats for read IO:
fbt::zio_create:return
/ args[1]->io_type == 1 /
{ .... }
Once you've made the start conditional, you don't need to also check
the condition at the end point; you can simply be conditional on the
timestamp being set. If the timestamp is set you know that you passed
the condition at the start. If you do have additional conditions on
the end point, you should make sure to unconditionally clear the
timestamp. Since DTrace doesn't have an if or the equivalent, you get
to do this the awkward way with a separate clause:
fbt::fop_read:return
/ self->ts > 0 && (timestamp - self->ts) > 250000 /
{ .... }
fbt::fop_read:return
{ self->ts = 0; }
Don't flip the order of these two. DTrace processes clauses in the order you've written them, so if you had them the other way around you'd clear the start timestamp before calculating anything with it.
(On a side note, it would be really great if quantize() and
lquantize() could optionally ignore all values outside a specified
range.)
In the simple case you can store the start timestamp (or timestamps) on
the special self thread-local variable. However I believe that this is
only useful for synchronous operations. If you want to track the latency
of something that involves asynchronous operations, callbacks, and
similar things you need to find a unique key and then use an associative
array to store the start timestamp. So:
fbt::zio_create:return
/ args[1]->io_type /
{ ziots[args[1]] = timestamp; }
fbt::zio_done:entry
/ ziots[args[0]] > 0 /
{ ....;
ziots[args[0]] = 0; }
If you're doing this it's especially important to clear the timestamp by
assigning zero to it. Not only do you avoid accidents if the same memory
address (or other key value) is later reused for another operation
(which is actually quite likely in some cases, especially for the
addresses of structures), but it's also the only way that now-finished
things will get cleaned out of the ziots associative array.
(I'm sure that all of these are well known in the active DTrace community and probably are documented in some best practices or common tricks pages. I just haven't been sufficiently exposed to the DTrace world to have found this stuff yet.)
Sidebar: why structure address reuse is quite likely
Solaris does most of its allocation through slab allocation, generally with a structure-specific slab for any commonly used sort of object. It also has caches for many sorts of objects that are allocated and deallocated frequently enough to justify it. Thus, if you're tracking any sort of commonly used object it's very likely that when a specific instance is freed it'll be reused in short order. Either it will be near the top of the applicable 'recently freed' cache or it will get put back in its slab pool and then shortly allocated from the pool again.