Wandering Thoughts archives

2012-10-31

Our DTrace scripts for NFS server, ZFS, and iSCSI initiator monitoring

As a result of recent events, I've built up a collection of DTrace scripts for monitoring and reporting on our fileserver environment, where we use NFS v3 on top of ZFS on top of iSCSI. Since I grumbled earlier about the lack of easily findable DTrace scripts for this, I've made our scripts public on Github as siebenmann/cks-dtrace (where you can read more details about what's there). They're written for Solaris 10 update 8 (plus some patches) and do play around with kernel data structures.

These scripts are somewhat specific to our environment and contain various local assumptions (some of them commented). They're also not the best DTrace code possible, and in fact they contain several generations of my DTrace code as I steadily learned more about what I was doing (if I was very keen, I would go back to rewrite the older code in the current best style I know).

In addition to their straightforward use, these scripts may serve as a useful example of both how to do various things with DTrace and how to extract various bits of information from the Solaris kernel. In an ideal world there would be a DTrace wiki with information of the form 'given an X, here's how you get a Y' (such as 'given a vnode, here's how you get its ZFS pool'), but as far as I know right now you have to find various little tricks in various people's DTrace scripts.

(I'd be overjoyed to be wrong about this.)

In the 'giving proper credit' department: I didn't come up with these scripts in a void, using only my ingenuity; instead, I stand on the shoulders of many giants. I would not have gotten anywhere near as far as I have without taking all sorts of DTrace tricks and clever ways of extracting various bits of kernel information from other people's DTrace scripts, often ZFS-related scripts. Useful references that I have in my notes include Brendan Gregg (and various scripts in the DTrace book's chapter 5) and Richard Elling's zilstat.

(And as always, all of these scripts would have been basically impossible without the OpenSolaris kernel code. That lack of kernel source code cripples DTrace is one reason I remain quite angry with Oracle's decision to close Solaris 11 source. The more I use DTrace, the more convinced I am that we'll never move to Solaris 11 (unless Oracle has a total change of heart).)

ZFSDTraceScripts written at 22:59:57

2012-10-29

How ZFS file prefetching seems to work

Since I was just digging into this, I want to write down what I've learned before I forget it. I will give you the punchline right up front: if you want to do IO that avoids ZFS prefetching as much as possible, you need to use randomized IO. It also turns out that ZFS prefetching has what I consider some flaws that can cause IO heartburn.

(You should use a fixed seed for your randomization function and watch ZFS statistics to make sure that you really are avoiding the prefetcher. Change the seed as necessary.)

So here is what I've deduced from reading the ZFS code in OpenSolaris and from some experimentation with DTrace instrumentation. First off, ZFS read prefetching operates on ZFS blocks, which are generally 128 KB (the recordsize ZFS property), and uses what it calls 'streams'. Each stream can recognize one pattern of sequential IO: linear forward reads, linear backwards reads, and forward or backwards reads with a 'stride' (where you skip forward or backwards N blocks every time you do a read). Each file can have up to eight separate streams associated with it. How streams are created and deleted is complex and I don't entirely understand it, but I believe the basic approach is that ZFS attempts to match your read IO with an existing stream and, if it can't, it tries to create a new stream (there is some tricky code to discover strided reads).
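
If you want to see whether the prefetch code is actually engaging with your IO pattern (which is also the easiest way to check that randomized IO really is avoiding it), a minimal sketch, assuming that the prefetch entry points in your kernel are still named dmu_zfetch*, is simply to count calls into that code:

/* count activity in the prefetch code so we can see if it's engaged */
fbt::dmu_zfetch*:entry
{ @zfetch[probefunc] = count(); }

/* report and reset every 10 seconds */
tick-10sec
{ printa(@zfetch); trunc(@zfetch); }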

When a stream matches an IO pattern, it triggers a prefetch read; this fetches some number of reads ahead of what you're reading now. The more times the stream matches, the more it reads ahead; how much it reads ahead starts at one read and more or less doubles every time the stream is used, up to a maximum size. The prefetch is for the size of read you're normally doing (so if you read one block, the prefetch is for one block), but it may stride forward to cover multiple expected future reads. For example, if you are reading one block every 10 blocks, after a while it will be fetching one block 10 blocks ahead, one block 20 blocks ahead, one block 30 blocks ahead, and so on.

What makes it basically impossible to avoid this prefetching with a sufficiently cleverly chosen and unpredictable pattern is how streams get recycled. Normally you might think that new IOs and new (attempted) streams would push out old streams, so if you just flood ZFS with a sufficiently long sequence of unpatterned IO you would be fine. It turns out that this is not how ZFS does it; streams only get recycled based on time since last use. A stream must be unused for at least 2 seconds before it will be tossed in favour of a new attempt to create a stream. So if you are doing a bunch of unpredictable IO on a file, your first eight or so IOs will create eight initial streams, which will then sit there for two seconds attempting to match themselves up with some IO you're doing. Only after those 2 seconds will they start to be tossed in favour of new ones (and then this cycle repeats for another two seconds or so, assuming that you can do completely unpatterned IO).

Given that two seconds covers a significant amount of IO and, worse, an unpredictable amount of it, this is why I say that ZFS prefetching can only really be defeated by randomized IO.

This prefetching approach has what I consider to be several flaws. The largest single flaw is that ZFS prefetching does not check to see that its work was useful. What it cares about is that it matched your IO pattern; it doesn't notice if the data it prefetched for you expired from the ARC unread. Under memory pressure this combines explosively with two other ZFS prefetching features. First, prefetch streams seem to be valid forever as long as the znode for their file stays in memory; they are more or less never declared too old to be useful. Second, as previously mentioned, a successful match against your IO pattern triggers more and more prefetching, even if this prefetched data will just expire unread from the ARC. Now that I understand the ZFS prefetching mechanisms, I can see why our prefetching problems happened; sequential reads from a lot of files at once is a worst case for ZFS prefetching.

This shows how vital it is for any prefetch mechanism to have end to end feedback. Prefetch should not be considered successful until the user-level code itself has read the prefetched data from cache. Merely predicting the user-level IO pattern is not good enough except in favorable situations.

(For example, as far as I can see ZFS will happily prefetch the maximum amount for a file that you are reading through sequentially at a rate of one read every five minutes even in the presence of a huge IO load that evicts the prefetched data from the ARC long before you issue the next read.)

Sidebar: the tunables and code involved in this

All of the prefetching code is in uts/common/fs/zfs/dmu_zfetch.c in your handy OpenSolaris or Illumos source repository. In OpenSolaris, the available tunables are:

zfs_prefetch_disable    Disable prefetching.
zfetch_max_streams      Maximum number of streams per file; defaults to 8.
zfetch_block_cap        The maximum number of blocks that we can ever prefetch when a stream triggers prefetch. These blocks may be split up between multiple IO positions. Defaults to 256 blocks.
zfetch_array_rd_sz      Do not do prefetching for reads larger than this size (in bytes). Defaults to 1 MB.
zfetch_min_sec_reap     Minimum seconds before an inactive stream can be reclaimed; defaults to 2 seconds.

All of these tunables are global ones; they affect all ZFS filesystems and all ZFS pools.
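
As a minimal sketch of checking the current values of these (assuming the kernel globals exist under these names on your particular build; the casts paper over their exact integer types), you can read them with DTrace's backquote syntax for kernel variables:

/* dump the current prefetch tunables; names assumed from dmu_zfetch.c */
BEGIN
{
  printf("zfs_prefetch_disable = %d\n", (int)`zfs_prefetch_disable);
  printf("zfetch_max_streams   = %d\n", (int)`zfetch_max_streams);
  printf("zfetch_block_cap     = %d\n", (int)`zfetch_block_cap);
  printf("zfetch_min_sec_reap  = %d\n", (int)`zfetch_min_sec_reap);
  printf("zfetch_array_rd_sz   = %d\n", (int)`zfetch_array_rd_sz);
  exit(0);
}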

ZFSHowPrefetching written at 00:36:05

2012-10-14

DTrace: two mistakes you can make in DTrace scripts

There are certain mistakes that you can make when writing DTrace scripts that are worth pointing out, partly because I've made them myself and partly because I've seen them in other people's scripts on the net. To serve as an example of this, suppose that you want to track ZFS read and write volume, so you write the following:

fbt::zfs_read:entry
{ @zreads = sum(args[1]->uio_resid); }

fbt::zfs_write:entry
{ @zwrites = sum(args[1]->uio_resid); }

This is perfectly good DTrace code and it works fine most of the time, but it has two significant issues.

The first is what I'll call the initiated versus completed problem. Because we are tracing on entry to zfs_read and zfs_write, we're counting IO activity when it's initiated, not when it's performed (whatever that means) or when it's completed. If IO completes rapidly and we're using a long time base, this doesn't matter; but if either of those assumptions doesn't hold, this may give you misleading or confusing numbers, ones that do not match up with the reality experienced by applications.

When to account for activity is in general a hard problem with no clear generic answers; how you want to do it can depend very much on what questions you want answers to. But my default assumption is to account for things on completion, especially for IO, because that's when a user-level program can go on; it's also clearly counting the past instead of anticipating the future. You may have different answers, but you definitely want to think about the issue. And if you pick 'counted when initiated', you probably should write some sort of comment about why this makes sense in your particular case.

The second problem is that this blithely ignores the possibility of errors of all sorts. Both zfs_read and zfs_write can be asked to do impossible things and can run into problems; in both cases, some or all of the IO will not get done. Accounting for things when you enter a routine necessarily means ignoring errors, but you can make this mistake even when you trace returns (I speak from personal experience here; fortunately it gave me crazy numbers and some other problems until I realized what was going on).
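
As a hedged sketch of what accounting on completion with a basic error check might look like for the read side (this assumes that zfs_read's return value is an errno-style code that is zero on success; self->resid is just an illustrative name), you could save the requested size at entry and only sum it on a successful return. Note that this still records the requested size rather than the amount actually transferred, so it is not a complete answer:

fbt::zfs_read:entry
{ self->resid = args[1]->uio_resid; }

/* only count reads that returned without an error */
fbt::zfs_read:return
/ self->resid && args[1] == 0 /
{ @zreads = sum(self->resid); }

/* always clear the saved size */
fbt::zfs_read:return
{ self->resid = 0; }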

(Note that you don't need to run into actual errors to get incorrect results with reading, because short reads can easily happen; consider issuing a large read when you're almost at the end of a file.)

Unfortunately, knowing what sort of errors are possible and how to compensate for them generally takes reading (Open)Solaris kernel source and probably some experimentation. I have to admit that right now I don't know how to do fully correct IO size accounting for these routines, for example, because I'm not sure how to get the actual amount read or written for short reads and writes. It may be that these are uncommon enough that you can ignore them and just handle clear errors.

(One of the pernicious issues with both of these mistakes is that much of the time they won't matter. If your IO almost never results in errors or short IO, completes rapidly, and you're only looking at IO volume over a long time, you're not going to be affected. I'm lucky (in a sense) that I've been dealing with a number of situations where all of this was not true so I've been smacked in the nose by this.)

DTraceTwoScriptGotchas written at 02:16:58

2012-10-13

DTrace: counting, aggregates, and a nice undocumented printa() feature

Suppose that you want to keep a running count of something in a DTrace script, for example the number of outstanding disk IO requests. The obvious and most flexible way to do this is to use an ordinary DTrace global variable or array; you can even find examples of this in various DTrace scripts. Unfortunately, this only works in restricted situations. If you use this generally, in a number of situations you'll sooner or later experience lost counter events. My guess is that DTrace deliberately doesn't attempt to do atomic updates of its global variables (at least not with '+='), so DTrace on two CPUs can race each other and overwrite values.

(Guess what I spent part of a day discovering recently.)
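
To make the race concrete, here is a minimal sketch of the pattern that will eventually lose updates (zactive is a hypothetical global counter; two CPUs can both read its old value and one increment or decrement silently vanishes):

/* racy: don't do this for counters updated from multiple CPUs */
fbt::zio_create:return
{ zactive += 1; }

fbt::zio_done:entry
{ zactive -= 1; }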

What you want to use instead is an aggregate that you increment and decrement with sum(1) and sum(-1):

fbt::zio_create:return
{ @zactive = sum(1);  tracked[args[1]] = 1; }

fbt::zio_done:entry
/ tracked[args[0]] /
{ @zactive = sum(-1); tracked[args[0]] = 0; }

This is fully reliable and will not miscount events. You need the funny condition on zio_done because there are likely to be in-flight ZIOs when your DTrace script starts; if you do not guard the counter decrement this way, you will decrement your counter when they complete without ever having incremented it, and your counter will go negative.

(I made this mistake too. I've had lots of learning experiences with DTrace recently.)

You can use a keyed aggregate to track per-thing events, for example tracking the in-flight ZIOs per pool. The unfortunate drawback to this is that as far as I know, there's no way to get or print just a single key from an aggregate. My example is that I would like to track in-flight IOs on a per-pool basis and then, when a ZFS IO takes a long time, dump a bunch of information about it, including the count of active IOs for that pool only. But as far as I know there's no way to do this; I can only dump the active counts for all pools. This is one of the situations where a global array would be much better, because then I could pull out specific keys.
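
A hedged sketch of that per-pool tracking, assuming that a zio_t's io_spa pointer and the spa's spa_name field look the way they do in the OpenSolaris source I've been reading:

fbt::zio_create:return
{ @zactive[stringof(args[1]->io_spa->spa_name)] = sum(1);
  tracked[args[1]] = 1; }

fbt::zio_done:entry
/ tracked[args[0]] /
{ @zactive[stringof(args[0]->io_spa->spa_name)] = sum(-1);
  tracked[args[0]] = 0; }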

Another limitation of printing aggregates comes up here; there's no way to not print aggregate keys where the value is zero or to drop only keys with a zero value from an aggregate. At the most I'm only interested in pools with currently in-flight IOs, but I have to dump out a long list of pools that have none because DTrace gives me no choice.

(The sad bit is that I believe that all of the aggregate formatting and output is in the user side of DTrace instead of in the kernel, so in theory you could build a new version that did all of this. But I suspect the official DTrace answer is 'use a program to postprocess the raw output of your DTrace script'. My reaction to that does not fit in the margins of this entry.)

In general, my opinion is that you want to use aggregates for any global variable that you'll do arithmetic on as your DTrace script is running unless you're completely sure that you have no multi-thread races. There are situations where you can be sure of this (for example, you are tracking something that is itself intrinsically single-threaded), but be careful.

Now we get to the very useful but completely undocumented feature of printa(), which is handy when you're using aggregates as basically thread-safe global variables. Namely, printa() can print multiple aggregates at once. This looks like:

tick-10sec
{ 
  [...]
  /* aggregate used as simple global variable */
  printa("Total ZIO: %@5d MB read %@5d MB write\n",
         @zread, @zwrite);
  [...]
  /* keyed aggregate, by pool name */
  printf("Active ZIO in pools:\n");
  printa("%@6d in %-16s  %@3d read / %@3d write\n",
         @zactive, @zractive, @zwactive);
  [...]
}

(For basic format string usage of printa(), see the end of here.)

There are two special limitations of multi-aggregate printa()s for keyed aggregates. First, the order is determined by the first aggregate to be printed (and as usual, you only get smallest to largest). Second, you can't usefully truncate aggregates to show only the top N (or the bottom N) entries.

The problem with truncation is that a keyed multi-aggregate printa() prints a record for each key that's in any aggregate (filling in zeros if some of the other aggregates don't have data for that key). If you truncate one aggregate to the top N entries, all the other aggregates still have their full keys; if you truncate all aggregates to the top N entries, they're highly likely to all have a different top N so you'll still wind up with more than N keys (and a lot of zero fields being printed).

(In other words, keyed multi-aggregate printa() is like a SQL full outer join.)

PS: dealing with this made me realize why printa() with a format doesn't force newlines; it allows you to intermix printf() for formatting regular variables and printa() to get your aggregate global variables out, all on a single printed line.

DTraceCounterNotes written at 02:39:29

2012-10-03

DTrace: notes on quantization and aggregates

Following on the same theme as yesterday, here are some things about DTrace's aggregates and quantization that I want to remember. As a bonus I'm going to throw in random bits and pieces about using printa().

(I should mention that all of this is on Solaris 10 update 8. Yes, this is almost fossilized by now. No, we're not interested in upgrading.)

As I mentioned yesterday, I've come around to the view that you should never use the anonymous aggregate @[...]. Using a named aggregate is just as easy and it avoids all sorts of really annoying errors when you later try to reuse @[...] for a second quick aggregation in another section of the script with different keys or quantization. (I've done that.)

The standard D style is to use an aggregate in only one place or for only one thing. However, this is not actually required. As far as I can tell, you can stuff anything in an aggregate provided that it uses the same number and type of keys and the same aggregation function; for lquantize(), the sizing parameters must also match. What this turns aggregates into is a way of grouping things together (or by deliberately changing aggregate names, they become a way of splitting things apart).

(It's useful to know that things like probefunc are considered strings, so literal strings as keys are compatible with them.)

In particular, if you're aggregating and quantizing times (durations, latencies, whatever) it's perfectly fine to have different things in different time units. You might measure some things in milliseconds, some things in microseconds, and some things in quarter milliseconds for an intermediate resolution. They can all go in the same aggregate, even. You'll want to make sure that the key includes some indirection of the (time) unit scale or you'll probably get very confused.
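
For instance, a minimal sketch of mixing units in one aggregate, here timing fop_read in milliseconds and fop_fsync in microseconds purely as an illustration:

fbt::fop_read:entry,
fbt::fop_fsync:entry
{ self->ts = timestamp; }

/* reads in milliseconds */
fbt::fop_read:return
/ self->ts /
{ @times[probefunc, "ms"] = quantize((timestamp - self->ts) / 1000000);
  self->ts = 0; }

/* fsyncs in microseconds, in the same aggregate */
fbt::fop_fsync:return
/ self->ts /
{ @times[probefunc, "us"] = quantize((timestamp - self->ts) / 1000);
  self->ts = 0; }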

Similarly it's perfectly fine to repeatedly quantize the same data in different ways in different aggregates, so you're getting both a general quantize() overview and an lquantize() look that examines specific sections in detail. The one drawback to doing this the straightforward way is that all of the aggregations will include all of the data. This can make DTrace's ASCII bar graphs kind of useless because your actual data will be crushed by the out-of-range values on either side. Getting around this means recording only selected data in the aggregate, which means using a separate clause so you can make it conditional, so that it only applies to values inside your range of interest.

To give you an example:

fbt::rfs3_*:entry
{ self->ts = timestamp; 
  @counts[probefunc] = count();
}

fbt::rfs3_*:return
/ self->ts > 0 /
{ this->delta = (timestamp-self->ts)/1000000; }

fbt::rfs3_*:return
/ self->ts > 0 && this->delta > 100 /
{ @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50); }

fbt::rfs3_*:return
/ self->ts > 0 /
{ @dist[probefunc, "ms"] = quantize(this->delta);
  self->ts = 0;
}

(I'm using what I suspect is an obscure feature of this to save some space and verbosity in this example. It's not necessarily good style. Also notice how I've embedded the time unit in the aggregation key so that it'll normally show up when the aggregate is printed.)

Here we're both getting a general distribution over all NFS v3 server operations (more or less) and also getting a look at roughly how those between 100 msec and 500 msec break down. If we included all data in the lquantize() version it would be very hard to see any pattern in the 100-500 msec breakdown because almost all of the data points would be under 100 msec (well, we hope).

(PS: that this actually reports results on our systems is one of the reasons that I am now digging into this stuff.)

Finally, even with the command line switch or pragma to quiet dtrace down, every aggregation with data will be printed, either with a printa() by you or by DTrace itself if you don't print it explicitly. You can probably get out of this with an END clause that clear()'s aggregations that you don't want to be displayed. I believe that individual entries in an aggregate are printed in order of increasing numbers. This gets a little odd for quantizations; there I suspect but don't know for sure that the order is determined by something like the sum of all values submitted to the quantization. It's not particularly useful in any case.
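
A minimal sketch of suppressing an aggregate's output this way; I believe trunc() with no count argument, which throws away all of an aggregate's entries, is more thorough than clear(), which only zeroes the values but keeps the keys (here @uninteresting is a hypothetical aggregate you don't want printed):

END
{ /* throw the aggregate away entirely so it isn't printed at exit */
  trunc(@uninteresting);
}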

Sidebar: a bit on printa()

The basic version of printa() with a format string is:

END
{ printf("\n");
  printa("function %-20s  %@10d\n", @counts);
  printf("\n");
  printa("function %s(), time in %s:%@d\n", @dist);
  printf("\n");
  printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
}

The %@... bit is the format of the aggregate, and can occur anywhere in the format string; in practice the only useful thing to do to it is to give a field width (as this does for @counts). The other % formats are for the elements of the key, which are used in order but don't have to be completely used up.

The printf("\n")s are normally necessary because printa() without a formatting string adds an extra newline before and after printing the entire aggregate. With a format string, it doesn't. The newline in the format strings simply separates the individual entries of the aggregate from each other.

(This behavior of printa() is inconsistent, because if you just let DTrace dump the aggregates out it doesn't add the newline.)

DTraceQuantizationNotes written at 01:40:20

2012-10-02

Some notes on getting durations in DTrace

One of the things you commonly wind up doing in DTrace is computing durations, how long things take; this gives you latencies, among other useful things. Since we have a significant performance problem with our mail spool right now, durations and latencies have very much been my focus recently. In the process of trying to apply DTrace to the problem I've wound up with some things that I want to remember about this.

What I've seen as the standard or classic DTrace pattern for getting durations is to timestamp the start of an event and then calculate the duration at the end of it:

fbt::fop_read:entry
{ self->ts = timestamp; }

fbt::fop_read:return
{ @read[probefunc] = quantize((timestamp - self->ts) / 1000000); }

(The timestamp DTrace variable is in nanoseconds, so here we scale it to milliseconds. Also I've become convinced that you never want to use the anonymous '@[...]' aggregate if you're writing an actual D script, even if you have only one aggregate.)

This works (in this case), but there are some things that can be better. First, you generally want to make the clause that calculates the duration conditional on the initial timestamp being set, especially if the start and the end points are different functions or probe spots (the usual symptom of failing to do this is a few extremely large values that result from the calculation being 'timestamp - 0'). You also want to explicitly clear the timestamp after calculating the duration; this can help DTrace use less memory and avoids accidents (again, especially if the start and the end point are different). So:

fbt::fop_read:return
/ self->ts > 0 /
{  @read[probefunc] = quantize(....);
   self->ts = 0; }

Sometimes you only want to track some events, not all of them. For example, you might only be interested in low-level ZFS IO stats for read IO:

fbt::zio_create:return
/ args[1]->io_type == 1 /
{ .... }

Once you've made the start conditional, you don't need to also check the condition at the end point; you can simply be conditional on the timestamp being set. If the timestamp is set you know that you passed the condition at the start. If you do have additional conditions on the end point, you should make sure to unconditionally clear the timestamp. Since DTrace doesn't have an if or the equivalent, you get to do this the awkward way with a separate clause:

fbt::fop_read:return
/ self->ts > 0 && (timestamp - self->ts) > 250000 /
{ .... }

fbt::fop_read:return
{ self->ts = 0; }

Don't flip the order of these two. DTrace processes clauses in the order you've written them, so if you had them the other way around you'd clear the start timestamp before calculating anything with it.

(On a side note, it would be really great if quantize() and lquantize() could optionally ignore all values outside a specified range.)

In the simple case you can store the start timestamp (or timestamps) on the special self thread-local variable. However I believe that this is only useful for synchronous operations. If you want to track the latency of something that involves asynchronous operations, callbacks, and similar things you need to find a unique key and then use an associative array to store the start timestamp. So:

fbt::zio_create:return
/ args[1]->io_type /
{ ziots[args[1]] = timestamp; }

fbt::zio_done:entry
/ ziots[args[0]] > 0 /
{ ....;
  ziots[args[0]] = 0; }

If you're doing this it's especially important to clear the timestamp by assigning zero to it. Not only do you avoid accidents if the same memory address (or other key value) is later reused for another operation (which is actually quite likely in some cases, especially for the addresses of structures), but it's also the only way that now-finished things will get cleaned out of the ziots associative array.

(I'm sure that all of these are well known in the active DTrace community and probably are documented in some best practices or common tricks pages. I just haven't been sufficiently exposed to the DTrace world to have found this stuff yet.)

Sidebar: why structure address reuse is quite likely

Solaris does most of its allocation through slab allocation, generally with a structure-specific slab for any commonly used sort of object. It also has caches for many sorts of objects that are allocated and deallocated frequently enough to justify it. Thus, if you're tracking any sort of commonly used object it's very likely that when a specific instance is freed it'll be reused in short order. Either it will be near the top of the applicable 'recently freed' cache or it will get put back in its slab pool and then shortly allocated from the pool again.

DTraceDurationNotes written at 01:17:20

