2013-01-04
DTrace's stable providers are not good enough
In a comment on my entry on why DTrace doesn't attract people to Solaris that often, Brendan Gregg drew a distinction between two levels of using DTrace:
In terms of difficulty, using the DTrace providers is a little like:
- fbt provider: writing a simple kernel patch
- stable providers: writing a shell script
Sysadmins should be able to handle stable providers (eg, io, proc, sched, vminfo). They are documented - you don't need to reach for kernel code. Programming them may be no more difficult than shell scripting.
(For 'fbt provider', you should actually read 'any unstable provider' (my DTrace scripts use sdt as well, for example). And I think that using unstable providers is easier than writing a kernel patch in that you can get far with a basic ability to read C code, although my perspective may be skewed.)
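To make the two levels concrete, here is a minimal illustrative sketch of each (not taken from any real script). The first clause uses the documented io provider and its documented arguments. The second uses fbt to hook zfs_read, a kernel function in the ZFS code; that the function exists under this name on any particular release is an assumption, which is exactly the catch with unstable providers.

```d
/* Stable: bytes of disk I/O started, summed per device (documented io provider). */
io:::start
{
	@bytes[args[1]->dev_statname] = sum(args[0]->b_bcount);
}

/*
 * Unstable: count ZFS read calls per program via fbt.
 * zfs_read is a kernel function name, assumed to be present on this release;
 * if it is renamed or the zfs module is not loaded, the probe will not match.
 */
fbt::zfs_read:entry
{
	@reads[execname] = count();
}
```

Saved to a file, this would be run as root with 'dtrace -s' on that file; the aggregations print when you interrupt it with Ctrl-C.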
I'll start by saying that my experience has given me some strong biases here and it's possible that I'm missing a world of DTrace usage. Also, all of this is from the perspective of someone using Solaris 10 update 8; some of this has changed in Solaris 11 and perhaps with Illumos. With that said, though:
I agree with Brendan's characterization, but the problem is that DTrace's stable providers are not good enough. For a glaring example of this, getting even relatively basic information about NFS server activity requires using unstable providers (although I think this has been fixed in Solaris 11). The cold, hard truth (as I wrote about a bit when I talked about why we hadn't taken to DTrace) is that the Solaris developers never attempted to develop even a vaguely complete set of stable providers. Almost everything really useful is unstable and thus undocumented, which is why writing useful DTrace scripts takes system programmers and kernel source code.
At one level, my personal experience is not necessarily representative; we use our Solaris machines as fileservers and have almost no programs running on them locally. If I had to diagnose many local programs I might value the stable providers (my impression is that many of them are at the kernel to userland interface level), but as it stands I don't think I've ever directly used any stable provider and the information from other people's DTrace scripts using them was at most vaguely useful.
One major gap is that almost no significant subsystem inside Solaris has stable providers (although Oracle seems to be changing this in Solaris 11, based on documentation). I've mentioned the NFS server, and I think the same applies to the NFS client, but ZFS is another large example. There are a lot of important and interesting ZFS activities, all of which can only be observed through unstable providers. Want to watch device multipathing activity to see if everything is fine? Unstable providers again. You get the idea.
(In fact it's not unusual for kstats to provide better visibility into a subsystem than DTrace does with stable providers.)
One effect of unstable providers being so necessary to solve real problems is that the 'shell scripting' level of DTrace is not deeply useful. Sure, you can put together something from documented interfaces and the DTrace manual (if someone hasn't already written the general script and put it on the net), but what you can write probably won't do you much good or tell you things that are very interesting.
Another problem is that relying on unstable providers makes even canned DTrace scripts harder for people to use, as I ran into with my DTrace scripts. Some kernel data structures changed a bit between S10U8 (what we use) and later versions, which means that my scripts don't work as-is for anyone on a later version and need to be edited at least a bit. Requiring sysadmins to make magic edits to scripts before they can use them is not exactly encouraging people to like and use DTrace.
(DTrace doesn't provide any mechanisms to make this easier, although once again it easily could if it actually cared about this issue. But it doesn't, because this is what sysadmins deserve when they use unstable providers, right?)