2016-07-28
A bit about what we use DTrace for (and when)
Earlier this year, Bryan Cantrill kind of issued a call for people to talk about their DTrace success stories. I do want to write up a blog entry about all of the times we've used DTrace to solve our problems, but it's clearly not happening soon, so for now I want to stop stalling and at least say a bit about the kind of situations we use DTrace for.
Unlike some people, we don't make routine use of DTrace; it's not a part of ongoing system monitoring, for example. Partly this is because our fileservers spend most of their time not having problems. When stuff sits there quietly working, we don't need to pay much attention to it. There's probably useful information that DTrace could gather for us on an ongoing basis, but we just don't use it that way at the moment.
What we do use DTrace for is deep system investigations during problems and crises. Some of this is having scripts available that can do detailed monitoring of areas of interest to us; when an NFS fileserver problem appears, we can start by firing up our existing information collection scripts. A lot of the time we have merely ordinary problems and the scripts will tell us what they are (a slow disk, a user pushing a huge volume of IO, etc). Some of the time we have extraordinary problems and the existing scripts just let us rule things out.
Some of the time we have a genuinely novel problem, or even a crisis. In these situations we use DTrace to dig into the depths of the live kernel and pull out information we probably couldn't get any other way. This tends to be done with ad hoc, hacked-together scripts instead of anything more carefully developed; as we explore the problem we find questions to ask, write DTrace snippets to give us answers, and repeat the process. Often the questions we're asking (and the answers we're getting) are so specific to the current problem and our suspicions that there's no point in cleaning the resulting scripts up; they're the equivalent of one-off shell scripts and we'll almost certainly never use them again. DTrace is only one of the tools we use in these situations, of course, but it's an extremely valuable one and has let us understand deep issues (although not always solve them).
(Some of the time an ad hoc tool seems useful enough to be turned into something more, even if it turns out that I basically never use it again.)