2023-06-01
Capturing data you need later when using bpftrace
When using bpftrace, it's pretty common that not all of the data you want to report on is available in one spot, at least when you have to trace kernel functions instead of tracepoints. When this comes up, there is a common pattern that you can use to temporarily capture the data for later use. In short, you save the information in an associative array that's indexed by the thread id, which creates a per-thread variable. If you have more than one piece of information to save, you use more than one associative array.
Let's start with the simplest case: suppose that you need both a function's argument (available when it's entered) and its return value (so you can report only on successful calls). Then the pattern looks like this:
kprobe:afunction
{
    // record argument into @arg0
    // under our thread id (tid)
    @arg0[tid] = (struct something *)arg0;
}

// only act if we have the argument
// recorded
kretprobe:afunction
/@arg0[tid] != 0/
{
    $arg = @arg0[tid];
    printf(...., $arg); // or whatever
    // clean up recorded argument
    delete(@arg0[tid]);
}
This example shows all of the common pieces. At the start, we capture the function argument we care about into an associative array that's indexed by the current thread ID (using the tid builtin variable). Then, provided that we have a recorded argument, we use it when the function returns. At the end, we clean up our associative array by deleting our entry from it; if we didn't do this, we might have an ever-growing associative array (or arrays) as different threads called the function we're tracing. Incidentally, one way the kretprobe probe can fire without the argument recorded is if we start tracing while an existing invocation of the function is in flight (which may be especially likely for functions that take a while, such as handling an NFS request and reply).
(This pattern is so common it's mentioned in the documentation as a per-thread variable. Note that the documentation's example delete()s the per-thread entry just as I do here.)
The reason we didn't use a simple global variable, as I did when I was recording ZFS's idea of available memory (in another bpftrace trick), is that multiple threads may be calling this function at the same time. If they are, a single global variable will give us bad results, because one thread's saved value can overwrite another's before the first thread's function returns.
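As a concrete sketch of this pattern, here is a version that traces the kernel's vfs_read(). The choice of vfs_read() and the map names @file and @count are assumptions made for illustration. It also assumes that your kernel lets you attach kprobes to vfs_read() and that its first and third arguments are the struct file pointer and the requested byte count. Since it saves two pieces of information, it uses two per-thread associative arrays.

```bpftrace
// Sketch only: vfs_read() and the map names are illustrative assumptions.
kprobe:vfs_read
{
    // arg0 is assumed to be the struct file pointer, arg2 the byte count.
    @file[tid] = arg0;
    @count[tid] = arg2;
}

// Only act if this thread recorded its arguments on entry.
kretprobe:vfs_read
/@file[tid] != 0/
{
    $ret = (int64)retval;
    // Report only successful reads (a negative return is an error).
    if ($ret >= 0) {
        printf("%s[%d] file=0x%lx asked=%lu got=%ld\n",
               comm, tid, @file[tid], @count[tid], $ret);
    }
    // Clean up both per-thread entries whether or not we reported.
    delete(@file[tid]);
    delete(@count[tid]);
}
```

The guard tests @file rather than @count because a saved file pointer should never be zero, while a requested count legitimately can be.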
Another case that often comes up is that the function we want to trace directly or indirectly calls another function that looks up important information, for example to map some opaque identifier into a more useful piece of data (a string, a structure) and return it. A variant of this is where the function will generate the information we want through a process that we can't hook into, but will then call another function to validate it or act on it, at which point we can grab the data. The full version of this pattern looks something like this:
// set a marker so we know to save info
kprobe:afunction
{
    @aflag[tid] = 1;
}

// if we're marked, save the information
kprobe:subfunction
/@aflag[tid] != 0/
{
    @magicarg[tid] = arg0;
}

// if we have saved information, use it
// and clear it
kretprobe:afunction
/@magicarg[tid] != 0/
{
    .... do whatever ...
    delete(@magicarg[tid]);
}

// clear the marker
kretprobe:afunction
/@aflag[tid] != 0/
{
    delete(@aflag[tid]);
}
One reason we need to set a marker and only save the subfunction's information if we're marked is that the marker is our guarantee that the saved information will be cleared later. If we unconditionally saved the information whenever subfunction() was called but only cleared it when afunction() returned, that would lead to a slow growth of dead @magicarg entries if subfunction() is ever called from anywhere else.
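To make the marker pattern concrete, here is a sketch that again uses vfs_read() as the function of interest and rw_verify_area() as the subfunction. Both choices are assumptions for illustration. It assumes that your kernel's vfs_read() calls rw_verify_area(), that kprobes can be attached to both, and that the struct file pointer is rw_verify_area()'s second argument. As far as I know rw_verify_area() is also called from the write path, which is exactly the situation where an unconditional save would leave dead entries behind.

```bpftrace
// Sketch only: the vfs_read()/rw_verify_area() pairing is an assumption.
// Set a marker so we know to save info.
kprobe:vfs_read
{
    @inread[tid] = 1;
}

// If we're marked, save the struct file pointer (assumed to be arg1).
kprobe:rw_verify_area
/@inread[tid] != 0/
{
    @vfile[tid] = arg1;
}

// If we have saved information, use it and clear it.
kretprobe:vfs_read
/@vfile[tid] != 0/
{
    printf("%s[%d] verified file=0x%lx ret=%ld\n",
           comm, tid, @vfile[tid], (int64)retval);
    delete(@vfile[tid]);
}

// Clear the marker.
kretprobe:vfs_read
/@inread[tid] != 0/
{
    delete(@inread[tid]);
}
```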
A variant on this is if our 'subfunction' is actually a peer function to our function of interest (and gets called before it), with both being called from a containing function. The pattern here is more elaborate; the containing function sets the marker and must clean up everything, with the subfunction and our function saving and using the information.
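A rough sketch of this peer-function variant might look like the following. The names container, peerfunction and afunction are placeholders in the same spirit as the earlier examples, not real kernel functions, so you would have to substitute real ones before bpftrace could attach the probes. The map names and the use of arg0 are likewise assumptions.

```bpftrace
// Sketch only: container, peerfunction and afunction are placeholder names.
// The containing function sets the marker.
kprobe:container
{
    @inside[tid] = 1;
}

// The peer function runs first; save its information if we're marked.
kprobe:peerfunction
/@inside[tid] != 0/
{
    @peerarg[tid] = arg0;
}

// Our function of interest uses the saved information.
kprobe:afunction
/@peerarg[tid] != 0/
{
    printf("%s[%d] afunction arg0=0x%lx peer arg0=0x%lx\n",
           comm, tid, arg0, @peerarg[tid]);
}

// The containing function cleans up everything on return.
kretprobe:container
/@inside[tid] != 0/
{
    delete(@peerarg[tid]);
    delete(@inside[tid]);
}
```

Doing all of the cleanup when the containing function returns means the saved entry goes away even on paths where afunction() is never called.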
Sidebar: Tracking currently active requests/etc in bpftrace
In DTrace, the traditional way to keep a running count of something (such as how many threads were active inside afunction()) was to use a map with a fixed key that was incremented with sum(1) and decremented with sum(-1) (see map functions), with the decrement generally guarded so that you knew a matching increment had been done. Although I haven't tested it, the bpftrace documentation on the ++ and -- operators seems to imply that these are safe to use on at least maps with keys (including constant keys), and perhaps global variables in general. Even if you have to use maps, this is at least clearer than the sum() version.
(You'll want to guard the decrement even if you use --.)
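A minimal sketch of a guarded running count with ++ and -- might look like this. The use of vfs_read(), the map names @counted and @active, and the once-a-second report are all assumptions for illustration. Whether ++ and -- can lose updates when several CPUs hit the probes at once is not something this sketch settles, so treat the number as approximate until you've checked.

```bpftrace
// Sketch only: vfs_read() and the map names are illustrative assumptions.
kprobe:vfs_read
{
    // Remember that this thread was counted, then count it.
    @counted[tid] = 1;
    @active["vfs_read"]++;
}

// Guard the decrement: only undo an increment we actually made.
kretprobe:vfs_read
/@counted[tid] != 0/
{
    delete(@counted[tid]);
    @active["vfs_read"]--;
}

// Print the running count once a second.
interval:s:1
{
    print(@active);
}
```

The per-thread @counted entry is the guard: a return from a call that was already in flight when tracing started finds no entry and so doesn't push the count below zero.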