2023-05-30
Some tricks for getting the data you need when using bpftrace
When I talked about drgn versus bpftrace, I mentioned that one issue with bpftrace is that it doesn't have much access to global variables in the kernel (and things that they point to); at the moment it seems that bpftrace can only access (some) global variables in the main kernel, and not global variables in modules. However, often the information you may want to get is in module global variables, for example the NFS locks that the kernel NFS server is tracking or important state variables for changes in the ZFS ARC target size. When you want to get at these, you need to resort to a number of tricks, which all boil down to one idea: you find a place where what you want to know is exposed as a function argument or a function return value, because bpftrace has access to both of those.
(All of this means that you're going to need to read the kernel source, specifically the kernel source for the version of the kernel you're using, since the internal kernel structure changes over time.)
If you're really lucky, a function or kernel tracepoint that you already want to track will be passed the information you're interested in. This is unfortunately relatively rare, probably because there's usually no point in passing in an argument that's already available as a global variable.
Sometimes, you'll be able to find something that is called once on each item in a complex global data structure, which will let you indirectly see that global data structure. This was the case with bpftrace dumping of NFS lock clients, which also illustrates that you may need to do something to trigger this traversal (here, reading from /proc/locks). In general, a file in /proc is often produced by a kernel function that generates one line of it at a time, and that function is passed, as an argument, the thing that the line reports on.
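As a minimal sketch of this per-item approach for /proc/locks: the function name lock_get_status() and the position of the lock pointer as its second argument are my assumptions about one particular kernel's fs/locks.c, not something stated above. Check your own kernel source, and confirm the function is traceable at all with "bpftrace -l 'kprobe:lock_get_status'".

```bpftrace
// Assumption: lock_get_status() is the per-lock function behind /proc/locks
// in this kernel, and its second argument (arg1) points to the lock.
kprobe:lock_get_status
{
    // Report who triggered the traversal and the address of each lock seen.
    printf("%s (pid %d) reported lock at 0x%lx\n", comm, pid, arg1);
}
```

Nothing is printed until something walks the lock list, so with this running you would read /proc/locks from another terminal. Getting useful fields out of the lock would mean casting arg1 to the right structure type for your kernel version.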
Some kernel code is generalized by calling a function to obtain information that's effectively from a global variable (or something close to it). For example, ZFS on Linux has an idea of 'memory available to ZFS' that's a critical input to decisions on the ZFS ARC size, and this number is obtained by calling the function arc_available_memory(). If we want to know this value in other functions (for example, the ZFS functions that decide about shrinking the ARC target size), we can capture the information for later use:
kretprobe:arc_available_memory { $rv = (int64) retval; @arc_available_memory = $rv; }
Here I'm capturing this information in a global bpftrace variable (a map, in bpftrace terms), because it truly is a global piece of information. ZFS may call this function in many contexts, not just when thinking about shrinking the ARC target size, but all we care about is having the value available later, so the extra updates to our bpftrace global generally don't matter.
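To show how the captured value might be used later, here is a minimal sketch that pairs the kretprobe with a second probe. It assumes that the ARC-shrinking function is called arc_reduce_target_size() and that its first argument is the amount it has been asked to free. Both are assumptions that depend on your ZFS version, and the function may be named differently or inlined, so check the source and "bpftrace -l" first.

```bpftrace
// Remember the most recent return value of arc_available_memory().
kretprobe:arc_available_memory
{
    @arc_available_memory = (int64) retval;
}

// Assumption: arc_reduce_target_size() is the ARC-shrinking function in
// this ZFS version and arg0 is the amount it has been asked to free.
kprobe:arc_reduce_target_size
{
    printf("shrink by %ld, last available memory %ld\n",
           (int64) arg0, @arc_available_memory);
}
```

The value printed is whatever the most recent call to arc_available_memory() returned, on any CPU and in any context, which is usually close enough for this sort of investigation but isn't guaranteed to be the value that the shrinking decision actually used.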
There are two unfortunate limitations of this approach, due to how the kernel is structured. First, some of what look like function calls in the kernel source code are actually #define'd macros in the kernel header files; you obviously can't hook into these with bpftrace. Second, some functions are inlined into their callers, often because they've specifically been marked as 'always inline'. These functions can't be traced either, which can be a pity because they're often exactly the sort of access functions that'd give us useful information.
(There are some general bpftrace techniques for picking up information that you want, but they're for another entry.)
PS: I believe that bpftrace can access CPU registers (and thus the stack) and can insert tracepoints inside functions, not just at their start. In theory with enough work this would allow you to get access to any value ever explicitly materialized at some point in a function (either in a register or in a local on the stack). In practice, this would be at best a desperation move; you'd have to disassemble code in your specific kernel to determine instruction offsets and other critical information in order to pull this off.
PPS: In theory with sufficient work you might be able to get access to module global variables in bpftrace. Their addresses are in /proc/kallsyms and I think you might be able to insert that address into a bpftrace script, then cast it to the relevant (pointer) type and dereference it. But this is untested and again I wouldn't want to do this in anything real.