Some notes on errno when tracing Linux kernel system call results
Suppose, not entirely hypothetically, that you want to print out
some information about every fcntl() lock call that fails,
system-wide. These days this is relatively easy to do with bpftrace,
especially since there are system call entry and exit tracepoints.
However, you might reasonably wonder how the fcntl(2) system call
actually returns errno, the error code, and how this manifests
at the level of the sys_exit_fcntl syscalls tracepoint. As it
turns out, this involves some tribal knowledge and a few peculiarities.
First off, in most contexts inside the Linux kernel, errno values
are represented as negative values. If a call returns an error,
it will return, e.g., '-ELOOP' (this can be the source of interesting
bugs). This is how errno is reported in (most) system call exits,
including for fcntl(). So the answer is that in
tracepoint:syscalls:sys_exit_fcntl in bpftrace, args->ret will be
below zero when the call fails, holding the negated errno value. You
don't have the system call arguments handy in the exit handler, but
you can write something like this, using the pattern of capturing
data at entry for later:
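
  // Entry: remember the arguments of fcntl() calls with cmd > 4,
  // i.e. F_GETLK (5 on x86-64 Linux) and higher-numbered commands.
  tracepoint:syscalls:sys_enter_fcntl
  /args->cmd > 4/
  {
      @fd[tid] = args->fd;
      @cmd[tid] = args->cmd;
      @flag[tid] = 1;
  }

  // Exit: report failures (negative return), then clean up the maps.
  tracepoint:syscalls:sys_exit_fcntl
  /@flag[tid] != 0/
  {
      if (args->ret < 0) {
          printf("FAIL: fcntl(%u, %u, ...) = %ld for '%s' PID %lu UID %lu\n",
                 @fd[tid], @cmd[tid], args->ret, comm, pid, uid);
      }
      delete(@fd[tid]);
      delete(@cmd[tid]);
      delete(@flag[tid]);
  }
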
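For contrast with the kernel-side view, here is a minimal Python sketch of how the same kind of failure looks from user space. It assumes CPython on Linux, where (as far as I know) fcntl.lockf() is implemented on top of fcntl() with F_SETLK; the helper name failing_lock_errno is made up for illustration. The C library turns the kernel's negative return into -1 plus a positive errno, which Python surfaces as an OSError, so negating that errno gives the value the exit tracepoint should report in args->ret.

```python
import errno
import fcntl
import os


def failing_lock_errno():
    """Try to lock a closed file descriptor and return the positive errno."""
    # Obtain a descriptor number that is certainly invalid by opening
    # a pipe and closing both ends again.
    rfd, wfd = os.pipe()
    os.close(rfd)
    os.close(wfd)
    try:
        # Assumption: CPython implements lockf() via fcntl(fd, F_SETLK, ...).
        fcntl.lockf(rfd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as e:
        return e.errno
    return 0  # the call unexpectedly succeeded


if __name__ == "__main__":
    err = failing_lock_errno()
    # User space sees the positive errno; the tracepoint sees it negated.
    print(f"user space errno: {err} ({errno.errorcode.get(err, '?')})")
    print(f"expected args->ret in the exit tracepoint: {-err}")
```

Running this while the bpftrace program is active should produce one FAIL line for the Python process, which is a cheap way to check that the tracing side works.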
You can turn ordinary errno numbers into the relevant errno name with
the 'errno' command, although you'll have to make them positive again:

  $ errno 9
  EBADF 9 Bad file descriptor

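If the errno command isn't available, the same lookup can be sketched in a few lines of Python using the standard errno and os modules. The function name describe_ret is hypothetical; it takes the raw (negative) value as printed from args->ret and flips the sign itself.

```python
import errno
import os


def describe_ret(ret):
    """Describe a raw syscall return value as seen in args->ret."""
    if ret >= 0:
        return f"{ret} (success)"
    num = -ret  # the kernel reports errors as negated errno values
    name = errno.errorcode.get(num)
    if name is None:
        # Not in Python's table of user-visible errno values.
        return f"{ret} (unknown errno {num})"
    return f"{ret} ({name}: {os.strerror(num)})"


if __name__ == "__main__":
    for value in (0, -9, -13):
        print(describe_ret(value))
```

The errno.errorcode table only covers the errno values that the platform's C library exposes, so anything else falls through to the "unknown" branch rather than being guessed at.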
However, if you run a bpftrace program like this for long enough,
you may begin to see very odd reported errnos such as '-512'. The
errno command will not tell you about these and you won't find them
listed in sources like errno(3). The reason for this is that these
are basically internal-use errno codes, which you can find listed in
the kernel's include/linux/errno.h. The most common one I've seen is
-512, which is ERESTARTSYS. As for why I'm seeing them, I'll quote
the comment in the file:

  These should never be seen by user programs. To return one of
  ERESTART* codes, signal_pending() MUST be set. Note that ptrace can
  observe these at syscall exit tracing, but they will never be left
  for the debugged user process to see.

Unsurprisingly, if ptrace() can see them, so can kernel tracepoints. Whether or not you make your bpftrace code skip over reporting them is up to you, but I'm probably going to do that (since these values are never returned to user level).
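As a rough sketch of what skipping them could look like when post-processing trace output, here is a small Python filter. The numeric values are the ones I believe are defined at the start of the kernel's include/linux/errno.h; the table is deliberately not exhaustive, and the function names internal_errno_name and should_report are made up for illustration.

```python
# Kernel-internal errno codes, as listed in include/linux/errno.h.
# Assumption: this is only the first few entries of that file.
KERNEL_INTERNAL_ERRNOS = {
    512: "ERESTARTSYS",
    513: "ERESTARTNOINTR",
    514: "ERESTARTNOHAND",
    515: "ENOIOCTLCMD",
    516: "ERESTART_RESTARTBLOCK",
}


def internal_errno_name(ret):
    """Return the internal errno name for a raw args->ret value, or None."""
    if ret >= 0:
        return None
    return KERNEL_INTERNAL_ERRNOS.get(-ret)


def should_report(ret):
    """True for failures that user space could actually observe."""
    return ret < 0 and internal_errno_name(ret) is None


if __name__ == "__main__":
    for value in (0, -9, -512, -514):
        print(value, should_report(value), internal_errno_name(value))
```

Inside the bpftrace program itself, the analogous change would presumably be an extra condition next to the args->ret < 0 test, for example also requiring args->ret != -512.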
As a side note, if I'm reading the kernel source code correctly, ERESTARTSYS is handled basically by moving the user process's instruction pointer back to the start of the system call, so that when the kernel returns to the process, the process just makes the system call again. See arch_do_signal_or_restart() in arch/x86/kernel/signal.c. This strikes me as simultaneously elegant and terrifying.
(This elaborates on a Fediverse post of mine.)