Some notes on errno when tracing Linux kernel system call results

July 13, 2023

Suppose, not entirely hypothetically, that you want to print out some information about every fcntl() lock call that fails, system-wide. These days this is relatively easy to do with bpftrace, especially since there are system call entry and exit tracepoints. However, you might reasonably wonder how the fcntl(2) system call actually returns errno, the error code, and how this manifests at the level of the sys_exit_fcntl syscalls tracepoint. As it turns out, there's some tribal knowledge and peculiarities here.

First off, in most contexts inside the Linux kernel, errno values are represented as negative values. If a call returns an error, it will return, e.g., '-ELOOP' (this can be the source of interesting bugs). This is how errno is reported in (most) system call exits, including for fcntl(). So the answer is that in tracepoint:syscalls:sys_exit_fcntl in bpftrace, a failed call shows up as a negative args->ret, which is the negated errno value. You don't have the system call arguments handy in the exit handler, but you can write something like this, using the usual pattern of capturing data in the entry handler for later:

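// Only track commands above F_SETFL (4). On x86 Linux this covers the
// locking commands: F_GETLK is 5, F_SETLK is 6 and F_SETLKW is 7.
tracepoint:syscalls:sys_enter_fcntl
/args->cmd > 4/
{
  // Save the arguments per thread; they aren't available at syscall exit.
  @fd[tid] = args->fd;
  @cmd[tid] = args->cmd;
  @flag[tid] = 1;
}

tracepoint:syscalls:sys_exit_fcntl
/@flag[tid] != 0/
{
  // A negative return value is a negated errno.
  if (args->ret < 0) {
    printf("FAIL: fcntl(%u, %u, ...) = %ld  for '%s' PID %lu UID %lu\n", @fd[tid], @cmd[tid], args->ret, comm, pid, uid);
  }
  // Clean up the per-thread state whether or not the call failed.
  delete(@fd[tid]);
  delete(@cmd[tid]);
  delete(@flag[tid]);
}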

You can turn ordinary errno numbers into the relevant errno name with the 'errno' command (from moreutils), although you'll have to make them positive again first:

$ errno 9
EBADF 9 Bad file descriptor
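
If you'd rather do this lookup in a script that post-processes the bpftrace output, Python's standard errno module can do roughly the same job. This is only a minimal sketch: the helper name describe_ret is made up for illustration, and it assumes you run it on the same sort of Linux system that produced the numbers, since errno values are platform specific.

```python
import errno
import os


def describe_ret(ret):
    """Describe a raw syscall return value as seen in args->ret."""
    if ret >= 0:
        return "success (%d)" % ret
    num = -ret  # the kernel reports errors as negated errno values
    # errno.errorcode maps numbers to names; unknown numbers get a placeholder.
    name = errno.errorcode.get(num, "unknown")
    return "%s %d %s" % (name, num, os.strerror(num))


print(describe_ret(-9))
```

The output has the same general shape as what the errno command prints, with the name first, then the number, then the message.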

However, if you run a bpftrace program like this for long enough, you may begin to see very odd reported errnos, such as '-512'. The errno command will not tell you about these and you won't find them listed in sources like errno(3). The reason is that these are internal-use errno codes, which you can find listed in the kernel's include/linux/errno.h. The most common one I've seen is -512, which is ERESTARTSYS. As for why I'm seeing them, I'll quote the comment in the file:

These should never be seen by user programs. To return one of ERESTART* codes, signal_pending() MUST be set. Note that ptrace can observe these at syscall exit tracing, but they will never be left for the debugged user process to see.

Unsurprisingly, if ptrace() can see them, so can kernel tracepoints. Whether or not you make your bpftrace code skip over reporting them is up to you, but I'm probably going to do that (since these values are never returned to user level).
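
As a rough illustration of the skipping, here is a small Python sketch of the check, of the sort you might use when filtering logged return values afterward. The function name should_report and the constant name are hypothetical. The cut-off of 512 is my assumption that ERESTARTSYS is the lowest-numbered internal code in include/linux/errno.h and that all ordinary errno values are smaller, so check it against your kernel source.

```python
# Assumption: kernel-internal codes start at 512 (ERESTARTSYS) in
# include/linux/errno.h, and ordinary user-visible errnos are all smaller.
FIRST_INTERNAL_ERRNO = 512


def should_report(ret):
    """Return True if ret is a failure that user space could actually see."""
    if ret >= 0:
        return False  # success, nothing to report
    # Skip ERESTARTSYS and the other internal-only codes.
    return -ret < FIRST_INTERNAL_ERRNO


for ret in (0, -9, -512):
    print(ret, should_report(ret))
```

The same cut-off should also work directly in the bpftrace exit handler, by testing args->ret > -512 alongside args->ret < 0.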

As a side note, if I'm reading the kernel source code correctly, ERESTARTSYS is handled basically by moving the user process's instruction pointer back to the start of the system call, so that when the kernel returns to the process, the process just makes the system call again. See arch_do_signal_or_restart() in arch/x86/kernel/signal.c. This strikes me as simultaneously elegant and terrifying.

(This elaborates on a Fediverse post of mine.)

Last modified: Thu Jul 13 22:51:53 2023