The interesting error codes from Linux program segfault kernel messages

February 10, 2018

When I wrote up what the Linux kernel's messages about segfaulting programs mean, I described what went into the 'error N' codes and how to work out what any particular one meant, but I didn't inventory them all. Rather than reverse engineer the meaning of an error code every time I run into one, I'm going to list the interesting ones here, in ascending order.

The basic kernel message looks like this:

testp[9282]: segfault at 0 ip 0000000000401271 sp 00007ffd33b088d0 error 4 in testp[400000+98000]

We're interested in the 'error N' portion, and a little bit in the 'at N' portion (which is the faulting address).
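If you want to pull these fields out of logs mechanically, here is a minimal Python sketch. The regular expression is written against the sample line above, not against every format the kernel has ever used, and the field names (`comm`, `addr`, `object`, and so on) and the `parse_segfault` helper are my own labels. It assumes every number in the message other than the PID is printed in hexadecimal.

```python
import re

# Matches the x86 segfault line shown above. Group names are illustrative
# labels, not anything the kernel defines.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+|\(null\)) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<error>[0-9a-f]+)"
    r"(?: in (?P<object>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)

def parse_segfault(line):
    """Return a dict of fields from a segfault message, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["pid"] = int(rec["pid"])
    # A faulting address of 0 is sometimes printed as '(null)'.
    if rec["addr"] == "(null)":
        rec["addr"] = "0"
    # Assumption: all remaining numeric fields are hexadecimal.
    for key in ("addr", "ip", "sp", "error", "base", "size"):
        if rec[key] is not None:
            rec[key] = int(rec[key], 16)
    return rec
```

For the sample message, `rec["ip"] - rec["base"]` works out to 0x1271, which is the offset of the faulting instruction inside the mapped `testp` object.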

For all of these, the fault happens in user mode so I'm not going to mention it specifically for each one. Also, the list of potential reasons for these segfaults is not exhaustive or fully detailed.

  • error 4: (data) read from an unmapped area.

    This is your classic wild pointer read. On 64-bit x86, most of the address space is unmapped so even a program that uses a relatively large amount of memory is hopefully going to have most bad pointers go to memory that has no mappings at all.

    A faulting address of 0 is a NULL pointer and falls into page zero, the lowest page in memory. The kernel prevents people from mapping page zero, and in general low memory is never mapped, so reads from small faulting addresses should always be error 4s.

  • error 5: read from a memory area that's mapped but not readable.

    This is probably a read through a pointer that is so wild that it points somewhere in the kernel's area of the address space. It might be a guard page, but at least some of the time mmap()'ing things with PROT_NONE appears to make Linux treat them as unmapped areas, so you get error code 4 instead. You might think this could be an area mmap()'d with other permissions but without PROT_READ, but it appears that in practice other permissions imply the ability to read the memory as well.

    (I assume that the Linux kernel is optimizing PROT_NONE mappings by not even creating page table entries for the memory area, rather than carefully assembling PTEs that deny all permissions. The error bits come straight from the CPU, so if there are no PTEs the CPU says 'fault for an unmapped area' regardless of what Linux thinks and will report in, eg, /proc/PID/maps.)

  • error 6: (data) write to an unmapped area.

    This is your classic write to a wild or corrupted pointer, including to (or through) a null pointer. As with reads, writes to guard pages mmap()'d with PROT_NONE will generally show up as this, not as 'write to a mapped area that denies permissions'.

    (As with reads, all writes with small faulting addresses should be error 6s because no one sane allows low memory to be mapped.)

  • error 7: write to a mapped area that isn't writable.

    This is either a wild pointer that was unlucky enough to wind up pointing to a bit of memory that was mapped, or an attempt to change read-only data, for example the classical C mistake of trying to modify a string constant (as seen in the first entry). You might also be trying to write to a file that was mmap()'d read only, or in general a memory mapping that lacks PROT_WRITE.

    (All attempts to write to the kernel's area of address space also get this error, instead of error 6.)

  • error 14: attempt to execute code from an unmapped area.

    This is the sign of trying to call through a mangled function pointer (or a NULL one), or perhaps returning from a call when the stack is in an unexpected or corrupted state so that the return address isn't valid. One source of mangled function pointers is use-after-free issues where the (freed) object contains embedded function pointers.

    (Error 14 with a faulting address of 0 often means a function call through a NULL pointer, which in turn often means 'making an indirect call to a function without checking that it's defined'. There are various larger scale causes of this in code.)

  • error 15: attempt to execute code from a mapped memory area that isn't executable.

    This is probably still a mangled function pointer or return address; it's just that you're unlucky (or lucky) and there's mapped memory there instead of nothing.

    (Your code could have confused a function pointer with a data pointer somehow, but this is a much rarer mistake than confusing writable data with read-only data.)
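All of the codes in the list above are combinations of a few bits from the x86 page fault error code, and as far as I know the kernel prints the value in hexadecimal, so 'error 14' and 'error 15' are really 0x14 and 0x15. The following Python sketch decodes a code from those bits. The bit layout is the standard x86 one; the constant names, the `describe_error` function and the wording of its output are my own, and it ignores the less common bits.

```python
# x86 page fault error code bits. The names are illustrative labels.
PROT = 0x01    # set: a page was present and denied the access; clear: no page
WRITE = 0x02   # set: the access was a write; clear: a read
USER = 0x04    # set: the fault happened in user mode
INSTR = 0x10   # set: the access was an instruction fetch

def describe_error(code_text):
    """Describe the 'error N' value from a segfault message (N is hex text)."""
    code = int(code_text, 16)
    if code & INSTR:
        access = "execute"
    elif code & WRITE:
        access = "write"
    else:
        access = "read"
    area = "mapped area that denies it" if code & PROT else "unmapped area"
    mode = "user" if code & USER else "kernel"
    return f"{mode}-mode {access}, {area}"

for n in ("4", "5", "6", "7", "14", "15"):
    print("error", n, "->", describe_error(n))
```

Note that 'unmapped' here means what the CPU saw (no present page), which is why PROT_NONE guard pages can show up as error 4 or 6.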

If you're reporting a segfault bug in someone else's program, the error code can provide useful clues as to what's wrong. Combined with the faulting address and the instruction pointer at the time, it might be enough for the developers to spot the problem even without a core dump. If you're debugging your own programs, well, hopefully you have core dumps; they'll give you a lot of additional information (starting with a stack trace).

(Now that I know how to decode them, I find these kernel messages to be interesting to read just for the little glimpses they give me into what went wrong in a program I'm using.)

On 64-bit x86 Linux, generally any faulting address over 0x7fffffffffff will be reported as having a mapping and so you'll get error codes 5, 7, or 15, respectively, for a read, a write, and an attempt to execute. These are always wild or corrupted pointers (or addresses more generally), since you never have valid user space addresses up there.
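As a small Python sketch of that rule of thumb: the boundary constant is the one quoted above, while the names `USER_SPACE_TOP`, `HIGH_ADDRESS_CODES` and `is_beyond_user_space` are hypothetical ones of mine. The error codes are stored as hexadecimal values on the assumption that this is how the kernel prints them.

```python
# Highest user space address on 64-bit x86 Linux, as given in the text.
USER_SPACE_TOP = 0x7FFFFFFFFFFF

# Error codes to expect for faults above that boundary, by kind of access.
HIGH_ADDRESS_CODES = {"read": 0x5, "write": 0x7, "execute": 0x15}

def is_beyond_user_space(addr):
    """True if a faulting address cannot be a valid user space address."""
    return addr > USER_SPACE_TOP

# A pointer that has wandered into the kernel's half of the address space.
wild = 0xFFFF880012345678
if is_beyond_user_space(wild):
    print(f"{wild:#x}: wild pointer, expect error "
          f"{HIGH_ADDRESS_CODES['read']:x} on a read")
```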

A faulting address of 0 (sometimes printed as '(null)', as covered in the first entry) is a NULL pointer itself. A faulting address that is small, for example 0x18 or 0x200, is generally an offset from a NULL pointer. You get these offsets if you have a NULL pointer to a structure and you try to look at one of the fields (in C, 'sptr = NULL; a = sptr->fld;'), or you have a NULL pointer to an array or a string and you're looking at an array element or a character some distance into it. Under some circumstances a very large address, one near 0xffffffffffffffff (the very top of memory space), can be a sign of a NULL pointer that your code then subtracted from.

(If you see a fault address of 0xffffffffffffffff itself, it's likely that your code is treating -1 as a pointer or is failing to check the return value of something that returns a pointer or '(type *)-1' on error. Sadly there are C APIs that are that perverse.)
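These NULL-related heuristics can be written down as a rough Python sketch. The `classify_address` function is hypothetical, and the cutoff for what counts as a 'small' offset (0x10000 here) is an arbitrary assumption of mine, not a number from the kernel.

```python
ADDRESS_SPACE = 2 ** 64
SMALL = 0x10000  # assumed cutoff for 'a small offset'; pick your own

def classify_address(addr):
    """Guess whether a 64-bit faulting address looks NULL-related."""
    if addr == 0:
        return "NULL pointer"
    if addr < SMALL:
        return f"NULL pointer plus offset {addr:#x}"
    if addr == ADDRESS_SPACE - 1:
        return "-1 used as a pointer"
    if addr > ADDRESS_SPACE - SMALL:
        return f"NULL pointer minus offset {ADDRESS_SPACE - addr:#x}"
    return "no NULL-related pattern"

for a in (0x0, 0x18, 0x200, 0xFFFFFFFFFFFFFFF8, 0xFFFFFFFFFFFFFFFF):
    print(f"{a:#x}: {classify_address(a)}")
```

These are only hints. A small faulting address could in principle be a genuinely corrupted pointer that happens to be small, so treat the result as a starting point for looking at the code.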


Comments on this page:

Ehm, why aren't these codes expanded into some simple text error message?

SCSI, for example, has CONFIG_SCSI_CONSTANTS for more detailed error codes, and similar behaviour could be added for these errors, even without a config option.

Written on 10 February 2018.

Last modified: Sat Feb 10 00:49:09 2018