You probably need to think about how to handle core dumps on modern Linux servers

April 30, 2018

Once upon a time, life was simple. If and when your programs hit fatal problems, they generally dumped core in their current directory under the name core (sometimes you could arrange for it to be core.<PID>). You might or might not ever notice these core files, and some of the time they might not get written at all because of various permissions issues (see the core(5) manpage). Then complications ensued in the form of things like Apport, ABRT, and systemd-coredump, as an increasing number of Linux distributions decided to take advantage of the full power of the kernel.core_pattern sysctl to capture core dumps themselves.

(The Ubuntu Apport documentation claims that it's disabled by default on 'stable' releases. This does not appear to be true any more.)
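A quick way to see which situation a given machine is in is to look at kernel.core_pattern, which is exposed as /proc/sys/kernel/core_pattern. Per core(5), a value starting with '|' means the kernel pipes core dumps to a helper program, which is how Apport, ABRT, and systemd-coredump hook in; anything else is a file name template. Here is a minimal Python sketch of that check. describe_core_pattern is a hypothetical helper written for illustration, and it only looks at the first word of a piped pattern to name the helper.

```python
from pathlib import Path

# Standard procfs location of the kernel.core_pattern sysctl.
CORE_PATTERN = Path("/proc/sys/kernel/core_pattern")


def describe_core_pattern(pattern):
    """Classify a kernel.core_pattern value (hypothetical helper).

    Returns ("pipe", helper_path) if the kernel pipes cores to a program,
    or ("file", template) if it writes core files itself.
    """
    pattern = pattern.strip()
    if pattern.startswith("|"):
        # Everything after '|' is a command line; the first word is the program.
        argv = pattern[1:].split()
        return ("pipe", argv[0] if argv else "")
    return ("file", pattern)


if __name__ == "__main__":
    if CORE_PATTERN.exists():
        kind, detail = describe_core_pattern(CORE_PATTERN.read_text())
        print(kind, detail)
```

On an Ubuntu machine where Apport is active you'd expect the 'pipe' case with a helper somewhere under /usr/share/apport, although the exact path and arguments can vary between releases.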

In a perfect world, systems like Apport would capture core dumps from system programs for themselves and arrange for everything else to be handled in the traditional way, by writing a core file. Unfortunately this is not a perfect world. In this world, systems like Apport almost always either discard your core files entirely or hide them away where you need special expertise to find them. In many situations this may not be what you want, in which case you need to think about what you do want and what the best way to get it is.

I think that your options break down like this:

  • If you're only running distribution-provided programs, you can opt to leave Apport and its kin intact. Intercepting and magically handling core dumps from standard programs is their bread and butter, and the result will probably give you the smoothest way to file bug reports with your distribution. Since you're not running your own programs, you don't care how Apport does (or doesn't) handle core dumps from non-system programs.

  • Disable any such system and set kernel.core_pattern to something useful; I like 'core.%u.%p', which names each core file after the UID and PID of the process that dumped it. If the system only runs your services and no users have access to it, you might want to have all core dumps written to some central directory that you monitor; otherwise, you probably want core dumps to go in the process's current directory.

    The drawback of this straightforward approach is that you'll fail to capture core dumps from some processes, such as processes that can't write to their current directory.

  • Set up your own program to capture core dumps and save them somewhere, by pointing kernel.core_pattern at it with a leading '|'. The advantages of such a program are that you can capture core dumps under more circumstances and that you can immediately trigger alerting and other things if particular programs or processes die. You could even identify when you have a core dump for a system program and pass it on to Apport, systemd-coredump, or whatever the distribution's native system is.

    One drawback of this is that if you're not careful, your core dump handler can hang your system.
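When core_pattern starts with '|', the kernel runs the named program and feeds it the core image on standard input, with the % specifiers on the rest of the line expanded into its arguments (see core(5)). Below is a minimal Python sketch of such a handler. It assumes a hypothetical install path of /usr/local/sbin/save-core, a hypothetical destination directory of /var/cores, and a hypothetical 2 GiB size cap. It deliberately does nothing except copy bytes to local disk; anything slow, such as alerting or forwarding to the distribution's handler, is better done by something else afterward, since a handler that blocks is one way to get the hang described above.

```python
import os
import sys

# Hypothetical settings for this sketch; pick values that suit your systems.
DEST_DIR = "/var/cores"
MAX_BYTES = 2 * 1024 ** 3  # keep at most 2 GiB of any one core image
CHUNK = 1 << 20            # read 1 MiB at a time


def save_core(src, dest_dir, pid, uid, exe, max_bytes=MAX_BYTES):
    """Copy a core image from the binary stream src into dest_dir.

    Returns the path written. Past max_bytes it keeps draining src and
    discards the data instead of stalling, so the kernel can finish
    writing the dump.
    """
    os.makedirs(dest_dir, mode=0o700, exist_ok=True)
    name = "core.{}.{}.{}".format(uid, pid, os.path.basename(exe))
    path = os.path.join(dest_dir, name)
    # O_EXCL makes the open fail rather than reuse or follow an existing path.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    written = 0
    with os.fdopen(fd, "wb") as out:
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            if written < max_bytes:
                keep = chunk[: max_bytes - written]
                out.write(keep)
                written += len(keep)
    return path


if __name__ == "__main__" and len(sys.argv) >= 4:
    # Invoked by the kernel via a core_pattern such as
    #   |/usr/local/sbin/save-core %p %u %e
    # %e goes last because an executable name may contain spaces.
    pid, uid = sys.argv[1], sys.argv[2]
    exe = " ".join(sys.argv[3:])
    save_core(sys.stdin.buffer, DEST_DIR, pid, uid, exe)
```

Keeping the handler this dumb is a design choice: the less it does while the crashing process is still being dumped, the fewer ways it has to get stuck.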

If you have general users running things on your servers, and those things may segfault or otherwise dump core, it's my view that you probably want the middle option of just having them write traditional core files to their current directory. People doing development tend to like having core files for debugging, and this option is likely to be a lot easier than trying to educate everyone on how to extract core dumps from the depths of the system (if this is even possible; it's theoretically possible with systemd at least).
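To see where a pattern like 'core.%u.%p' actually puts files, here is a small Python sketch that imitates the kernel's expansion of a non-piped core_pattern for a subset of the core(5) specifiers (%p, %u, %g, %s, %t, %h, %e, and %%). expand_core_pattern is a hypothetical function written for illustration, not a kernel or library API. It also ignores kernel.core_uses_pid, which makes the kernel append .PID to patterns that lack %p.

```python
import os


def expand_core_pattern(pattern, *, pid, uid, gid, sig, exe, host, when, cwd):
    """Approximate the core file path the kernel would use (hypothetical helper).

    Like the kernel, a trailing '%' and unknown '%x' pairs are dropped; this
    sketch also drops real specifiers it doesn't implement (e.g. %c, %d, %P).
    """
    values = {"p": str(pid), "u": str(uid), "g": str(gid), "s": str(sig),
              "e": exe, "h": host, "t": str(when), "%": "%"}
    out = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == "%":
            if i + 1 < len(pattern):
                out.append(values.get(pattern[i + 1], ""))
            i += 2  # skip the '%' and its specifier (or run off the end)
        else:
            out.append(c)
            i += 1
    name = "".join(out)
    # A pattern that isn't an absolute path is relative to the crashing
    # process's current directory.
    return name if name.startswith("/") else os.path.join(cwd, name)
```

For example, a program running as UID 1000 with PID 4242 in /home/u/src would leave /home/u/src/core.1000.4242 under 'core.%u.%p', right next to where its developer was working.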

Up until now we've just passively accepted Apport's default on our Ubuntu 16.04 systems. But now we're considering what we want to change for Ubuntu 18.04, and I've been reminded of this whole issue by Julia Evans' How to get a core dump for a segfault on Linux (where she ran into the Apport issue). So I think we want to switch to the traditional 'write a core file' setup (which is how it was in Ubuntu 14.04).

(Also, Apport has had its share of security issues over the years, e.g. 1, 2.)

PS: Since systemd now wants to handle core dumps, I suspect that this is going to be an issue in more and more Linux distributions. Or maybe everyone is going to make sure that that part of systemd doesn't get turned on.
