Some notes about kernel crash dumps in Illumos

November 18, 2018

I tweeted:

On our OmniOS servers, we should probably turn off writing kernel crash dumps on panics. It takes far too long, it usually doesn't succeed, and even if it did the information isn't useful to us in practice (we're using a very outdated version & we're frozen on it).

We're already only saving kernel pages, which is the minimum setting in dumpadm, but our fileservers still take at least an hour+ to write dumps. On a panic, we need them back in service in minutes (as few as possible).

The resulting Twitter discussion got me to take a look into the current state of the code for this in Illumos, and I wound up discovering some potentially interesting things. First off, dump settings are not auto-loaded or auto-saved by the kernel in some magical way; instead dumpadm saves all of your configuration settings in /etc/dumpadm.conf and then sets them during boot through svc:/system/dumpadm:default. The dumpadm manual page will tell you all of this if you read its description of the -u argument.
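(To make this concrete, here's roughly what the pieces look like; the exact keys and values in /etc/dumpadm.conf will vary from system to system, and the zvol path and savecore directory here are just examples.)

    # /etc/dumpadm.conf -- normally written by dumpadm itself, not by hand
    DUMPADM_DEVICE=/dev/zvol/dsk/rpool/dump
    DUMPADM_SAVDIR=/var/crash/myhost
    DUMPADM_CONTENT=kernel
    DUMPADM_ENABLE=yes

    # what the dumpadm SMF service effectively does at boot: push the
    # saved configuration back into the running kernel
    dumpadm -u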

Next, the -z argument to dumpadm is inadequately described in the manual page. The 'crash dump compression' it's talking about is whether savecore will write compressed dumps; it has nothing to do with how the kernel writes out the crash dump to your configured crash device. In fact, dumpadm has no direct control over basically any of that process; if you want to change things about the kernel dump process, you need to set kernel variables through /etc/system (or 'mdb -k').
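(As an illustration of the syntax for both mechanisms, not a suggestion to actually change anything, here's how you'd set a dump tunable; I'm using dump_plat_mincpu from the sidebar below purely as the example variable.)

    # persistent across reboots: add a line like this to /etc/system
    set dump_plat_mincpu = 16

    # or poke the running kernel with mdb; '0t' marks a decimal value
    echo 'dump_plat_mincpu/W 0t16' | mdb -kw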

The kernel writes crash dumps in multiple steps. If your console shows the message 'dumping to <something>, offset NNN, contents: <...>', then you've at least reached the start of writing out the crash dump. If you see updates of the form 'dumping: MM:SS N% done', the kernel has reached the main writeout loop and is writing out pages of memory, perhaps excessively slowly. As far as I can tell from the code, crash dumps don't abort when they run out of space on the dump device; they keep processing things and just throw all of the work away.

As it turns out, the kernel always compresses memory as it writes it out, although this is obscured by the current state of the code. The short version is that unless you set non-default system parameters that you probably don't want to, current Illumos systems will always do single threaded lzjb compression of memory (where the CPU that is writing out the crash dump also compresses the buffers before writing). Although you can change things to do dumps with multi-threaded compression using either lzjb or bzip2, you probably don't want to, because the multi-threaded code has been deliberately disabled and is going to be removed sometime. See Illumos issue 3314 and the related Illumos issue 1369.

(As a corollary of kernel panic dumps always compressing with at least lzjb, you probably should not have compression turned on on your dump zvol (which I believe is the default).)
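(Checking and changing this is the usual zfs business; 'rpool/dump' here is just the conventional name for the dump zvol and may be different on your system.)

    zfs get compression,compressratio rpool/dump
    zfs set compression=off rpool/dump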

I'm far from convinced that single threaded lzjb compression can reach and sustain the full write speed of our system SSDs on our relatively slow CPUs, especially during a crash dump (when I believe there's relatively little write buffering going on), although for obvious reasons it's hard to test. People with NVMe drives might have problems even with modern fast hardware.

If you examine the source of dumpsubr.c, you'll discover a tempting variable dump_timeout that's set to 120 (seconds) and described as 'timeout for dumping pages'. This description is a little bit misleading; what it really means is 'timeout for dumping a single set of pages'. There is no limit on how long the kernel is willing to keep writing out pages for, provided that each set of pages makes enough progress within its 120 seconds. In our case this is unfortunate, since we'd be willing to spend a few minutes to gather a bit of crash information but not anything like what a kernel dump appears to take on our machines.
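(If you want to double-check the value on a live system instead of in the source, mdb will read it back for you; going by the source it should print 120.)

    echo 'dump_timeout/D' | mdb -k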

(The good news is that if you run out of space on your dump device, the dump code is at least smart enough to not spend any more time trying to compress pages; it just throws them away right away. You might run out of space because you're taking a panic dump from a ZFS fileserver with 128 GB of RAM and putting it on an 8 GB dump zvol that is part of an rpool that lives on 80 GB SSDs, where a full-sized kernel dump almost certainly couldn't even be saved by savecore anyway.)
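(If you do have the pool space, growing the dump zvol is simple enough; the zvol name and the 32G figure here are purely illustrative, and re-running dumpadm -d afterward is probably needed so the kernel notices the new size.)

    zfs get volsize rpool/dump
    zfs set volsize=32G rpool/dump
    dumpadm -d /dev/zvol/dsk/rpool/dump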

PS: To see that the default is still a single-threaded crash dump, you need to chase through the code to dumphdr.h and the various DUMP_PLAT_*_MINCPU definitions, all of which are set to 0. Due to how the code is structured, this disables multi-threaded dumps entirely.
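(Or skip the source-chasing and ask a running kernel directly; if my reading of the code is right, a stock system will report 0 here.)

    echo 'dump_plat_mincpu/D' | mdb -k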

Sidebar: The theoretical controls for multi-threaded dumps

If you set dump_plat_mincpu to something above 0, then if you have 'sufficiently more' CPUs than this, you will get parallel bzip2 compression; below that you will get parallel lzjb. Since parallel compression is disabled by default in Illumos, this may or may not actually still work, even if you don't run into any actual bugs of the sort that caused it to be disabled in the first place. Note that bzip2 is not fast.

The actual threshold of 'enough' depends on the claimed maximum transfer size of your disks. For dumping to zvols, it appears that this maximum transfer size is always 128 KB, which uses a code path where the breakpoint between parallel lzjb and parallel bzip2 is just dump_plat_mincpu; if you have that many CPUs or more, you get bzip2. This implies that you may want to set dump_plat_mincpu to a nice high number so that you get parallel lzjb all the time.
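(If you wanted to experiment with this anyway, despite the parallel code being disabled and on its way out, the /etc/system setting would look something like the following; 1024 is just an arbitrary 'bigger than any CPU count we have' number.)

    * force dump_plat_mincpu well above our CPU count so that the zvol
    * code path would pick parallel lzjb rather than bzip2
    set dump_plat_mincpu = 1024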
