Some notes about kernel crash dumps in Illumos
On our OmniOS servers, we should probably turn off writing kernel crash dumps on panics. It takes far too long, it usually doesn't succeed, and even if it did the information isn't useful to us in practice (we're using a very outdated version & we're frozen on it).
We're already only saving kernel pages, which is the minimum setting in dumpadm, but our fileservers still take at least an hour+ to write dumps. On a panic, we need them back in service in minutes (as few as possible).
The resulting Twitter discussion got me to take a look into the current state of the code for this in Illumos, and I wound up discovering some potentially interesting things. First off, dump settings are not auto-loaded or auto-saved by the kernel in some magical way; instead dumpadm saves all of your configuration settings in /etc/dumpadm.conf and then sets them during boot through svc:/system/dumpadm:default. The dumpadm manual page will tell you all of this if you read its description of the -u argument.
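For illustration, the moving parts look roughly like this (these are stock dumpadm options; the exact paths and output on your system may differ):

    # the configuration file that dumpadm reads and writes
    cat /etc/dumpadm.conf
    # push the settings from that file into the running kernel; this is
    # essentially what svc:/system/dumpadm:default does during boot
    dumpadm -u
    # restrict dumps to kernel pages only (the minimum content setting)
    dumpadm -c kernel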
Next, the -z argument to dumpadm is inadequately described in the manual page. The 'crash dump compression' it's talking about is whether savecore will write compressed dumps; it has nothing to do with how the kernel writes out the crash dump to your configured crash device. In fact, dumpadm has no direct control over basically any of that process; if you want to change things about the kernel dump process, you need to set kernel variables through /etc/system (or 'mdb -k').
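As an illustration of the mechanics (using dump_plat_mincpu, one of the tunables that comes up later; the specific value here is arbitrary and not a recommendation):

    # read the current value from the live kernel
    echo 'dump_plat_mincpu/D' | mdb -k
    # change it in the live kernel; -w enables writes, and 0t16 is
    # decimal 16 in mdb notation
    echo 'dump_plat_mincpu/W 0t16' | mdb -kw

    * or persistently via /etc/system, which takes effect at the next boot
    set dump_plat_mincpu = 16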
The kernel writes crash dumps in multiple steps. If your console shows the message 'dumping to <something>, offset NNN, contents: <...>', then you've at least reached the start of writing out the crash dump. If you see updates of the form 'dumping: MM:SS N% done', the kernel has reached the main writeout loop and is writing out pages of memory, perhaps excessively slowly. As far as I can tell from the code, crash dumps don't abort when they run out of space on the dump device; they keep processing things and just throw all of the work away.
As it turns out, the kernel always compresses memory as it writes it out, although this is obscured by the current state of the code. The short version is that unless you set non-default system parameters that you probably don't want to, current Illumos systems will always do single-threaded lzjb compression of memory (where the CPU that is writing out the crash dump also compresses the buffers before writing). Although you can change things to do dumps with multi-threaded compression using either lzjb or bzip2, you probably don't want to, because the multi-threaded code has been deliberately disabled and is going to be removed sometime. See Illumos issue 3314 and the related Illumos issue 1369.
(As a corollary of kernel panic dumps always being compressed with at least lzjb, you probably should not have compression turned on for your dump zvol (which I believe is the default).)
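Checking and changing this is ordinary zfs property manipulation (rpool/dump is just the conventional name for the dump zvol; substitute your own):

    # see whether compression is enabled on the dump zvol
    zfs get compression rpool/dump
    # turn it off, since the kernel is already lzjb-compressing the data
    zfs set compression=off rpool/dump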
I'm far from convinced that single-threaded lzjb compression can reach and sustain the full write speed of our system SSDs on our relatively slow CPUs, especially during a crash dump (when I believe there's relatively little write buffering going on), although for obvious reasons it's hard to test. People with NVMe drives might hit this problem even with modern, fast hardware.
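As a back-of-the-envelope illustration (every number here is an assumption for the sake of the arithmetic, not a measurement): suppose single-threaded lzjb plus the write path manages an effective 30 MB/s on a slow CPU, and suppose a ZFS fileserver with 128 GB of RAM has something like 100 GB of kernel pages to dump (the ARC is kernel memory). Then:

    100 GB ≈ 102,400 MB
    102,400 MB / 30 MB/s ≈ 3,413 seconds ≈ 57 minutes

which is at least in the right ballpark for the hour-plus dump times we see, and nowhere near what even a modest SSD could absorb if the data came at it fast enough.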
If you examine the source of dumpsubr.c, you'll discover a tempting variable dump_timeout that's set to 120 (seconds) and described as 'timeout for dumping pages'. This comment is a little bit misleading, as usual; what it really means is 'timeout for dumping a single set of pages'. There is no limit on how long the kernel is willing to keep writing out pages for, provided that it makes enough progress within 120 seconds. In our case this is unfortunate, since we'd be willing to spend a few minutes to gather a bit of crash information but not anything like what a kernel dump appears to take on our machines.
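You can inspect the value on a live system if you're curious (purely illustrative):

    # the per-'set of pages' timeout, in seconds
    echo 'dump_timeout/D' | mdb -k

As far as I can tell there is no corresponding tunable that caps the total time a dump is allowed to take, which is the limit we'd actually want.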
(The good news is that if you run out of space on your dump device, the dump code is at least smart enough to not spend any more time trying to compress pages; it just throws them away right away. You might run out of space because you're taking a panic dump from a ZFS fileserver with 128 GB of RAM and putting it on an 8 GB dump zvol that is part of an rpool that lives on 80 GB SSDs, where a full-sized kernel dump almost certainly can't even be saved by savecore.)
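If you want to see how much room your dumps actually have (dataset names here are the conventional ones; adjust for your pool layout):

    # current dump device and savecore settings
    dumpadm
    # the size of the dump zvol itself
    zfs list -o name,volsize rpool/dump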
PS: To see that the default is still a single-threaded crash dump, you need to chase through the code to dumphdr.h and the various DUMP_PLAT_*_MINCPU definitions, all of which are set to 0. Due to how the code is structured, this disables multi-threaded dumps entirely.
Sidebar: The theoretical controls for multi-threaded dumps
If you set dump_plat_mincpu to something above 0, then if you have 'sufficiently more' CPUs than this, you will get parallel bzip2 compression; below that you will get parallel lzjb. Since parallel compression is disabled by default in Illumos, this may or may not actually still work, even if you don't run into any actual bugs of the sort that caused it to be disabled in the first place. Note that bzip2 is not fast.
The actual threshold of 'enough' depends on the claimed maximum transfer size of your disks. For dumping to zvols, it appears that this maximum transfer size is always 128 KB, which uses a code path where the breakpoint between parallel lzjb and parallel bzip2 is just dump_plat_mincpu; if you have that many CPUs or more, you get bzip2. This implies that you may want to set dump_plat_mincpu to a nice high number so that you get parallel lzjb all the time.
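If you did decide to experiment with this despite the caveats above, the setting would look something like the following in /etc/system (the specific number is arbitrary; it just needs to be comfortably larger than any CPU count you'll ever have, so the bzip2 path is never chosen):

    * keep the parallel-compression choice on the lzjb side of the line
    set dump_plat_mincpu = 4096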