Some views on more flexible (Illumos) kernel crash dumps
In my notes about Illumos kernel crash dumps, I mentioned that we've now turned them off on our OmniOS fileservers. One of the reasons for this is that we're running an unsupported version of OmniOS, including the kernel. But even if we were running the latest OmniOS CE and had commercial support, we'd do the same thing (at least by default, outside of special circumstances). The core problem is that our needs conflict with what Illumos crash dumps want to give us right now.
The current implementation of kernel crash dumps basically prioritizes capturing complete information. This shows up in various ways in the implementation, starting with how it assumes that if crash dumps are configured at all, you have set up enough disk space to hold the full crash dump level you've set in dumpadm. Given that assumption, it's sensible to not bother checking in advance whether the dump will fit, and to treat a failure to fit as an unusual situation that isn't worth handling specially. Another one is that there is no overall time limit on how long the crash dump will run, which is perfectly sensible if the most important thing is to capture the crash dump for diagnosis.
But, well, the most important thing is not always to capture complete diagnostic information. Sometimes you need to get things back into service before too long, so what you really want is to capture as much information as possible while still returning to service in a certain amount of time. Sometimes you only have so much disk space available for crash dumps, and you would like to capture whatever information can fit in that disk space, and if not everything fits it would be nice if the most important things were definitely captured.
All of this makes me wish that Illumos kernel crash dumps wrote certain critical information immediately, at the start of the crash dump, and then progressively extended the information in the crash dump until they either ran out of space or ran out of time. What do I consider critical? My first approximation would be, in order, the kernel panic message, the recent kernel messages, the kernel stack of the panicking kernel process, and the kernel stacks of all processes. Probably you'd also want anything recent from the kernel fault manager.
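To make the idea concrete, here is a toy sketch in Python of "write the important things first, then keep going until you hit a space or time budget". This is not how the Illumos dump code works or is structured; the function name, the section names, and the sizes are all made up for illustration, and "writing" is reduced to counting bytes.

```python
import time


def write_progressive_dump(sections, space_limit, time_limit, clock=time.monotonic):
    """Write (name, data) sections in priority order until space or time runs out.

    Hypothetical helper for illustration only; returns (names_written, bytes_used).
    """
    start = clock()
    written = []
    used = 0
    for name, data in sections:
        if clock() - start >= time_limit:
            break  # out of time: keep what we have and get back into service
        if used + len(data) > space_limit:
            break  # out of space: everything written so far is still usable
        used += len(data)  # stand-in for actually writing to the dump device
        written.append(name)
    return written, used


# Most important first, bulk memory pages last (all sizes are invented).
SECTIONS = [
    ("panic message", b"p" * 10),
    ("recent kernel messages", b"m" * 20),
    ("panicking thread stack", b"s" * 30),
    ("all kernel stacks", b"k" * 100),
    ("memory pages", b"x" * 1000),
]
```

With a space limit of 200 bytes, this keeps the first four sections and stops before the memory pages, which is the behavior I'd want: a dump that is incomplete but still has the panic and the stacks in it. A real version would presumably also want to write a partial final section instead of stopping outright.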
The current Illumos crash dump code does have an order for what gets written out, and it does put some of this stuff into the dump header, but as far as I can tell the dump header only gets written at the end. It's possible that you could create a version of this incremental dump approach by simply writing out the incomplete dump header every so often (appropriately marked with how it's incomplete). There's also a 'dump summary' that gets written at the end that appears to contain a bunch of this information; perhaps a preliminary copy could be written at the start of the dump, then overwritten at the end if the dump is complete. Generally what seems to take all the time (and space) with our dumps is the main page writing stuff, not a bunch of preliminary stuff, so I think Illumos could definitely write at least one chunk of useful information before it bogs down. And if this needs extra space in the dump device, I would gladly sacrifice a few megabytes to have such useful information always present.
(It appears that the Illumos kernel already keeps a lot of ZFS data memory out of kernel crash dumps, both for the ARC and for in-flight ZFS IO, so I'm not sure what memory the kernel is spending all of its time dumping in our case. Possibly we have a lot of ZFS metadata, which apparently does go into crash dumps. See the comments about crash dumps in abd.c. For the 'dump summary', see the dump_messages functions in dumpsubr.c.)
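The 'preliminary summary, overwritten at the end' idea can also be sketched as a toy in Python. Again, this is only an illustration under my own assumptions: the fixed-size summary slot, the one-byte complete/preliminary flag, and all of the function names are invented, and an in-memory buffer stands in for the dump device.

```python
import io

SUMMARY_SLOT = 64  # hypothetical fixed-size region reserved at the start of the device


def pack_summary(text, complete):
    """Encode a summary with a one-byte flag: b'C' for complete, b'P' for preliminary."""
    flag = b"C" if complete else b"P"
    body = text.encode()[: SUMMARY_SLOT - 1]
    return (flag + body).ljust(SUMMARY_SLOT, b"\0")


def write_dump(dev, summary, pages, fail_after=None):
    """Write a preliminary summary, then the pages, then mark the summary complete.

    fail_after simulates the dump being cut short after that many pages.
    """
    dev.seek(0)
    dev.write(pack_summary(summary, complete=False))
    for i, page in enumerate(pages):
        if fail_after is not None and i >= fail_after:
            return False  # interrupted: the preliminary summary stays on the device
        dev.write(page)
    dev.seek(0)
    dev.write(pack_summary(summary, complete=True))
    return True


def read_summary(dev):
    """Return (is_complete, summary_text) from the start of the device."""
    dev.seek(0)
    slot = dev.read(SUMMARY_SLOT)
    return slot[:1] == b"C", slot[1:].rstrip(b"\0").decode(errors="replace")
```

The point of the sketch is that an interrupted dump still has a readable summary at a known location, clearly marked as coming from an incomplete dump, at the cost of a small fixed amount of space.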
PS: All of this is sort of wishing from the sidelines, since our future is not with Illumos.