What I'd like in Illumos/OmniOS: progressive crash dumps

November 21, 2016

One of our fileservers had a kernel panic today as we were adding some more multipathed iSCSI disks to it. This was unfortunate but not fatal; we caught the panic almost right away and fixed things relatively fast. Which is unfortunate in its own way and brings me to my wish.

You see, this was perhaps our most important and core fileserver. Everything depends on it and everything eventually goes out to lunch if and while it's down. And in our experience, in our environment, making an OmniOS crash dump takes ages and may not succeed at the end of that (we've sat through over half an hour of the process only to have it fail). There was absolutely no way we could afford to let this fileserver sit there for minutes or tens of minutes to see if maybe it could successfully write out a crash dump this time around, so we forced a power cycle on it in order to get it back into service. The result is that we got nothing out of the panic; we don't even have the stack backtrace (it doesn't seem to have gotten written anywhere durable).

So now what I wish OmniOS had is what I'll call progressive crash dumps. A progressive crash dump would proceed in layers of detail. First it would write out very basic details (like the panic stack dump or the kernel message log) in a compact form, right away; this should hopefully take almost no time. After that had been pushed to the dump device, it would write another layer with more information that takes some more time (maybe a complete collection of various core kernel data tables, like all kernel stacks and the process table). As time went on it would write out more and more data with more and more layers of detail; if you had enough time, it would end up writing out the full crash dump that you get today.

(Dumpadm's -c argument doesn't have enough granularity to help, especially on fileservers where almost all the memory is already being consumed by kernel pages instead of user pages.)

Progressive crash dumps would insure that even if you had to reboot the machine early you would get some information; the longer you could afford to wait, the more information you'd get. And if the overall dump winded up failing or hanging, at least you would recover however many layers could be written intact (and hopefully the very basic layers would be good, simply because they are basic and so should be easy and reliable to dump).

(This is a complete blue sky wish. It would likely take a completely new dump format, new kernel dump code, and significant changes to get all of the dump tools to deal with it, all of which adds up to a lot of new code in an area that has to be extremely reliable under extreme conditions and that most people don't use very much anyways. Even if we had the money to help fund this sort of thing, there would be much higher priority Illumos things we'd care about, like our 10G Ethernet issues.)

Written on 21 November 2016.
« I've wound up feeling tentatively enthusiastic about Python 3
Link: RFC 6919: Further Key Words for Use in RFCs to Indicate Requirement Levels »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Mon Nov 21 21:42:32 2016
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.