Consider setting your Linux servers to reboot on kernel problems

January 23, 2019

As I sort of mentioned when I wrote about things you can do to make your Linux servers reboot on kernel problems, the Linux kernel normally doesn't reboot if it hits kernel problems. Problems like OOPSes and RCU stalls generally kill some processes and try to continue on; more serious issues cause panics, which freeze the machine entirely.

If your goal is to debug kernel problems, this is great because it preserves as much of the evidence as possible (although you probably also want things like a serial console or at least netconsole, to capture those kernel crash messages). If your goal is to have your servers running, it is perhaps not as attractive; you may quite reasonably care more about returning them to service as soon as possible than trying to collect evidence for a bug report to your distribution.

(Even if you do care about collecting information for a bug report, there are probably better ways than letting the machine sit there. Future kernels will have a kernel sysctl called panic_print to let you dump out as much information in the initial report as possible, which you can preserve through your console server system, and in general there is Kdump. In theory netconsole might also let you capture the initial messages, but I don't trust it half as much as I do a serial console.)
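
As a rough sketch of what setting panic_print could look like on a kernel that has it: the sysctl is a bitmask, and the bit meanings in the comments below are taken from the kernel's sysctl documentation as best understood here, so treat them as assumptions and check the documentation for your kernel version. The value 3 is only an illustration, not a recommendation.

```conf
# Assumed bit meanings (verify against your kernel's documentation):
#   1 = print information on all tasks
#   2 = print system memory information
# 3 asks for both when the kernel panics.
kernel.panic_print = 3
```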

My view is that most people today are in the second situation, where there's very little you're going to do with a crashed server except reboot or power cycle it to get it back into service. If this is so, you might as well cut out the manual work by configuring your servers to reboot on kernel problems, at least as their initial default settings. You do want to wait just a little bit after an OOPS to reboot, in the hopes that maybe the kernel OOPS message will be successfully written to disk or transmitted off to your central syslog server, but that's it; after at most 60 seconds or so, you should reboot.

(If you find that you have a machine that is regularly OOPSing and you want to diagnose it in a more hands-on way, you can change the settings on it as needed.)

We have traditionally not thought about this and so left our servers in the standard default 'lock up on kernel problems' configuration, which has gone okay because kernel problems are very rare in the first place. Leaving things as they are would still be the least-effort approach, but changing our standard system setup to enable reboots on panics would not be much effort (it's three sysctls in one /etc/sysctl.d file), and it's probably worth it, just in case.
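
For illustration, such a file might look something like the following minimal sketch. The file name is made up, and the choice of these particular three sysctls is an assumption (a reboot delay after a panic, plus panicking on OOPSes and on RCU stalls, the two kinds of problems mentioned at the start). The 10 second delay is also only an example; anything up to a minute or so fits the reasoning above.

```conf
# /etc/sysctl.d/90-reboot-on-panic.conf  (hypothetical file name)

# Reboot this many seconds after a panic instead of freezing.
# The kernel default of 0 means never reboot.
kernel.panic = 10

# Turn an OOPS into a panic rather than trying to continue on.
kernel.panic_on_oops = 1

# Turn an RCU stall into a panic as well (assumed to be the third setting).
kernel.panic_on_rcu_stall = 1
```

Running 'sysctl --system' as root should apply the file without waiting for the next boot.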

(This is the kind of change that you hope not to need, but if you do wind up needing it, you may be extremely thankful that you put it into place.)

PS: Not automatically rebooting on kernel panics is pretty harmless for Linux machines that are used interactively, because if the machine has problems there's a person right there to immediately force a reboot. It's only unattended machines such as servers where this really comes up. For desktop and laptop focused distributions it probably makes troubleshooting somewhat easier, because at least you can ask someone who's having crash problems to take a picture of the kernel errors with their phone.

Comments on this page:

From at 2019-01-24 01:05:15:

Set up pstore.

By John at 2019-01-26 17:37:03:

I recently had a general protection fault that made the whole system unusable but still somehow "half-alive". At the time I wished it had rebooted on its own, so now I would like to configure a reboot on panic. For this specific case of a general protection fault, would it be covered by kernel.panic_on_oops, or is some other sysctl setting more relevant to a GPF?

By cks at 2019-01-26 19:38:46:

As best I can trace the kernel code involved, a general protection fault is considered a form of OOPS and so should be covered by panic_on_oops (assuming you have panic set as well, so that the panic forces a reboot).

(The specific call chain on x86 is that the function that handles GPFs winds up calling die() if the GPF happens in kernel code, and die() ends up calling oops_end(), which checks panic_on_oops to go panic things.)

By John at 2019-01-27 16:00:45:

Thank you for your interesting feedback and further comments. So if I read your other article, "Things you can do to make your Linux servers reboot on kernel problems", correctly, I should set the following two kernel settings:

kernel.panic = 10
kernel.panic_on_oops = 1

Written on 23 January 2019.
Last modified: Wed Jan 23 23:24:23 2019