Consider setting your Linux servers to reboot on kernel problems

January 23, 2019

As I sort of mentioned when I wrote about things you can do to make your Linux servers reboot on kernel problems, the Linux kernel normally doesn't reboot if it hits kernel problems. Problems like OOPSes and RCU stalls generally kill some processes and try to continue on; more serious issues cause panics, which freeze the machine entirely.

If your goal is to debug kernel problems, this is great because it preserves as much of the evidence as possible (although you probably also want things like a serial console or at least netconsole, to capture those kernel crash messages). If your goal is to have your servers running, it is perhaps not as attractive; you may quite reasonably care more about returning them to service as soon as possible than trying to collect evidence for a bug report to your distribution.

(Even if you do care about collecting information for a bug report, there are probably better ways than letting the machine sit there. Future kernels will have a kernel sysctl called panic_print to let you dump out as much information in the initial report as possible, which you can preserve through your console server system, and in general there is Kdump. In theory netconsole might also let you capture the initial messages, but I don't trust it half as much as I do a serial console.)

My view is that most people today are in the second situation, where there's very little you're going to do with a crashed server except reboot or power cycle it to get it back into service. If this is so, you might as well cut out the manual work by configuring your servers to reboot on kernel problems, at least as their initial default settings. You do want to wait just a little bit after an OOPS to reboot, in the hopes that maybe the kernel OOPS message will be successfully written to disk or transmitted off to your central syslog server, but that's it; after at most 60 seconds or so, you should reboot.

(If you find that you have a machine that is regularly OOPSing and you want to diagnose it in a more hands-on way, you can change the settings on it as needed.)
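If you want to see how a particular machine is currently set before changing anything, a minimal sketch along these lines may help. It assumes the usual mapping of sysctl names onto files under /proc/sys, and the read_sysctl helper is a name I made up for illustration. The two sysctls it prints are real, but which ones you care about is up to you.

```python
from pathlib import Path


def read_sysctl(key, root="/proc/sys"):
    """Return a sysctl's current value as a string, or None if it can't be read.

    Assumes the usual layout where 'kernel.panic' lives at
    /proc/sys/kernel/panic. 'root' is a parameter only so that this
    can be pointed at a test directory.
    """
    path = Path(root, *key.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        # Missing on this kernel, or not readable by this user.
        return None


if __name__ == "__main__":
    # kernel.panic is the number of seconds to wait before rebooting
    # after a panic; 0 means never reboot automatically.
    for key in ("kernel.panic", "kernel.panic_on_oops"):
        print(f"{key} = {read_sysctl(key)}")
```

A value of 0 for kernel.panic means the machine will sit there after a panic, which is the 'lock up' behavior described above.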

We have traditionally not thought about this and so left our servers in the standard default 'lock up on kernel problems' configuration, which has gone okay because kernel problems are very rare in the first place. Leaving things as they are would still be the least effort approach, but changing our standard system setup to enable reboots on panics would not be much effort (it's three sysctls in one /etc/sysctl.d file), and it's probably worth it, just in case.
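As a rough illustration of how small such a file is, here is a sketch that renders one. The specific choice of three sysctls is my assumption for the example, not a statement of which three we would actually use. kernel.panic, kernel.panic_on_oops and kernel.softlockup_panic are all real sysctls, but you should check which ones fit your kernels. The 60 second delay is simply the upper bound mentioned earlier, and the render_sysctl_conf helper and the file name in the comment are hypothetical.

```python
# Assumed settings for illustration only; pick the sysctls and
# values that suit your own machines.
REBOOT_ON_PROBLEMS = {
    # Reboot this many seconds after a panic (0 would mean never).
    "kernel.panic": 60,
    # Turn an OOPS into a panic instead of trying to continue.
    "kernel.panic_on_oops": 1,
    # Panic when the kernel detects a soft lockup.
    "kernel.softlockup_panic": 1,
}


def render_sysctl_conf(settings):
    """Render a dict of sysctl settings as the text of a sysctl.d file."""
    lines = ["# Reboot automatically on kernel problems (illustrative values)."]
    lines += [f"{key} = {value}" for key, value in settings.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Redirect this into something like /etc/sysctl.d/90-reboot-on-panic.conf
    # (a hypothetical file name).
    print(render_sysctl_conf(REBOOT_ON_PROBLEMS), end="")
```

Once the file is in place, 'sysctl --system' should load it without a reboot, and it will be applied automatically on later boots.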

(This is the kind of change that you hope not to need, but if you do wind up needing it, you may be extremely thankful that you put it into place.)

PS: Not automatically rebooting on kernel panics is pretty harmless for Linux machines that are used interactively, because if the machine has problems there's a person right there to immediately force a reboot. It's only unattended machines such as servers where this really comes up. For desktop and laptop focused distributions it probably makes troubleshooting somewhat easier, because at least you can ask someone who's having crash problems to take a picture of the kernel errors with their phone.

Last modified: Wed Jan 23 23:24:23 2019