Things you can do to make your Linux servers reboot on kernel problems

January 22, 2019

One of the Linux kernel's unusual behaviors is that it often doesn't reboot after it hits an internal problem, what is normally called a kernel panic. Sometimes this is a reasonable thing and sometimes this is not what you want and you'd like to change it. Fortunately Linux lets you more or less control this through kernel sysctl settings.

(The Linux kernel differentiates between things like OOPSes and RCU stalls, which it thinks it can maybe continue on from, and kernel panics, which immediately freeze the machine.)

What you need to do is twofold. First, you need to make it so that the kernel reboots when it considers itself to have panicked. This is set through the kernel.panic sysctl, which is the number of seconds to wait after a panic before rebooting. Some sources recommend setting this to 60 seconds under various circumstances, but in our limited experience that didn't do anything for us except delay reboots, so we now use 10 seconds. Setting kernel.panic to 0 restores the default state, where panics simply hang the machine.

Second, you need to arrange for various kernel problems to trigger panics. The most important thing here is usually for kernel OOPS messages or BUG messages to trigger panics; the kernel considers these nominally recoverable, except that they mostly aren't and will often leave your machine effectively hung. Panicking on OOPS is turned on by setting kernel.panic_on_oops to 1.
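As a rough illustration of these two settings together, here is a minimal Python sketch that renders them in the 'name = value' form that sysctl configuration files use. The function name render_sysctl and the dictionary name are made up for this example; the values are the ones discussed above.

```python
def render_sysctl(settings):
    """Render a dict of sysctl names and integer values as 'name = value' lines."""
    return "".join(f"{name} = {value}\n" for name, value in settings.items())


# The values discussed above: reboot 10 seconds after a panic,
# and turn a kernel oops into a panic.
REBOOT_ON_TROUBLE = {
    "kernel.panic": 10,
    "kernel.panic_on_oops": 1,
}

if __name__ == "__main__":
    print(render_sysctl(REBOOT_ON_TROUBLE), end="")
```

On many distributions the output could be saved as a file under /etc/sysctl.d/ and loaded with 'sysctl --system', but where such files live and how they are loaded depends on your system, so treat that as an assumption to check.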

Another likely important sign of trouble is RCU stalls; you can panic on these with kernel.panic_on_rcu_stall. Note that I'm biased about RCU stalls. The kernel documentation in sysctl/kernel.txt mentions some other ones as well, currently panic_on_io_nmi, panic_on_stackoverflow, panic_on_unrecovered_nmi, and panic_on_warn. Of these, I would definitely be wary about turning on panic_on_warn; our systems appear to see a certain number of them in reasonably routine operation.
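Not every kernel provides every one of these settings, so it can be useful to see what a given machine actually has. The following is a sketch, assuming the usual mapping of kernel.* sysctls to files under /proc/sys/kernel; the function name read_panic_settings is invented for this example.

```python
import os

# Names taken from the text above; a given kernel may lack some of them.
PANIC_ON_SETTINGS = [
    "panic_on_oops",
    "panic_on_rcu_stall",
    "panic_on_io_nmi",
    "panic_on_stackoverflow",
    "panic_on_unrecovered_nmi",
    "panic_on_warn",
]


def read_panic_settings(sysctl_dir="/proc/sys/kernel"):
    """Return {name: int value}, with None for settings this kernel lacks."""
    result = {}
    for name in PANIC_ON_SETTINGS:
        path = os.path.join(sysctl_dir, name)
        try:
            with open(path) as f:
                result[name] = int(f.read().strip())
        except (OSError, ValueError):
            # Missing file, unreadable file, or unexpected contents.
            result[name] = None
    return result


if __name__ == "__main__":
    for name, value in read_panic_settings().items():
        shown = "not available" if value is None else value
        print(f"kernel.{name}: {shown}")
```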

(You can detect these warnings by searching your kernel logs for the text 'WARNING: CPU: <..> PID: <...>'. One of our WARNs was for a network device transmit queue timeout, which recovered almost immediately. Rebooting the server due to this would have been entirely the wrong reaction in practice.)
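A search like this is easy to script. The sketch below looks for the 'WARNING: CPU: ... PID: ...' header line in whatever kernel log text it is fed on standard input; the function name find_warnings is made up here, and how you obtain the kernel log (a log file, 'journalctl -k', or your central syslog server) depends on your setup.

```python
import re
import sys

# Matches the header line of a kernel WARN report. The surrounding
# stack trace lines are deliberately not matched.
WARN_RE = re.compile(r"WARNING: CPU: (\d+) PID: (\d+)")


def find_warnings(lines):
    """Yield (cpu, pid, line) for each kernel WARN header line found."""
    for line in lines:
        m = WARN_RE.search(line)
        if m:
            yield int(m.group(1)), int(m.group(2)), line.rstrip("\n")


if __name__ == "__main__":
    count = 0
    for cpu, pid, line in find_warnings(sys.stdin):
        count += 1
        print(line)
    print(f"{count} kernel warning(s) found", file=sys.stderr)
```

Counting how many of these show up over a few weeks of routine operation is one way to judge whether panic_on_warn would be safe for you.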

Note that you can turn on any or all of the various panic_on_* settings while still having kernel.panic set to 0. If you do this, you convert OOPSes, RCU stalls, or whatever into things that are guaranteed to hang the whole machine when they happen, instead of perhaps having it continue on in partial operating order. There are systems where this may be desirable behavior.
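The way these two settings interact can be written down as a small decision function. This is only an illustrative sketch (the name oops_outcome is invented here), not anything the kernel itself exposes. The negative kernel.panic case comes from the kernel's sysctl documentation rather than from anything described above.

```python
def oops_outcome(panic_timeout, panic_on_oops):
    """Describe what happens after an oops for the given sysctl values.

    panic_timeout is the kernel.panic value and panic_on_oops is the
    kernel.panic_on_oops value (0 or 1).
    """
    if not panic_on_oops:
        return "kernel tries to continue, perhaps in partial operating order"
    if panic_timeout == 0:
        return "machine panics and hangs until it is reset"
    if panic_timeout < 0:
        # Per the kernel documentation, a negative value reboots at once.
        return "machine panics and reboots immediately"
    return f"machine panics and reboots after {panic_timeout} seconds"


if __name__ == "__main__":
    for timeout, on_oops in [(0, 0), (0, 1), (10, 1)]:
        print(f"kernel.panic={timeout} kernel.panic_on_oops={on_oops}: "
              f"{oops_outcome(timeout, on_oops)}")
```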

PS: If you want to be as sure as possible that the machine reboots after hitting problems, you probably want to enable a hardware watchdog as well if you can. The kernel panic() function tries hard to reboot the machine, but things can probably go wrong. Unfortunately not all machines have hardware watchdogs available, although many Intel ones do.
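A quick way to see whether a machine exposes any watchdog device at all is to look for the device nodes. This sketch assumes the usual Linux naming of /dev/watchdog and /dev/watchdogN; the function name watchdog_devices is made up for this example. Note that a device node existing doesn't tell you whether it is backed by real hardware or by a software watchdog driver.

```python
import glob
import os


def watchdog_devices(dev_dir="/dev"):
    """Return a sorted list of watchdog device nodes found under dev_dir."""
    return sorted(glob.glob(os.path.join(dev_dir, "watchdog*")))


if __name__ == "__main__":
    devices = watchdog_devices()
    if devices:
        print("watchdog devices:", ", ".join(devices))
    else:
        print("no watchdog devices found")
```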

Sidebar: The problem with kernel OOPSes

When a kernel oops happens, the kernel kills one or more processes. These processes were generally in kernel code at the time (that's usually what generated the oops), and they may have been holding locks or have been in the middle of modifying data structures, submitting IO operations, or doing other kernel things. However, the kernel has no idea what exactly needs to be done to safely release these locks, revert the data structure modifications, and so on; instead it just drops everything on the floor and hopes for the best.

Sometimes this works out, or at least the damage done is relatively contained (perhaps only access to one mounted filesystem starts hanging because of a lock held by the now-dead process that will never be unlocked). Often it does not, and more or less everything grinds to a more or less immediate halt. If you're lucky, enough of the system survives long enough for the kernel oops message to be written to disk or sent out to your central syslog server.


Comments on this page:

From 78.58.206.110 at 2019-01-22 12:52:12:

Do you recommend systemd's built-in /dev/watchdog support, or some other mechanism?

> PS: If you want to be as sure as possible that the machine reboots after hitting problems, you probably want to enable a hardware watchdog as well if you can.

It's also worth noting that QEMU and Xen support exposing a virtual watchdog device, so if one's VMs get wedged they too can be reset after a given amount of time.

Written on 22 January 2019.

Last modified: Tue Jan 22 00:44:25 2019