2019-01-22
Things you can do to make your Linux servers reboot on kernel problems
One of the Linux kernel's unusual behaviors is that it often doesn't reboot after it hits an internal problem, what is normally called a kernel panic. Sometimes this is a reasonable thing and sometimes this is not what you want and you'd like to change it. Fortunately Linux lets you more or less control this through kernel sysctl settings.
(The Linux kernel differentiates between things like OOPSes and RCU stalls, which it thinks it can maybe continue on from, and kernel panics, which immediately freeze the machine.)
What you need to do is twofold. First, you need to make it so that
the kernel reboots when it considers itself to have panicked. This
is set through the kernel.panic sysctl, which is a number of
seconds. Some sources recommend setting this to 60 seconds under
various circumstances, but in our limited experience we haven't found
that to do anything for us except delay reboots, so we now use 10
seconds. Setting kernel.panic to 0 restores the default state,
where panics simply hang the machine.
Second, you need to arrange for various kernel problems to trigger
panics. The most important thing here is usually for kernel OOPS
messages or BUG messages to trigger panics; the kernel considers
these nominally recoverable, except that they mostly aren't and
will often leave your machine effectively hung. Panicking on OOPS
is turned on by setting kernel.panic_on_oops to 1.
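As a minimal sketch of the two settings so far, the following renders a sysctl.d style fragment using the values from the text (a 10 second reboot delay and panic on OOPS). The drop-in file name and the helper name render_sysctl_conf are made up for illustration; nothing here writes to the system.

```python
# Hypothetical drop-in path; any file name under /etc/sysctl.d/ would do.
SYSCTL_DROPIN = "/etc/sysctl.d/90-panic-reboot.conf"

# Values taken from the text: reboot 10 seconds after a panic, panic on OOPS.
PANIC_SETTINGS = {
    "kernel.panic": 10,
    "kernel.panic_on_oops": 1,
}


def render_sysctl_conf(settings):
    """Return sysctl.d-style 'key = value' lines for the given settings."""
    lines = [f"{key} = {value}" for key, value in settings.items()]
    return "\n".join(lines) + "\n"
```

Writing the rendered text to a file such as SYSCTL_DROPIN and then running 'sysctl --system' as root should apply it on most current distributions, but check how your own systems manage sysctl settings first.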
Another likely important sign of trouble is RCU stalls; you can
panic on these with kernel.panic_on_rcu_stall. Note that I'm
biased about RCU stalls. The kernel documentation in
sysctl/kernel.txt mentions some other ones as well, currently
panic_on_io_nmi, panic_on_stackoverflow, panic_on_unrecovered_nmi,
and panic_on_warn. Of these, I would definitely be wary about
turning on panic_on_warn; our systems appear to see a certain number
of kernel warnings in reasonably routine operation.
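Before changing anything, it can help to see which of these settings a machine actually has and what they are currently set to. This is a rough sketch that assumes the settings appear as files under /proc/sys/kernel (the usual mapping for kernel.* sysctls); the function name and the base parameter are my own, and settings your kernel doesn't provide are simply skipped.

```python
from pathlib import Path

# Settings named in the text; a given kernel may not provide all of them.
PANIC_ON_SETTINGS = [
    "panic_on_oops",
    "panic_on_rcu_stall",
    "panic_on_io_nmi",
    "panic_on_stackoverflow",
    "panic_on_unrecovered_nmi",
    "panic_on_warn",
]


def read_panic_settings(base="/proc/sys/kernel"):
    """Return {name: int value} for each panic_on_* file that exists."""
    found = {}
    for name in PANIC_ON_SETTINGS:
        path = Path(base) / name
        try:
            found[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            # Missing, unreadable, or not a plain integer: skip it.
            continue
    return found
```

Calling read_panic_settings() with no arguments reads the live values; the base argument exists only so the function can be pointed at a test directory.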
(You can detect these warnings by searching your kernel logs for
the text 'WARNING: CPU: <..> PID: <...>'. One of our WARNs was
for a network device transmit queue timeout, which recovered almost
immediately. Rebooting the server due to this would have been
entirely the wrong reaction in practice.)
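The log search described above could be sketched like this. The regular expression assumes that the CPU and PID fields are plain decimal numbers, and the function name is invented for this example; how you get the log lines (a kernel log file, saved journal output, or a central syslog server) depends on your setup.

```python
import re

# Matches the 'WARNING: CPU: <..> PID: <...>' text described above.
WARN_RE = re.compile(r"WARNING: CPU: \d+ PID: \d+")


def find_kernel_warnings(log_lines):
    """Return the log lines that contain a kernel WARNING header."""
    return [line.rstrip("\n") for line in log_lines if WARN_RE.search(line)]
```

Since an open file iterates over its lines, find_kernel_warnings(open(path)) works directly on a saved log. Counting the matches over a few weeks gives a rough idea of how often panic_on_warn would have rebooted the machine.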
Note that you can turn on any or all of the various panic_on_*
settings while still having kernel.panic set to 0. If you do
this, you convert OOPSes, RCU stalls, or whatever into things that
are guaranteed to hang the whole machine when they happen, instead
of perhaps having it continue on in partial operating order. There
are systems where this may be desirable behavior.
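The way the two kinds of settings combine can be written down as a small decision function. This is only an illustration of the logic described above, with invented names. The negative-value case is not from this entry; it is my reading of the kernel's sysctl documentation, which says a negative kernel.panic reboots immediately.

```python
def describe_outcome(panic_timeout, panics_on_problem):
    """Describe what a kernel problem (e.g. an OOPS) leads to.

    panic_timeout     -- value of kernel.panic
    panics_on_problem -- True if the matching panic_on_* setting is 1
    """
    if not panics_on_problem:
        return "no panic; the kernel tries to continue"
    if panic_timeout == 0:
        return "panic; machine hangs"
    if panic_timeout < 0:
        # Assumption from the kernel's sysctl documentation, not the text.
        return "panic; immediate reboot"
    return f"panic; reboot after {panic_timeout} seconds"
```

For example, describe_outcome(0, True) is the 'guaranteed hang' combination, while describe_outcome(10, True) is the setup we use for OOPSes.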
PS: If you want to be as sure as possible that the machine reboots
after hitting problems, you probably want to enable a hardware
watchdog as well if you can. The kernel panic() function tries
hard to reboot the machine, but things can probably go wrong.
Unfortunately not all machines have hardware watchdogs available,
although many Intel ones do.
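To show roughly how a hardware watchdog is kept from firing, here is a bare-bones sketch of the standard Linux /dev/watchdog interface: a process writes to the device periodically, and if the writes stop the hardware resets the machine. The function name, the 10 second interval, and the count parameter are assumptions for illustration; the interval has to be shorter than your watchdog's timeout, and in practice you would more likely let systemd or a dedicated watchdog daemon do this.

```python
import time


def pet_watchdog(device="/dev/watchdog", interval=10.0, count=None):
    """Write to the watchdog device every `interval` seconds.

    With count=None this loops forever; if the process (or the whole
    machine) stops, the writes stop and the hardware resets the machine.
    """
    # Unbuffered so every write reaches the driver immediately.
    with open(device, "wb", buffering=0) as wd:
        done = 0
        while count is None or done < count:
            wd.write(b"\0")
            done += 1
            time.sleep(interval)
        # 'Magic close': many drivers disarm the timer if the last byte
        # written before close is 'V'. Not every driver supports this.
        wd.write(b"V")
```

Be careful when experimenting: on real hardware, opening the device typically arms the watchdog, so a script that exits without the magic close may reboot the machine.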
Sidebar: The problem with kernel OOPSes
When a kernel oops happens, the kernel kills one or more processes. These processes were generally in kernel code at the time (that's usually what generated the oops), and they may have been holding locks or have been in the middle of modifying data structures, submitting IO operations, or doing other kernel things. However, the kernel has no idea what exactly needs to be done to safely release these locks, revert the data structure modifications, and so on; instead it just drops everything on the floor and hopes for the best.
Sometimes this works out, or at least the damage done is relatively contained (perhaps only access to one mounted filesystem starts hanging because of a lock held by the now-dead process that will never be unlocked). Often it is not and more or less everything grinds to a more or less immediate halt. If you're lucky, enough of the system survives long enough for the kernel oops message to be written to disk or sent out to your central syslog server.