2019-01-23
Consider setting your Linux servers to reboot on kernel problems
As I sort of mentioned when I wrote about things you can do to make your Linux servers reboot on kernel problems, the Linux kernel normally doesn't reboot if it hits kernel problems. On problems like OOPSes and RCU stalls, the kernel generally kills some processes and tries to continue on; more serious issues cause panics, which freeze the machine entirely.
If your goal is to debug kernel problems, this is great because it preserves as much of the evidence as possible (although you probably also want things like a serial console or at least netconsole, to capture those kernel crash messages). If your goal is to have your servers running, it is perhaps not as attractive; you may quite reasonably care more about returning them to service as soon as possible than trying to collect evidence for a bug report to your distribution.
(Even if you do care about collecting information for a bug report, there are probably better ways than letting the machine sit there. Future kernels will have a kernel sysctl called panic_print to let you dump out as much information in the initial report as possible, which you can preserve through your console server system, and in general there is Kdump (also). In theory netconsole might also let you capture the initial messages, but I don't trust it half as much as I do a serial console.)
My view is that most people today are in the second situation, where there's very little you're going to do with a crashed server except reboot or power cycle it to get it back into service. If this is so, you might as well cut out the manual work by configuring your servers to reboot on kernel problems, at least as their initial default settings. You do want to wait just a little bit after an OOPS to reboot, in the hopes that maybe the kernel OOPS message will be successfully written to disk or transmitted off to your central syslog server, but that's it; after at most 60 seconds or so, you should reboot.
(If you find that you have a machine that is regularly OOPSing and you want to diagnose it in a more hands-on way, you can change the settings on it as needed.)
We have traditionally not thought about this and so left our servers in the standard default 'lock up on kernel problems' configuration, which has gone okay because kernel problems are very rare in the first place. Leaving things as they are would still be the least effort approach, but changing our standard system setup to enable reboots on panics would not be much effort (it's three sysctls in one /etc/sysctl.d file), and it's probably worth it, just in case.
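As a rough sketch, such a file might look something like the following. The file name is made up, and the specific third setting (panicking on RCU stalls) is just one plausible choice; the first two are the core ones for rebooting after panics and OOPSes:

# /etc/sysctl.d/99-reboot-on-panic.conf (illustrative name)
# Reboot 60 seconds after a kernel panic instead of sitting there frozen.
kernel.panic = 60
# Turn kernel OOPSes into panics, and thus (with the setting above) into reboots.
kernel.panic_on_oops = 1
# One plausible additional setting: treat RCU stalls as panics too.
kernel.panic_on_rcu_stall = 1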
(This is the kind of change that you hope not to need, but if you do wind up needing it, you may be extremely thankful that you put it into place.)
PS: Not automatically rebooting on kernel panics is pretty harmless for Linux machines that are used interactively, because if the machine has problems there's a person right there to immediately force a reboot. It's only on unattended machines such as servers that this really comes up. For desktop- and laptop-focused distributions, not rebooting automatically probably makes troubleshooting somewhat easier, because at least you can ask someone who's having crash problems to take a picture of the kernel errors with their phone.
A little surprise with Prometheus scrape intervals, timeouts, and alerts
Prometheus pulls metrics from metric sources or, to put it in Prometheus terms, scrapes targets. Every scrape configuration and thus every target has a scrape interval and a scrape timeout as part of its settings; these can be specified explicitly or inherited from global values. In a perfect world where scraping targets either completes or fails in zero time, this results in simple timing; a target is scraped at time T, then T + interval, then T + interval + interval, and so on. However, the real world is not simple and scraping a target can take a non-zero amount of time, possibly quite a lot if you time out. You might sensibly wonder if the next scrape is pushed back by the non-zero scrape time.
The answer is that it is not, or at least it is sort of not. Regardless of the amount of time a scrape at time T takes, the next scrape is scheduled for T + interval and will normally happen then. Scrapes are driven by a ticker, which runs independently of how long a scrape took and adjusts things as necessary to keep ticking exactly on time.
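To illustrate just this scheduling behaviour, here is a minimal runnable Go sketch (it is not Prometheus's actual ticker code, and the two-second interval and simulated scrape time are made up): each scrape is due at a fixed multiple of the interval, and a slow scrape doesn't push the next one back.

package main

import (
    "fmt"
    "time"
)

// scrapeLoop runs scrape on a fixed schedule: the Nth scrape is due at
// start + N*interval, regardless of how long earlier scrapes took.
func scrapeLoop(interval time.Duration, scrape func()) {
    next := time.Now().Truncate(interval).Add(interval)
    for {
        time.Sleep(time.Until(next))
        began := time.Now()
        scrape()
        fmt.Printf("scrape due at %s took %s\n",
            next.Format("15:04:05"), time.Since(began).Round(time.Millisecond))
        next = next.Add(interval) // not pushed back by a slow (or timed out) scrape
    }
}

func main() {
    scrapeLoop(2*time.Second, func() {
        time.Sleep(1500 * time.Millisecond) // pretend the scrape is slow
    })
}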
So far, so good. But this means that slow scrapes can have an interesting and surprising interaction with alerting rules and Alertmanager group_wait settings. The short version is that you can get a failing check and then a successful one in close succession, close enough to suppress an Alertmanager alert that you would normally expect to fire.
To make this concrete, suppose that you perform SSH blackbox checks every 90 seconds, time out at 60 seconds, trigger a Prometheus alert rule the moment a SSH check fails, and have a one minute group_wait in Alertmanager. Then if a SSH check times out instead of failing rapidly, you can have a sequence where you start the check at T, have it fail via timeout at T + 60, send a firing alert to Alertmanager shortly afterward, have the next check succeed at T + 90, and withdraw the alert shortly afterward from Alertmanager, before the one minute group_wait is up. The net result is that your 'alert immediately' SSH alert rule has not sent you an alert despite a SSH check failing.
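For illustration, the moving parts of this scenario might look roughly like the following configuration fragments; the job name, rule name, blackbox module name, and targets are all made up:

# prometheus.yml fragment: SSH blackbox checks every 90 seconds, 60 second timeout
scrape_configs:
  - job_name: ssh_check
    metrics_path: /probe
    params:
      module: [ssh_banner]
    scrape_interval: 90s
    scrape_timeout: 60s
    static_configs:
      - targets: ['somehost.example.org:22']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 'blackbox.example.org:9115'

# alert rule fragment: fire the moment the check fails (no 'for:' delay)
groups:
  - name: ssh
    rules:
      - alert: SSHCheckFailed
        expr: probe_success{job="ssh_check"} == 0

# alertmanager.yml fragment: hold a new alert group for one minute before notifying
route:
  receiver: default
  group_wait: 1m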
It's natural to expect this result if your scrape interval is less than your group_wait, because then it's obvious that you can get a second scrape in before Alertmanager makes the alert active. It's not as obvious when the second scrape is possible only because the difference between the scrape interval and the scrape timeout is less than group_wait. (In the example above, that difference is 90 - 60 = 30 seconds, comfortably less than the one minute group_wait.)
(If nothing else, this is going to make me take another look at our scrape timeout settings. I'm going to have to think carefully about just what all of the interactions are here, especially given all of the other alert delays. Note that a resolved alert is immediately sent to Alertmanager.)
PS: It's a pity that there's no straightforward way that I know of to get either Prometheus or Alertmanager to write a log record of pending, firing, and cleared alerts (with timestamps and details). The information is more or less captured in Prometheus metrics, but getting the times when things happened is a huge pain; being able to write optional logs of this would make some things much easier.
(I believe both report this if you set their log level to 'debug', but of course then you get a flood of other information that you probably don't want.)
Sidebar: How Prometheus picks the start time T of scrapes
If you've paid attention to your logs from things like SSH blackbox checks, you'll have noticed that Prometheus does not hit all of your scrape targets at exactly the same time, even if they have the same scrape interval. How Prometheus picks the start time for each scrape target is not based on when it learns about the scrape target, as you might expect; instead, well, let me quote the code:
base = interval - now%interval
offset = t.hash() % interval
next = base + offset
if next > interval {
    next -= interval
}
All of these values are in nanoseconds, and t.hash() is a 64-bit hash value, hopefully randomly distributed. The next result value is an offset to wait before starting the scrape interval ticker.
In short, Prometheus randomly smears the start time for scrape targets across the entire interval, hopefully resulting in a more or less even distribution.
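For concreteness, here is a small runnable Go version of that computation; the per-target hash values below are made-up stand-ins for t.hash():

package main

import (
    "fmt"
    "time"
)

// offsetFor mirrors the quoted computation: derive a start offset within
// [0, interval) from the current time and a per-target hash.
func offsetFor(hash uint64, interval time.Duration, now int64) time.Duration {
    base := int64(interval) - now%int64(interval)
    offset := int64(hash % uint64(interval))
    next := base + offset
    if next > int64(interval) {
        next -= int64(interval)
    }
    return time.Duration(next)
}

func main() {
    now := time.Now().UnixNano()
    interval := 90 * time.Second
    // Made-up hash values standing in for t.hash() of three different targets.
    for _, h := range []uint64{0x1badb002deadbeef, 0x05ca1ab1e0ddba11, 0xfeedfacecafebeef} {
        fmt.Println(offsetFor(h, interval, now))
    }
}

Running this prints three offsets spread across the 90 second interval, which is the smearing described above.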