Sensible reboot monitoring

September 15, 2012

I have an embarrassing confession: we recently discovered that some of our machines had been spontaneously rebooting every so often, and we hadn't noticed. This is not really a good thing; if your servers are spontaneously rebooting, you should know about it. We have a monitoring system, so of course the right answer is to have the monitoring system alert us when a system reboots.

(Some of you are laughing sadly right now.)

The problem with alert-on-reboot is that you get alerted for every reboot. Including all of the times that you deliberately reboot a machine. And unless you have serious problems, almost all system reboots are deliberate reboots, which means that you've created an alert that is almost entirely noise. Pretty soon you're going to be completely habituated to reboot alerts and you'll screen them out automatically. Just like all other alerts, in order to make reboot alerts work you need to make them low-noise. In other words, reboot alerts need to ignore deliberate reboots and only alert you when a machine reboots unexpectedly.

The best way to do this depends on your monitoring system. You can do it in the notifier agent that you run on your systems (such that it only sends a 'machine rebooted unexpectedly' alert under some circumstances), or you may be able to do it in the monitoring system itself if it's smart enough.
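As a rough illustration of the agent-side approach, here is a minimal sketch in Python. It assumes that deliberate reboots go through a wrapper that drops a marker file first; the marker path, the 15-minute tolerance, and the function names are all hypothetical, not something any particular monitoring system provides.

```python
import os
import time

# Hypothetical marker path; a wrapper around the reboot command would write it.
MARKER = "/var/lib/reboot-monitor/planned"
# Assumed tolerance: a marker older than this no longer excuses a reboot.
MAX_MARKER_AGE = 15 * 60  # seconds


def mark_planned_reboot(marker=MARKER):
    """Call this just before a deliberate reboot."""
    os.makedirs(os.path.dirname(marker), exist_ok=True)
    with open(marker, "w") as f:
        f.write(str(time.time()))


def classify_reboot(marker=MARKER, now=None, max_age=MAX_MARKER_AGE):
    """Run once at boot; returns 'planned' or 'unexpected'.

    The marker is removed after it is read, so it only covers one reboot.
    """
    now = time.time() if now is None else now
    try:
        with open(marker) as f:
            stamp = float(f.read().strip())
    except (OSError, ValueError):
        # No marker, or an unreadable one: treat the reboot as unexpected.
        return "unexpected"
    os.remove(marker)
    return "planned" if 0 <= now - stamp <= max_age else "unexpected"
```

The agent would then send the 'machine rebooted unexpectedly' alert only when `classify_reboot()` returns `"unexpected"`. A reboot that bypasses the wrapper still alerts, which is probably the failure mode you want.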

(If you do this, I think that you should also track all system reboots, just without alerting on them. This is partly a just-in-case measure, and partly because there may turn out to be other events that are correlated with system reboots, even deliberate ones. If you don't track all reboots, you can't spot these relationships.)
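One simple way to track every reboot without alerting is an append-only log that can be searched later. The sketch below assumes a JSON-lines file; the field names, the ten-minute default window, and both function names are made up for illustration.

```python
import json
import socket
import time


def record_reboot(log_path, kind, when=None, host=None):
    """Append one reboot record (planned or not) to a JSON-lines log."""
    entry = {
        "host": host or socket.gethostname(),
        "time": time.time() if when is None else when,
        "kind": kind,  # e.g. "planned" or "unexpected"
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def reboots_near(log_path, event_time, window=600):
    """Return logged reboots within `window` seconds of some other event."""
    with open(log_path) as f:
        entries = [json.loads(line) for line in f if line.strip()]
    return [e for e in entries if abs(e["time"] - event_time) <= window]
```

With something like this in place, checking whether an odd event lines up with a reboot is a single call to `reboots_near()`.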

Comments on this page:

From at 2012-09-15 09:24:16:

One solution is to program your maintenance windows into the alerting logic and suppress alerts that occur during a window. There will be times when purposeful reboots trigger alerts, but it should be tolerable, presuming you have defined windows and use them.
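A minimal sketch of that window check, assuming weekly windows expressed as (weekday, start, end) in local time. The window list here is hypothetical, and windows that cross midnight are deliberately not handled.

```python
from datetime import datetime, time

# Hypothetical weekly windows: (weekday, start, end), with Monday == 0.
WINDOWS = [
    (1, time(2, 0), time(4, 0)),     # Tuesday 02:00-04:00
    (5, time(22, 0), time(23, 59)),  # Saturday 22:00-23:59
]


def in_maintenance_window(when, windows=WINDOWS):
    """True if `when` falls inside any configured window."""
    return any(
        when.weekday() == day and start <= when.time() <= end
        for day, start, end in windows
    )


def should_alert_on_reboot(when, windows=WINDOWS):
    """Suppress reboot alerts during maintenance windows only."""
    return not in_maintenance_window(when, windows)
```

A deliberate reboot outside a window still alerts under this scheme, which is the tolerable noise mentioned above.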

From at 2012-09-15 13:31:47:

Where I work we have two "levels" of alerts: the 'notification' one goes to a list that includes the inbox of all sysadmins for that particular team, and the second includes the same folks as the first, but also the "pager" (read: BlackBerry) for on-call stuff.

If we tied reboot alerts to our monitoring system, they'd probably go to the simple notification one. As long as the service isn't interrupted there's no sense waking someone up about it. We generally only page/alert on service interruptions: so if one half of a redundant pair of devices goes down at 2 AM, we get notified so we know about it in the morning, but no one gets woken up unless the service health check itself reports breakage.

We don't tie reboots to our monitoring system (modulo the fact that if the machine is down long enough the ping check may fail), but rather just have a small start up script that sends out an e-mail about the reboot with the hostname and time.
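A start-up script of that kind might look roughly like the following Python sketch. The addresses, the subject format, and the use of a local SMTP relay on `localhost` are assumptions for illustration, not a description of the actual script.

```python
import smtplib
import socket
from datetime import datetime
from email.message import EmailMessage


def build_reboot_mail(sender, recipient, host=None, when=None):
    """Build the notification message with the hostname and boot time."""
    host = host or socket.gethostname()
    when = when or datetime.now()
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"{host} rebooted at {when:%Y-%m-%d %H:%M:%S}"
    msg.set_content(f"Host {host} finished booting at {when.isoformat()}.\n")
    return msg


def send_reboot_mail(msg, smtp_host="localhost"):
    """Hand the message to an SMTP relay (assumed to listen on localhost)."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)
```

Run at boot, this would be something like `send_reboot_mail(build_reboot_mail("root@example.org", "sysadmins@example.org"))`, with both addresses being placeholders.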

We've found this "notification" and "alert" / "page" split handy: a lot of our more important infrastructure is redundant, so things keep running if only one device falls over. When I started, a page went out for everything, and much sleep was lost. We triaged a bit and decided some pages go out during business hours, some go out during "daylight hours" (9-9, 7 days), and the most important stuff is 24/7.

This allowed people to know that when an alarm went out it was actually important.
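The time-based triage described above can be sketched as a small routing function. The "daylight" tier (9-9, 7 days) comes from the description; the tier names and the exact business hours (weekdays, 9 to 17) are assumptions.

```python
from datetime import datetime


def should_page(tier, when):
    """Decide whether an alert of the given tier pages someone right now.

    Alerts that do not page would still go to the notification list.
    """
    if tier == "always":      # the most important stuff, 24/7
        return True
    if tier == "daylight":    # 9-9, 7 days a week
        return 9 <= when.hour < 21
    if tier == "business":    # assumed: weekdays, 09:00-17:00
        return when.weekday() < 5 and 9 <= when.hour < 17
    raise ValueError(f"unknown tier: {tier!r}")
```

For example, a "daylight" alert at 2 AM would only be mailed to the list, while an "always" alert would page regardless of the hour.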

Written on 15 September 2012.

Last modified: Sat Sep 15 00:57:19 2012