Wandering Thoughts archives


Sensible reboot monitoring

I have an embarrassing confession: we recently discovered that some of our machines had been spontaneously rebooting every so often, and we hadn't noticed. This is not really a good thing; if your servers are spontaneously rebooting, you should know about it. We have a monitoring system, so of course the right answer is to have the monitoring system alert us when a system reboots.

(Some of you are laughing sadly right now.)

The problem with alert-on-reboot is that you get alerted for every reboot. Including all of the times that you deliberately reboot a machine. And unless you have serious problems, almost all system reboots are deliberate reboots, which means that you've created an alert that is almost entirely noise. Pretty soon you're going to be completely habituated to reboot alerts and you'll screen them out automatically. Just like all other alerts, in order to make reboot alerts work you need to make them low-noise. In other words, reboot alerts need to ignore deliberate reboots and only alert you when a machine reboots unexpectedly.

The best way to do this depends on your monitoring system. You can do it in the notifier agent that you run on your systems (so that it sends a 'machine rebooted unexpectedly' alert only when the reboot doesn't look deliberate), or you may be able to do it in the monitoring system itself if it's smart enough.
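As a rough illustration of the agent-side approach, here is a minimal Python sketch. It assumes a convention that is not from any particular monitoring system: whatever you use for planned reboots first drops a marker file, and the agent checks for a recent marker when the machine comes back up. The marker path, the function names, and the 15-minute window are all hypothetical choices.

```python
import json
import time
from pathlib import Path
from typing import Optional

# Hypothetical marker location; it must live on storage that survives a reboot.
MARKER = Path("/var/lib/reboot-monitor/deliberate-reboot.json")


def mark_deliberate_reboot(reason: str, marker: Path = MARKER,
                           now: Optional[float] = None) -> None:
    """Call this just before a planned reboot, e.g. from a wrapper script."""
    marker.parent.mkdir(parents=True, exist_ok=True)
    stamp = time.time() if now is None else now
    marker.write_text(json.dumps({"requested_at": stamp, "reason": reason}))


def classify_boot(boot_time: float, marker: Path = MARKER,
                  max_age: float = 900.0) -> str:
    """Return 'deliberate' or 'unexpected' for the boot that just happened."""
    try:
        requested_at = float(json.loads(marker.read_text())["requested_at"])
    except (OSError, ValueError, KeyError, TypeError):
        # No marker, or one we cannot read: treat the reboot as unexpected.
        return "unexpected"
    marker.unlink()  # a marker vouches for at most one reboot
    age = boot_time - requested_at
    # A stale marker (someone planned a reboot and never did it) must not
    # excuse a crash that happens days later.
    return "deliberate" if 0 <= age <= max_age else "unexpected"
```

The agent would call classify_boot() once at startup and only send the alert when it gets 'unexpected'. Erring toward 'unexpected' whenever the marker is missing, unreadable, or old is deliberate; a spurious alert now and then is cheaper than a missed crash.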

(If you do this, I think you should also track all system reboots, just without alerting on them. Partly this is a just-in-case measure, and partly it's because there may turn out to be other events that are correlated with system reboots, even deliberate ones. If you don't track all reboots, you can't spot these relationships.)
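One way to keep tracking separate from alerting is sketched below in Python. It assumes a simple append-only log with one JSON record per line; the file format, the function names, and the 'deliberate' / 'unexpected' labels are hypothetical, not something a specific monitoring system provides.

```python
import json
from pathlib import Path
from typing import List


def record_boot(log: Path, host: str, boot_time: float, kind: str) -> bool:
    """Log every boot; return True only if this one deserves an alert."""
    entry = {"host": host, "boot_time": boot_time, "kind": kind}
    with log.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return kind == "unexpected"


def boots_between(log: Path, start: float, end: float) -> List[dict]:
    """Return all logged boots in [start, end], deliberate or not."""
    if not log.exists():
        return []
    with log.open() as f:
        entries = [json.loads(line) for line in f if line.strip()]
    return [e for e in entries if start <= e["boot_time"] <= end]
```

Every boot lands in the log, but only unexpected ones page anyone. Later on, boots_between() lets you line reboots up against other events from the same time window.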

sysadmin/SensibleRebootMonitoring written at 00:57:19

