Why we generate alert notifications about our machines having rebooted

October 7, 2019

One of our Prometheus alerts triggers whenever a machine has recently rebooted. My impression is that having such an alert is unusual these days, so today I'm writing up the two reasons why we have it.

(This is an 'alert' in the sense that all of the output from our Prometheus and Alertmanager is an 'alert', but it is not an alert in the sense of bothering someone outside of working hours. All of our alerts go only to email, and we only pay attention to email during working hours.)

The first reason is that our machines aren't normally supposed to reboot (even most of the ones that are effectively cattle instead of pets, although there are some exceptions). Any unexpected reboot is an anomaly that we want to investigate to try to figure out what's going on. Did we have a power glitch in the middle of the night? Did something run into a kernel panic? And so on. Our mechanism for getting notified about these anomalies is email and the easiest way to send that email is as an 'alert'.

But that's only part of the story, because we don't just monitor these machines to see if they reboot; we also monitor them to see if they go down, and trigger alerts if they do. Our machines don't take forever to reboot, but with all of the twiddling around that modern BIOSes perform, they do take long enough that our regular 'the machine is down' alerts would ordinarily fire. So the second reason that we have a specific reboot alert is that we delay the regular 'machine is down' alerts for long enough that they won't actually fire if the machine is just rebooting immediately; without an additional specific alert, we wouldn't get anything at all. We do this because we'd rather get one email message if a machine reboots instead of two (a 'down machine' alert email and then an 'it cleared up' resolved alert email).

(We consider some machines sufficiently critical that we don't do this, triggering immediate 'down machine' alerts without waiting to see if it's because of a reboot. But not very many.)
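(To make the trade-off concrete, here is a toy Python model of which emails a single outage produces. It is only a sketch and not the actual Prometheus configuration; the function and field names, the 180-second reboot time, and the 300-second delay are all made up for illustration. The delay plays the role that a `for:` duration plays in a Prometheus alert rule.)

```python
from dataclasses import dataclass


@dataclass
class Outage:
    down_seconds: float  # how long the machine was unreachable
    rebooted: bool       # whether it came back freshly booted


def notifications(outage, down_alert_delay, reboot_alert=True):
    """Return the emails one outage would generate in this toy model.

    down_alert_delay is how long the machine must stay down before the
    'machine is down' alert fires (hypothetical value, in seconds).
    """
    emails = []
    if outage.down_seconds >= down_alert_delay:
        # The down alert fired, so we also get its 'resolved' email later.
        emails.append("down")
        emails.append("resolved")
    if reboot_alert and outage.rebooted:
        emails.append("rebooted")
    return emails


quick_reboot = Outage(down_seconds=180, rebooted=True)

# Ordinary machine: delayed down alert plus a reboot alert gives one email.
print(notifications(quick_reboot, down_alert_delay=300))

# Same delay but no reboot alert: the reboot passes with no email at all.
print(notifications(quick_reboot, down_alert_delay=300, reboot_alert=False))

# Critical machine: no delay, so the down and resolved emails are sent too.
print(notifications(quick_reboot, down_alert_delay=0))
```

The model ignores scrape intervals and alert grouping; it only captures the counting argument for why a delayed down alert needs a companion reboot alert.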

There's an additional reason that I like reboot notifications, which is that I feel they're useful as a diagnostic to explain why a machine suddenly dropped off the network for a while. Whether or not we triggered an explicit alert about the machine disappearing, it did disappear, and that may have effects that show up elsewhere (in logs, in user reports, or whatever). With a reboot notification, we immediately know why without having to dig into the situation by hand.

