How we implement reboot notifications for our machines in Prometheus
I wrote yesterday about why we generate alerts that our machines have rebooted, but not about how we do it. It turns out that there are a few little tricks about doing this in Prometheus, especially in an environment where you're using physical servers.
The big issue is that Prometheus isn't actually designed to send notifications; it's designed to have alerts. The difference between a notification and an alert is that you send a notification once and then you're done, while an alert is raised, potentially triggers various sorts of notifications after some delay, and then goes away. To abuse some terms, a notification is edge triggered while an alert is level triggered. To create a notification in a system that's designed for alerts, we basically need to turn the event we want to notify about into a level-triggering condition that we can alert on. This condition needs to be true for a while, so the alert is reliably triggered and sent (even in the face of delays or failure to immediately scrape the server's host agent), but it has to go away again sooner or later (otherwise we will basically have a constantly asserted alert that clutters things up).
So the first thing we need is a condition (ie, a Prometheus expression) that is reliably true if a server has rebooted recently. For Linux machines, what you want to use looks like this:
(node_time_seconds - node_boot_time_seconds) < (19*60) >= (5*60)
This condition is true from five minutes after the server reboots until it has been up for 19 minutes, and its value is how long the server has been up (in seconds), which is handy for putting in the actual notification we get. We delay sending the alert until the server has been up for a bit so that if we're repeatedly rebooting the server while working on it, we won't get a deluge of reboot notifications; you could make this shorter if you wanted.
(We turn the alert off after the odd 19 minutes because our alert suppression for large scale issues lingers for 20 minutes after the large scale situation seems to have stopped. By cutting off 'recent reboot' notifications just before that, we avoid getting a bunch of 'X recently rebooted' when a bunch of machines come back up in such a situation.)
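As a rough sketch, an expression like this could sit in a Prometheus rule file along the following lines. Only the expression and the 'cstype' label (discussed further down) come from this entry; the group name, the alert name, and the annotation text are made up for illustration and aren't our real rule.

```yaml
groups:
  - name: reboot-notifications        # hypothetical group name
    rules:
      - alert: RecentlyRebooted       # hypothetical alert name
        # True only while uptime is at least 5 minutes and under 19 minutes.
        # The comparison operators filter, so the value stays the uptime in seconds.
        expr: (node_time_seconds - node_boot_time_seconds) < (19*60) >= (5*60)
        labels:
          cstype: notify              # routing label used by Alertmanager
        annotations:
          # Illustrative wording; humanizeDuration turns seconds into e.g. "6m 0s".
          summary: "{{ $labels.instance }} rebooted {{ $value | humanizeDuration }} ago"
```

There is deliberately no 'for:' clause in this sketch, because the five minute delay is already built into the expression itself.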
The obvious way to write this condition is to use 'time()' instead of 'node_time_seconds'. The problem with this is that what the Linux kernel actually exposes is how long the system has been up (in /proc/uptime), not the absolute time of system boot. The Prometheus host agent turns this relative time into an absolute time, using the server's local time. If we use some other source of (absolute) time to try to re-create the time since reboot (such as Prometheus's idea of the current time), we run into problems if and when the server's clock changes after boot. As they say, ask me how I know; our first version used 'time()' and we had all sorts of delayed reboot notifications and so on when servers rebooted or powered on with bad time.
(This is likely to be less of an issue in virtualized environments because your VMs probably boot up with something close to accurate time.)
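To make the difference concrete, here is a small comparison sketch written as two recording rules, with worked numbers in the comments. The group and record names are hypothetical and exist only for illustration, and the arithmetic assumes the server's clock is off by a constant amount while the two values are being compared.

```yaml
# Illustration only, not something to deploy as-is.
groups:
  - name: uptime-comparison-sketch                      # hypothetical group name
    rules:
      # Both timestamps come from the server itself, so a constant
      # clock error cancels out and the result is the real uptime.
      - record: sketch:uptime_host_clock:seconds        # hypothetical name
        expr: node_time_seconds - node_boot_time_seconds
      # Mixes Prometheus's clock with the server's clock, so any
      # error in the server's clock leaks straight into the result.
      - record: sketch:uptime_prometheus_clock:seconds  # hypothetical name
        expr: time() - node_boot_time_seconds

# Worked example: the server has really been up for 600 seconds.
#   Server clock 1800 seconds fast:
#     host clock version:       600
#     Prometheus clock version: 600 - 1800 = -1200
#     The 5 to 19 minute window is only reached once real uptime hits
#     2100 seconds (35 minutes), so the notification is delayed.
#   Server clock 1800 seconds slow:
#     host clock version:       600
#     Prometheus clock version: 600 + 1800 = 2400
#     This is already past the 19 minute (1140 second) cutoff, so no
#     notification is sent for as long as the clock stays wrong.
```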
The other side of the puzzle is in Alertmanager, and comes in two parts. The first part is simply that we want our alert destination (the receiver) for this type of 'alert' to not set 'send_resolved' to true, the way our other receivers do; we only want to get email at the start of the 'alert', not when it quietly goes away. The second part is defeating grouping, because Alertmanager is normally very determined to group alerts together while we pretty much want to get one email per 'notification'. Unfortunately you can't tell Alertmanager to group by nothing, so instead we have a long list of labels to 'group by' which in practice make each alert unique. The result looks like this:
- match:
    cstype: 'notify'
  group_by: ['alertname', 'cstype', 'host', 'instance', 'job', 'probe', 'sendto']
  receiver: notify-receiver
  group_wait: 0s
  group_interval: 5m
We put the special 'cstype' label on all of our notification type alerts in order to route them to this. Since we don't want to group things together and we do want notifications to be immediate, there's no point in a non-zero 'group_wait' (it would only delay the notification). The 'group_interval' is there to reduce how much email we'd get if a notification started flapping for some reason.
(The group interval interacts with how soon you trigger notifications, since it will effectively suppress genuine repeated notifications within that time window. This can affect how you want to write the notification alert expressions.)
Our Alertmanager templates have special handling for these notifications. Because they aren't alerts, they generate different Subject: lines and have message bodies that talk about notifications instead of alerts (and know that there will never be 'resolved' notifications that they need to tell us about).
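For illustration, a receiver along these lines might look something like the following sketch. The receiver name matches the route shown earlier, but the destination address and the Subject: wording are invented for this example and aren't our real configuration or templates.

```yaml
receivers:
  - name: notify-receiver
    email_configs:
      - to: 'sysadmins@example.org'   # hypothetical address
        # Only send email when the 'alert' starts, never when it resolves.
        send_resolved: false
        headers:
          # Illustrative wording that talks about a notification, not an alert.
          Subject: 'NOTICE: {{ .CommonLabels.alertname }} on {{ .CommonLabels.host }}'
```

For email receivers, Alertmanager's default for 'send_resolved' is already false, so spelling it out here is mostly documentation. Using '.CommonLabels' is safe in this sketch because both labels are in the 'group_by' list, so every alert in a group shares them.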
All in all using Prometheus and Alertmanager for this is a bit of a hack, but it works (and works well) and doing it this way saves us from having to build a second system for it. And, as I've mentioned before, this way Prometheus handles dealing with state for us (including the state of 'there is some sort of large scale issue going on, we don't need to be deluged with notes about machines booting up').