How we implement notifications that our machines have rebooted in Prometheus

October 8, 2019

I wrote yesterday about why we generate alerts that our machines have rebooted, but not about how we do it. It turns out that there are a few little tricks about doing this in Prometheus, especially in an environment where you're using physical servers.

The big issue is that Prometheus isn't actually designed to send notifications; it's designed to have alerts. The difference between a notification and an alert is that you send a notification once and then you're done, while an alert is raised, potentially triggers various sorts of notifications after some delay, and then goes away. To abuse some terms, a notification is edge triggered while an alert is level triggered. To create a notification in a system that's designed for alerts, we basically need to turn the event we want to notify about into a level-triggering condition that we can alert on. This condition needs to be true for a while, so the alert is reliably triggered and sent (even in the face of delays or failure to immediately scrape the server's host agent), but it has to go away again sooner or later (otherwise we will basically have a constantly asserted alert that clutters things up).

So the first thing we need is a condition (ie, a Prometheus expression) that is reliably true if a server has rebooted recently. For Linux machines, what you want to use looks like this:

(node_time_seconds - node_boot_time_seconds) < (19*60) >= (5*60)

This condition is true from five minutes after the server reboots until 19 minutes after, and its value is how long the server has been up (in seconds), which is handy for putting into the actual notification we get. We delay the alert until the server has been up for a bit so that if we're repeatedly rebooting the server while working on it, we won't get a deluge of reboot notifications; you could make this delay shorter if you wanted.

(We turn the alert off after the odd 19 minutes because our alert suppression for large scale issues lingers for 20 minutes after the large scale situation seems to have stopped. By cutting off 'recent reboot' notifications just before that, we avoid a flood of 'X recently rebooted' email when a bunch of machines come back up in such a situation.)
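
Put together as an actual alerting rule, a minimal sketch of this might look like the following (the rule name, the 'host' label, and the annotation wording are my assumptions here, not our exact configuration; the 'cstype' label is explained further down):

groups:
  - name: notifications
    rules:
      - alert: HostRebooted
        # True (with the uptime in seconds as its value) from 5 minutes
        # after a reboot until 19 minutes after, then it goes away.
        expr: (node_time_seconds - node_boot_time_seconds) < (19*60) >= (5*60)
        labels:
          cstype: notify
        annotations:
          summary: '{{ $labels.host }} rebooted {{ $value | humanizeDuration }} ago'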

The obvious way to write this condition is to use 'time()' instead of 'node_time_seconds'. The problem with this is that what the Linux kernel actually exposes is how long the system has been up (in /proc/uptime), not the absolute time of system boot. The Prometheus host agent turns this relative time into an absolute time, using the server's local time. If we use some other source of (absolute) time to try to re-create the time since reboot (such as Prometheus's idea of the current time), we run into problems if and when the server's clock changes after boot. As they say, ask me how I know; our first version used 'time()' and we had all sorts of delayed reboot notifications and so on when servers rebooted or powered on with bad time.

(This is likely to be less of an issue in virtualized environments because your VMs probably boot up with something close to accurate time.)
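
For contrast, the version that gets you into trouble is simply the same expression with Prometheus's own idea of the current time substituted in (a reconstruction, not our exact original rule):

(time() - node_boot_time_seconds) < (19*60) >= (5*60)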

The other side of the puzzle is in Alertmanager, and comes in two parts. The first part is simply that we want the alert destination (the receiver) for this type of 'alert' to not set send_resolved to true the way our other receivers do; we only want to get email at the start of the 'alert', not when it quietly goes away. The second part is defeating grouping, because Alertmanager is normally very determined to group alerts together while we pretty much want one email per 'notification'. Unfortunately you can't tell Alertmanager to group by nothing ('[]'), so instead we give it a long list of labels to 'group by', which in practice makes each alert unique. The result looks like this:

- match:
    cstype: 'notify'
  group_by: ['alertname', 'cstype', 'host', 'instance', 'job', 'probe', 'sendto']
  receiver: notify-receiver
  group_wait: 0s
  group_interval: 5m

We put the special 'cstype' label on all of our notification type alerts in order to route them to this receiver. Since we don't want to group things together and we do want notifications to be immediate, there's no point in a non-zero group_wait (it would only delay the email). The group_interval is there to reduce how much email we'd get if a notification started flapping for some reason.

(The group interval interacts with how soon you trigger notifications, since it will effectively suppress genuine repeated notifications within that time window. This can affect how you want to write the notification alert expressions.)
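
On the receiver side, the definition is an ordinary email receiver that simply doesn't ask for 'resolved' messages. A minimal sketch (the destination address here is made up):

receivers:
  - name: notify-receiver
    email_configs:
      - to: 'sysadmins@example.org'
        # Unlike our other receivers, we don't set send_resolved to true;
        # we only want email when the 'alert' starts, not when it ends.
        send_resolved: false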

Our Alertmanager templates have special handling for these notifications. Because they aren't alerts, they generate different email Subject: lines and have message bodies that talk about notifications instead of alerts (and know that there will never be 'resolved' notifications that they need to tell us about).
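
As a rough illustration of the idea (the template name and the wording are invented for this example, not taken from our actual templates), the email configuration in a receiver like the one above can point its Subject: header at a custom template:

        headers:
          Subject: '{{ template "notify.subject" . }}'

with the template itself defined in an Alertmanager templates file along these lines:

{{ define "notify.subject" }}notification: {{ .CommonLabels.alertname }} on {{ .CommonLabels.host }}{{ end }}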

All in all, using Prometheus and Alertmanager for this is a bit of a hack, but it works (and works well), and doing it this way saves us from having to build a second system for it. And, as I've mentioned before, this way Prometheus handles dealing with state for us (including the state of 'there is some sort of large scale issue going on, we don't need to be deluged with notes about machines booting up').
