Why we put alert start and end times in our Prometheus alert messages

June 5, 2020

As I mentioned in Formatting alert start and end times in Alertmanager messages, we put the alert start times and, if applicable, the alert end times in the (email) alert messages that we send out. Generally these look like one of these two:

  • for a current alert
    (alert started at 15:16:02 EDT 2020-06-05, likely detected ~90s earlier)

  • for an alert that has ended
    (alert active from 15:16:02 EDT 2020-06-05 to 15:22:02 EDT 2020-06-05)

(The 'likely detected ..' bit is there because most of our Prometheus alert rules have a 'for:' clause, so the alert condition becomes true somewhat before the alert itself starts.)
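As a rough illustration of how those two message forms relate to an alert's start time, end time, and 'for:' delay, here is a minimal Python sketch. It is not how Alertmanager templates work (those are Go templates); the function name describe_alert, its parameters, and the fixed EDT offset are all hypothetical and exist only to reproduce the wording shown above.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-4 zone labelled "EDT", so the sketch does not
# depend on the system's time zone database.
EDT = timezone(timedelta(hours=-4), "EDT")
TIME_FORMAT = "%H:%M:%S %Z %Y-%m-%d"


def describe_alert(starts_at, ends_at=None, for_delay=None):
    """Build the parenthesized start/end line for one alert (hypothetical helper).

    starts_at and ends_at are timezone-aware datetimes; for_delay is the
    rule's 'for:' duration as a timedelta, or None if the rule has none.
    """
    start = starts_at.strftime(TIME_FORMAT)
    if ends_at is not None:
        # Resolved alert: report the whole active interval.
        return f"(alert active from {start} to {ends_at.strftime(TIME_FORMAT)})"
    text = f"(alert started at {start}"
    if for_delay:
        # The condition was already true for roughly the 'for:' duration.
        text += f", likely detected ~{int(for_delay.total_seconds())}s earlier"
    return text + ")"
```

Calling describe_alert(datetime(2020, 6, 5, 15, 16, 2, tzinfo=EDT), for_delay=timedelta(seconds=90)) gives the 'current alert' line shown above, and passing an ends_at of 15:22:02 gives the 'alert that has ended' line.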

At the beginning of life with Prometheus and Alertmanager, it may not be obvious why this is useful and sometimes even necessary; after all, the alert message itself already has a time when it was emailed, posted to your communication channel, or whatever.

The lesser reason we do this, especially for alert end times, is that it's convenient to have this information in one place when we're going back through email. If we have a 'this alert is resolved' email, we don't have to search back to see when it started; the information is right there in the final message. There's a similar but smaller convenience with email about the start of single alerts, since you can just directly read off the start time from the text of the message without looking back to however your mail client is displaying the email's sending time.

The larger reason is how Alertmanager works with grouped alerts (which is almost all of our alerts). Alertmanager's core model is that rather than sending you new alerts or resolved alerts (or both), it will send you the entire current state of the group's alerts any time that state changes. What this means is that if at first alert A is raised, then somewhat later alert B, then finally alert C, you will get an email listing 'alert A is active', then one saying 'alerts A and B are active', then a third saying 'alerts A, B, and C are active'.
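To make the 'whole group state on every change' behaviour concrete, here is a small Python model of it. This is a deliberately simplified sketch and not Alertmanager's implementation: it assumes one notification per state change, ignores the batching that settings like group_wait and group_interval introduce, and drops resolved alerts from the list instead of reporting them as resolved. The function name group_notifications and the event tuples are made up for the example.

```python
def group_notifications(events):
    """Return the alert list carried by each notification for one group.

    events is a sequence of (action, alert_name) pairs, where action is
    "fire" or "resolve". Every event changes the group's state, and in this
    simplified model every state change sends the full list of active alerts.
    """
    active = []
    messages = []
    for action, name in events:
        if action == "fire" and name not in active:
            active.append(name)
        elif action == "resolve" and name in active:
            active.remove(name)
        else:
            # No state change (duplicate fire or unknown resolve): no message.
            continue
        # Snapshot the whole group, not just the alert that changed.
        messages.append(list(active))
    return messages
```

With the A, B, C sequence from the text, group_notifications([("fire", "A"), ("fire", "B"), ("fire", "C")]) returns [["A"], ["A", "B"], ["A", "B", "C"]]: three messages, each repeating everything that is still active.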

When you get these emails, you generally want to know what alerts are new and what alerts are existing older alerts. You're probably already looking at the existing alerts, but the new alerts may be for new extra problems that you also need to look at, and they may be a sign that things are getting worse. And this is why you want the alert start times, because they let you tell which alerts are more recent (and more likely to be new ones you haven't seen before) and which ones are older. It's not as good as being clearly told which alerts are new in this message, but it's as good as we can get in the Alertmanager model of the world.
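The comparison a reader does by eye with those start times can be sketched in a few lines of Python. This is only an illustration of the reasoning, under the assumption that you know roughly when you last looked at the group (the cutoff below); the names newest_first and started_since and the (name, start time) pairs are hypothetical and do not correspond to anything Alertmanager provides.

```python
from datetime import datetime, timedelta


def newest_first(alerts):
    """Order (name, starts_at) pairs so the most recently started come first."""
    return sorted(alerts, key=lambda alert: alert[1], reverse=True)


def started_since(alerts, cutoff):
    """Names of alerts whose start time is after cutoff, in their given order.

    These are the ones most likely to be new since the previous message.
    """
    return [name for name, starts_at in alerts if starts_at > cutoff]


# Example data: three alerts in one grouped message.
t0 = datetime(2020, 6, 5, 15, 16, 2)
alerts = [
    ("A", t0),
    ("B", t0 + timedelta(minutes=5)),
    ("C", t0 + timedelta(minutes=12)),
]
```

If the previous message arrived ten minutes after t0, started_since(alerts, t0 + timedelta(minutes=10)) returns ["C"], which is the alert that deserves a fresh look.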

(I don't know if Alertmanager puts the alerts in these messages in any particular order. Even if it does so today, there's no documentation about it so it's not an official feature and may change in the future. It would be nice if Alertmanager used a documented and useful order, or let you sort the alerts based on start and end times.)

Comments on this page:

By Simon at 2020-06-08 03:36:09:

Right now, Alertmanager will sort alerts within a group by the "job" and "instance" labels [1]. There's a longstanding issue [2] to allow sorting on arbitrary fields but it's a bit contentious.

[1] https://github.com/prometheus/alertmanager/blob/12da9d6570ce0f487f98eed1be32e9c7e53da6b1/types/types.go#L321-L345

[2] https://github.com/prometheus/alertmanager/issues/1178

Written on 05 June 2020.

Last modified: Fri Jun 5 22:37:09 2020