When Prometheus Alertmanager will tell you about resolved alerts

November 20, 2018

We've configured Prometheus to notify us when our alerts clear up. Recently, we got such a notification email and its timing caught my attention: the alert in it had a listed end time of 9:43:55, but the actual email was only sent at 9:45:40, which is not entirely timely. This got me curious about what affects how soon you get notified about alerts clearing in Prometheus, which led to two significant surprises. The overall summary is that the current version of Alertmanager can't give you prompt notifications of resolved alerts, and on top of that, right now Alertmanager's 'ended-at' times are generally going to be late by as much as several minutes.

Update: It turns out that the second issue is fixed in the recently released Alertmanager 0.15.3. You probably want to upgrade to it.

Alertmanager's relatively minimal documentation describes the group_interval setting as follows:

How long to wait before sending a notification about new alerts that are added to a group of alerts for which an initial notification has already been sent. (Usually ~5m or more.)

If you read this documentation, you would expect that this is a 'no more often than' interval. By this I mean that you have to wait at least group_interval before Alertmanager will send another notification, but that once this time has elapsed, Alertmanager will immediately send a new notification out. This turns out to not be the case. Instead, this is a tick interval; Alertmanager only sends a new notification every group_interval (if there is anything new). For example, if you have a group_interval of 10m and a new alert in the group shows up 11 minutes after the first notification, Alertmanager will not send out a notification until the 20 minute mark, which is the next 'tick' of 'every 10 minutes'. Resolved alerts are treated just the same as new alerts here.
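To make the difference between the two readings concrete, here is a small Python sketch. It only models the behavior as described in this entry (time zero is the first notification for the group, all times in whole minutes); the function names are hypothetical and this is not Alertmanager's actual code.

```python
def next_send_tick_model(event_minute: int, group_interval: int) -> int:
    """Tick model: notifications only go out at multiples of
    group_interval after the first notification (minute 0)."""
    # Integer ceiling division, so an event exactly on a tick uses that tick.
    ticks = -(-event_minute // group_interval)
    return max(ticks, 1) * group_interval


def next_send_no_more_often_model(event_minute: int, group_interval: int) -> int:
    """The reading the documentation suggests: wait at least
    group_interval, then send as soon as something new shows up."""
    return max(event_minute, group_interval)


if __name__ == "__main__":
    # The example from the text: group_interval of 10m, new alert at 11m.
    print(next_send_tick_model(11, 10))
    print(next_send_no_more_often_model(11, 10))
```

Under the tick model, an alert that shows up just after a tick waits almost a full group_interval, which is why a large group_interval hurts so much here.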

This makes a certain amount of sense in Alertmanager's apparent model of the alerting world, but it's quite possibly not what you either expect or want. Certainly it's not what we want, and it's probably going to cause us to change our group_interval settings to be basically the same as (or shorter than) our group_wait settings.

For the rest of this, I'm going to go through the flow of what happens when an alert ends, as far as I can tell.

When a Prometheus alert triggers and enters the firing state, Prometheus sends the alert to Alertmanager. As covered in Alertmanager's API documentation, Prometheus will then re-send the alert every so often. At the moment, Prometheus throttles itself to re-sending a particular alert only once a minute, although you can change this with a command line option. Re-sending an alert doesn't change the labels (one hopes), but it can change the annotations; Prometheus will re-create them every time (as part of re-evaluating the alert rule) and Alertmanager always uses the most recently received annotations if it needs to generate a new notification.

(Of course this doesn't matter if your annotations are constant. Our annotations can include the current state of affairs, which means that a succession of alert notifications can have different annotation text as, say, the reported machine room temperature fluctuates.)

When a firing alert's rule condition stops being true, Prometheus doesn't just drop and delete the alert (although this is what it looks like in the web UI). Instead, it sets the alert's 'ResolvedAt' time, switches the alert to a state that Prometheus internally labels as 'inactive', and keeps it around for another 15 minutes. During these 15 minutes, Prometheus will continue to send it to Alertmanager, which is how Alertmanager theoretically learns that it's been resolved. The first (re-)send after an alert has been resolved is not subject to the regular 60 second minimum re-send interval; it happens immediately, so in theory Alertmanager should immediately know that the alert is resolved. As a side note, the annotations on a resolved alert will be the annotations from the last pre-resolution version of it.

(It turns out that Prometheus always sends alerts to Alertmanager with an 'EndsAt' time filled in. If the rule has been resolved, this is the 'ResolvedAt' time; if it hasn't been resolved, the 'EndsAt' is an internally calculated timeout that's some distance into the future. This may be relevant for alert notification templates. This also appears to mean that Alertmanager's resolve_timeout setting is unused, because the code makes it seem like it's only used for alerts without their own EndsAt time.)
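The way 'EndsAt' gets filled in can be sketched in a few lines of Python. This is an illustration of the description above, not Prometheus's code; the function and parameter names are hypothetical, and treating an alert as resolved once its EndsAt is no longer in the future is an assumption about how the receiving side reads the field.

```python
from datetime import datetime, timedelta
from typing import Optional


def ends_at(resolved_at: Optional[datetime],
            sent_at: datetime,
            timeout: timedelta) -> datetime:
    """Pick the EndsAt value to send along with an alert."""
    if resolved_at is not None:
        # Resolved alert: EndsAt is the actual resolution time.
        return resolved_at
    # Still firing: EndsAt is a timeout some distance into the future.
    return sent_at + timeout


def looks_resolved(alert_ends_at: datetime, now: datetime) -> bool:
    """Assumed receiver-side check: resolved once EndsAt is not after now."""
    return alert_ends_at <= now
```

One consequence for notification templates is that the end time of a still-firing alert is not a real end time at all, just the point where the alert would expire if Prometheus went silent.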

Then we run into an Alertmanager issue that is probably fixed by this recent commit, where the actual resolved alert that Prometheus sent to Alertmanager effectively gets ignored and Alertmanager fails to notice that the alert is actually resolved. Instead, the alert has to reach its earlier expiry time before it becomes resolved, which is generally going to be about three minutes from the last time Prometheus sent the still-firing alert to Alertmanager. In turn, that time may be anywhere up to nearly 60 seconds before Prometheus decided that the alert was resolved.

(Prometheus will often evaluate the alert rule more often than once every 60 seconds, but if the alert rule is still true, it will only send that result to Alertmanager once every 60 seconds.)

This Alertmanager delay in recognizing that the alert is resolved combines in an unfortunate way with the meaning of group_interval, because it can make you miss the group_interval 'tick' and then have your 'alert is resolved' notification delayed until the next tick, however many minutes away it is. To minimize this time, you need to reduce group_interval down to whatever you can tolerate and then set --rules.alert.resend-delay down relatively low, say 20 or 30 seconds. With a 20 second resend delay, the expiry timeout is only a minute, which means a bit less than a minute's delay at most before Alertmanager notices that your alert has resolved.

(You also need your evaluation_interval to be no more than your resend delay.)
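The arithmetic behind these numbers can be sketched as follows. This assumes the expiry timeout is three times the larger of the resend delay and the evaluation interval, which is inferred from the figures in this entry (a 60 second resend delay giving about three minutes, a 20 second one giving a minute) rather than checked against the Prometheus source. The function names are made up for illustration.

```python
def expiry_timeout(resend_delay_s: int, evaluation_interval_s: int,
                   factor: int = 3) -> int:
    """Seconds after the last 'still firing' send at which the alert
    expires on its own. The factor of 3 is an assumption inferred from
    the numbers in the text."""
    return factor * max(resend_delay_s, evaluation_interval_s)


def worst_case_notification_delay(resend_delay_s: int,
                                  evaluation_interval_s: int,
                                  group_interval_s: int) -> int:
    """Rough upper bound on the time from an alert resolving to the
    'resolved' notification going out: Alertmanager notices in less than
    the expiry timeout, and may then have just missed a group_interval
    tick and have to wait for the next one."""
    return expiry_timeout(resend_delay_s, evaluation_interval_s) + group_interval_s


if __name__ == "__main__":
    print(expiry_timeout(60, 60))   # default resend delay
    print(expiry_timeout(20, 20))   # lowered resend delay
    print(expiry_timeout(20, 60))   # evaluation interval left too high
```

The third call shows why the evaluation interval matters: under this assumption, lowering only the resend delay doesn't shrink the timeout if rules are still evaluated once a minute.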

(When the next Alertmanager release comes out with this bug fixed, you can stop setting your resend delay, but you'll still want to have a low group_interval. It seems unlikely that group_interval's behavior will change, given the discussion in issue 1510.)

PS: If you want a full dump of what Alertmanager thinks of your current alerts, the magic trick is:

curl -s http://localhost:9093/api/v1/alerts | jq .

This gives you a bunch more information than Alertmanager's web UI will show. It does exclude alerts that Alertmanager thinks are resolved but hasn't deleted yet, though.
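If you would rather pick out just the start and end times than read the whole dump, a Python sketch along these lines may help. It assumes the response is a JSON object with the alerts in a 'data' list and that each alert has 'labels', 'startsAt' and 'endsAt' fields; check this against what your own Alertmanager actually returns. The function names are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_alerts(base_url: str = "http://localhost:9093") -> dict:
    """Fetch the same data as the curl command above."""
    with urlopen(base_url + "/api/v1/alerts") as resp:
        return json.load(resp)


def summarize_alerts(payload: dict) -> list:
    """Reduce each alert to (alertname, startsAt, endsAt).
    The field names are assumptions about the response layout."""
    rows = []
    for alert in payload.get("data", []):
        labels = alert.get("labels", {})
        rows.append((labels.get("alertname", "?"),
                     alert.get("startsAt"),
                     alert.get("endsAt")))
    return rows


if __name__ == "__main__":
    for name, starts, ends in summarize_alerts(fetch_alerts()):
        print(name, starts, ends)
```

Watching the end times here over a few minutes is a reasonably direct way to see the timeout of a still-firing alert being pushed forward on every re-send.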

Written on 20 November 2018.

Last modified: Tue Nov 20 01:31:20 2018