What I want in Prometheus (as a whole) is aggregating alert notifications

February 7, 2023

I recently looked at Prometheus's new feature to keep alerts firing for a while (often to avoid flapping alerts) and in the process realized that it wasn't really what I want. The simple way to put it is that what I want less of is not the alerts themselves but alert notifications. For that, what I really want is notifications that can (at some point) aggregate together information about multiple alerts over time. Instead of getting one notification each time the alert triggers, perhaps I would get one notification every twenty minutes telling me, say, that the alert triggered three times in the last twenty minutes for a total of seven minutes when it was active (and I can look at a dashboard if I want to know exactly when). This preserves relatively precise alert times in Prometheus itself while not dumping too many notifications on us.

(Specifically it preserves accurate details about when alerts were firing in the metrics database.)
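To make the 'three times in the last twenty minutes for a total of seven minutes' summary concrete, here is a minimal Python sketch of the arithmetic involved. It assumes something has already recorded each firing of an alert as a (start, end) pair; the function names and the message wording are made up for illustration and aren't part of Prometheus or Alertmanager.

```python
from datetime import datetime, timedelta


def summarize_firings(intervals, window_start, window_end):
    """Count firings and total active time that overlap a reporting window.

    intervals: list of (start, end) datetime pairs for one alert
    (a hypothetical record format, not something Prometheus emits).
    """
    count = 0
    active = timedelta(0)
    for start, end in intervals:
        # Clip each firing to the window so only time inside it is counted.
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_start < clipped_end:
            count += 1
            active += clipped_end - clipped_start
    return count, active


def format_summary(name, count, active, window):
    """Render one aggregated notification line (wording is illustrative)."""
    def minutes(delta):
        return int(delta.total_seconds() // 60)

    return (
        f"{name} triggered {count} time(s) in the last {minutes(window)} "
        f"minutes, active for {minutes(active)} minutes in total"
    )
```

The exact start and end times stay wherever they were recorded; the notification only carries the rolled-up count and duration.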

This aggregation obviously can't happen in Prometheus itself; Prometheus cares about alerts, not alert notifications. It also doesn't really fit in the current model for Alertmanager. Alert aggregation over time and how to present it in notifications is a complex area; trying to put this in Alertmanager would add a lot of complication to a core component that a lot of people are pretty happy with today (us included). Practically speaking this probably needs to be a separate component that will keep its own ongoing database of (recent) past alerts, notification times, and so on; the obvious implementation approach today would be as an Alertmanager webhook.

(The list of webhook receivers includes a logging one and one that dumps alerts into MySQL, which I want to note since I've looked at it now.)
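As a rough idea of what the smallest version of such a separate component might look like, here is a Python sketch of an Alertmanager webhook receiver that records each alert it is told about in SQLite. The payload fields it reads (`alerts`, `status`, `labels`, `startsAt`, `endsAt`, `fingerprint`, `groupKey`) are the ones in Alertmanager's webhook JSON; the table layout, port, database path, and function names are all assumptions for illustration, and there's no aggregation logic here, only the 'keep a database of recent alerts' part.

```python
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer


def open_db(path=":memory:"):
    """Open (and if needed create) the alert history table. Schema is illustrative."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS alert_events ("
        " fingerprint TEXT, alertname TEXT, status TEXT,"
        " starts_at TEXT, ends_at TEXT, group_key TEXT)"
    )
    return conn


def record_payload(conn, payload):
    """Store every alert in one webhook payload; return how many rows were added."""
    rows = [
        (
            alert.get("fingerprint", ""),
            alert.get("labels", {}).get("alertname", ""),
            alert.get("status", ""),
            alert.get("startsAt", ""),
            alert.get("endsAt", ""),
            payload.get("groupKey", ""),
        )
        for alert in payload.get("alerts", [])
    ]
    conn.executemany("INSERT INTO alert_events VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def make_handler(conn):
    """Build a request handler class bound to one database connection."""
    class WebhookHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length))
            record_payload(conn, payload)
            self.send_response(200)
            self.end_headers()

    return WebhookHandler


def serve(port=9099, db_path="alerts.db"):
    """Run the receiver forever; port and path are arbitrary example values."""
    conn = open_db(db_path)
    HTTPServer(("127.0.0.1", port), make_handler(conn)).serve_forever()
```

Calling `serve()` and pointing an Alertmanager webhook receiver at that address would start accumulating history; deciding when and what to actually notify about from that history is the hard part that this leaves out.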

With that said, you can do some alert aggregation today in Prometheus if you're willing to have 'alerts' that don't always turn off (or perhaps turn on) when the underlying condition does. You can, for example, suppress or extend an alert when it has triggered enough times in the recent past through creative use of the changes() function (and I mentioned this possibility back in my entry on maybe avoiding flapping alerts). This will indirectly 'aggregate' notifications about the alert triggering and resolving by not actually resolving and then re-triggering the alert.
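The logic of the 'extend' version can be modeled outside of PromQL. The following Python sketch is only a model of the idea, not an alert rule: it assumes the underlying condition has been sampled as 0 or 1 over some lookback window, counts value changes between consecutive samples the way changes() does over a range, and keeps the alert on if the condition has flipped too often. The threshold and the function names are invented for the example, and real alert expressions have extra wrinkles (such as series that are absent rather than 0) that this ignores.

```python
def changes(samples):
    """Number of times consecutive samples differ, in the spirit of PromQL's changes()."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def alert_should_fire(condition_samples, max_changes=4):
    """Decide whether the 'extended' alert is on.

    condition_samples: 0/1 values of the underlying condition over the
    lookback window, oldest first. max_changes is an arbitrary example value.
    """
    currently_true = bool(condition_samples) and condition_samples[-1] == 1
    flapping = changes(condition_samples) >= max_changes
    # Fire on the real condition, or stay on because it flapped too much recently.
    return currently_true or flapping
```

With this, a window of `[0, 1, 0, 1, 0]` keeps the alert firing even though the condition is currently false, which is exactly the 'alert that doesn't turn off when the underlying condition does' tradeoff.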

Within Alertmanager, your only 'aggregation' choice today is a long group_interval. This may be tolerable if you don't care about getting relatively prompt notifications about resolved alerts. Unfortunately, from what I remember of the Alertmanager code involved here, it would be hard to have a second version of group_interval that only applied to resolved alerts.

(I wouldn't say that Alertmanager is 'stateless', but it does try to keep relatively little state, especially once things are over. This is sensible if you're in a large scale environment where a ton of alerts from a ton of different groups go sluicing through the system.)

Since there's no way to do it today, I haven't thought very much about what we'd want in a hypothetical alert notification aggregation environment. There's an obvious tradeoff between prompt notifications of a new situation and aggregating quick-cycling alerts together, so we'd probably want no aggregation to happen until an alert had bounced around 'too much' in the recent past. Or maybe this should be phrased as 'only send N notifications about any particular group of alerts in X minutes', so you'd have an initial notification budget that could be used up in individual alert and resolution notifications, but once you'd hit the rate limit, things would get aggregated.

(Rate-limiting separate alert notifications strikes me as a useful mental model, although as mentioned I wouldn't want rate-limited notifications to disappear entirely; I'd want some sort of summary of them. A crude approach would be to append all of the individual notifications together, following the old model of getting mailing list messages in periodic digests.)
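The 'only send N notifications in X minutes, then aggregate' idea, combined with the crude digest approach, might be sketched like this in Python. Everything here is hypothetical (the class, its method names, and the digest format); it assumes one budget object per group of alerts and that the caller supplies the current time in seconds and decides when to flush the digest.

```python
from collections import deque


class NotificationBudget:
    """Allow up to max_notifications per window; hold the rest for a digest."""

    def __init__(self, max_notifications, window_seconds):
        self.max_notifications = max_notifications
        self.window_seconds = window_seconds
        self.sent_times = deque()  # times of recent immediate notifications
        self.pending = []          # messages held back for the next digest

    def offer(self, now, message):
        """Return the message if it may be sent right away, else None."""
        # Forget immediate sends that have aged out of the window.
        while self.sent_times and now - self.sent_times[0] >= self.window_seconds:
            self.sent_times.popleft()
        if len(self.sent_times) < self.max_notifications:
            self.sent_times.append(now)
            return message
        # Budget used up: keep the message so it isn't lost, just delayed.
        self.pending.append(message)
        return None

    def flush_digest(self):
        """Return one combined message for everything held back, or None."""
        if not self.pending:
            return None
        digest = f"{len(self.pending)} notifications aggregated:\n" + "\n".join(self.pending)
        self.pending = []
        return digest
```

With a budget of, say, `NotificationBudget(2, 600)`, the first alert and its resolution go out immediately, and anything more within ten minutes waits for whenever `flush_digest()` is next called. A real version would want a smarter summary than appended messages, but this preserves the property that rate-limited notifications don't vanish.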

Written on 07 February 2023.


Last modified: Tue Feb 7 22:57:48 2023