Wandering Thoughts archives

2021-08-12

Prometheus, Alertmanager, and maybe avoiding flapping alerts

One of the things I wish Prometheus and Alertmanager had is better and more accessible support for avoiding flapping alerts. Flapping alerts are alerts that (if written naturally) stutter back and forth between being on and being off. One natural way to get a stuttering alert is to write a simple, natural PromQL alert rule expression like:

expr: some_metric > 1000

If the metric is hovering just around 1000, sometimes above and sometimes below, this alert will turn on and then off again on a frequent basis. The only limit on how often you'll get notified about it is your Alertmanager group_wait and group_interval settings (see my entry on alert delays and timings for more).
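For orientation, those settings live in your Alertmanager routing configuration. A minimal sketch of the relevant part might look like this (the receiver name and the specific timings are made-up illustrations, not recommendations):

route:
  receiver: 'our-default-receiver'
  group_by: ['alertname']
  # How long to wait before sending the first notification for a new alert group.
  group_wait: 30s
  # How long to wait before notifying about new alerts added to a group
  # that has already been notified about.
  group_interval: 5m
  # How long to wait before re-sending a notification for alerts that are
  # still firing.
  repeat_interval: 4h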

We can sort of de-bounce this slightly by requiring the condition to be true for a certain length of time:

expr: some_metric > 1000
for: 3m

But note what we've done here; we've made it so that we'll never be alerted for any spike in this metric that's less than three minutes long (or actually three minutes plus your group_wait, most likely). This is not so much de-bouncing the alert as changing what we alert on. We can require longer and longer time intervals, which will generate fewer alerts, but also change the meaning of the alert more and more.

We can force an alert to have a minimum duration by using max_over_time:

expr: max_over_time( some_metric[10m] ) > 1000

Now this alert will always fire for at least ten minutes, even if the metric is only briefly over 1000. There's no real point in combining this with a for condition unless the for duration is longer than the minimum firing duration. With a subquery using min_over_time, you can brute-force a for for its original purpose of requiring the metric to always be high for a certain period:

expr: max_over_time( ( min_over_time(some_metric[3m]) )[10m:15s] ) > 1000

Once there's a three minute interval where the metric is always over 1000, we latch the alert on for ten minutes (and then may not immediately get notified about it being 'resolved').
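For concreteness, here is a sketch of what this looks like as a full alert rule in a Prometheus rules file; the group name, alert name, labels, and annotations are all made up for illustration:

groups:
  - name: latched-alerts
    rules:
      - alert: SomeMetricHigh
        # Fires once some_metric has stayed over 1000 for a full three
        # minutes, then keeps firing for at least ten minutes.
        expr: max_over_time( ( min_over_time(some_metric[3m]) )[10m:15s] ) > 1000
        labels:
          severity: warning
        annotations:
          summary: "some_metric has been over 1000 for at least three minutes"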

Since Prometheus exposes pending and firing alerts in the ALERTS and ALERTS_FOR_STATE metrics, you can glue together even more complicated and hard-to-follow PromQL expressions that look back to see whether there have been recent alerts for your expression, and perhaps how many times it's fired recently. We don't do this today, so beyond the rough and untested sketch below, working out a full version is left as an exercise for the reader.
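As a rough, untested starting point for that exercise, here are the sort of building blocks involved, using the hypothetical SomeMetricHigh alert from above (my understanding is that ALERTS_FOR_STATE's value is the time the alert became active, so changes() over it approximately counts re-activations):

# Roughly how long this alert has been firing over the past hour,
# counted in rule evaluation samples.
count_over_time(ALERTS{alertname="SomeMetricHigh", alertstate="firing"}[1h])

# Roughly how many separate times this alert has (re)activated in the
# past hour.
changes(ALERTS_FOR_STATE{alertname="SomeMetricHigh"}[1h])

You would then have to fold something like this into your alert rule's expression with appropriate label matching, which is where the 'complicated and hard to follow' part really starts.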

There are three issues here in general. The first is that none of this is natural; the innocent, natural way to write PromQL alert rules leaves you with flapping alerts. The second is that there are no simple, easy-to-follow fixes. Some of the obvious fixes, like monkeying around with group_wait and group_interval, can have unexpected side effects because of how group_interval actually works.

Finally, it's very hard (if it's possible at all) to do more sophisticated things. For example, it would be nice to be able to promptly alert and de-alert the first time (or the first few times) but then start backing off on re-alerts so that you get fewer and fewer of them (or you get some different summary of the situation). It might be possible to do this sort of backoff in a PromQL alert rule, given the ALERTS metric, but it's certainly not going to be easy to write or to follow later.

Avoiding flapping alerts is a hard problem, and there are real questions of what you want to do about them (especially within Prometheus's model of alerts). But I wish that Prometheus and Alertmanager at least exposed more tools to let you deal with the problem and didn't hand out foot-guns so readily.

(Probably most of our alerts could flap, although they usually don't, just because it's so much more complex to try to de-flap things beyond a 'for' interval.)

sysadmin/PrometheusAlertmanagerFlapping written at 00:35:28

