Prometheus, Alertmanager, and maybe avoiding flapping alerts
One of the things I wish Prometheus and Alertmanager had is better and more accessible support for avoiding flapping alerts. Flapping alerts are alerts that (if written naturally) stutter back and forth between being on and being off. One natural way to get a stuttering alert is to write a simple PromQL alert rule expression like:
expr: some_metric > 1000
If the metric is hovering just around 1000, sometimes above and
sometimes below, this alert will turn on and then off again on a
frequent basis. The only limits on how often you'll get notified
about it are your Alertmanager group_wait
and group_interval
settings (see my entry on alert delays and timings for more).
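(For reference, group_wait and group_interval are per-route settings in
your Alertmanager configuration. As a purely illustrative sketch, with
made-up values rather than anything we actually use:

route:
  group_wait: 30s
  group_interval: 5m

With settings like these, an alert that keeps resolving and re-firing
can wind up notifying you roughly every five minutes.)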
We can sort of de-bounce this slightly by requiring the condition to be true for a certain length of time:
expr: some_metric > 1000
for: 3m
But note what we've done here; we've made it so that we'll never
be alerted for any spike in this metric that's less than three
minutes long (or actually three minutes plus your group_wait,
most likely). This is not so much de-bouncing the alert as changing
what we alert on. We can require longer and longer time intervals,
which will generate fewer alerts, but also change the meaning of
the alert more and more.
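(If it helps to see this in context, here's a hypothetical version of
this as a complete alerting rule; the group name, alert name, labels,
and annotations are all made up for illustration:

groups:
  - name: example
    rules:
      - alert: SomeMetricHigh
        expr: some_metric > 1000
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "some_metric has been over 1000 for three minutes"

The 'expr' and 'for' fields in the rest of this entry all live at this
spot in a rule.)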
We can force an alert to have a minimum duration by using
max_over_time:
expr: max_over_time( some_metric[10m] ) > 1000
Now this alert will always fire for ten minutes, even if the metric
is only briefly over 1000. There's no real point in combining this
with a for condition unless the for is longer than the
minimum duration. With a subquery using min_over_time, you can
do a brute-force version of for for its original purpose of requiring the
metric to always be high for a certain period:
expr: max_over_time( ( min_over_time(some_metric[3m]) )[10m:15s] ) > 1000
Once there's a three-minute interval where the metric is always over 1000, we latch the alert on for ten minutes (and then may not immediately get notified about it being 'resolved').
Since Prometheus exposes pending and firing alerts in the ALERTS
and ALERTS_FOR_STATE
metrics,
you can glue together even more complicated and hard-to-follow PromQL
expressions that look back to see whether there have been recent alerts
for your expression, and perhaps how many times it's fired recently. We don't do this today and I'm not going
to try to write out untested example expressions, so this is left as an
exercise for the reader.
There are three issues here in general. The first is that none of
this is natural. The innocent natural way to write PromQL alert
rules leaves you with flapping alerts. The second is that there are
no simple, easy-to-follow fixes. Some of the obvious fixes, like
monkeying around with group_wait
and group_interval, can
have unexpected side effects because of how group_interval
actually works.
Finally, it's very hard (if it's possible at all) to do more
sophisticated things. For example, it would be nice to be able to
promptly alert and de-alert the first time (or first few times) but
then start backing off on re-alerts so that you get fewer and fewer
of them (or you get some different summary of the situation). It
might be possible to do this sort of backoff in a PromQL alert rule,
given the ALERTS
metric, but it's certainly not going to be
easy to write or to follow later.
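(That said, as a very rough and untested illustration of the 'different
summary' idea: assuming the basic alert is called SomeMetricHigh, a
second alert could look at how much of the past hour it has spent
firing. The threshold and time range here are made up:

expr: count_over_time( ALERTS{alertname="SomeMetricHigh", alertstate="firing"}[1h] ) > 20

ALERTS gets one sample per rule evaluation while the alert is firing,
so with a 15-second evaluation interval, 20 samples is about five
minutes of total firing time over the past hour. This doesn't back off
the original alert, but it does give you a separate 'this keeps
happening' signal.)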
Avoiding flapping alerts is a hard problem, and there are real questions of what you want to do about them (especially within Prometheus's model of alerts). But I wish that Prometheus and Alertmanager at least exposed more tools to let you deal with the problem and didn't hand out foot-guns so readily.
(Probably most of our alerts could flap, although they usually
don't, just because it's so much more complex to try to de-flap
things beyond a 'for' interval.)