2023-02-05
Some things on Prometheus's new feature to keep alerts firing for a while
In the past I've written about maybe avoiding flapping Prometheus
alerts, which is a topic of interest
to us for obvious reasons. One of the features in Prometheus
2.42.0 is a new 'keep_firing_for' setting for alert rules (documented
in Alerting rules,
see also the pull request). As described
in the documentation, it specifies 'how long an alert will continue
firing after the condition that triggered it has cleared' and
defaults to being off (0 seconds).
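
As a concrete illustration, here is a minimal sketch of what an alert rule using the setting might look like. The group name, alert name, expression, labels, and durations are all made up for the example; the only part that matters here is 'keep_firing_for' sitting alongside the usual 'for'.

```yaml
groups:
  - name: example-alerts            # hypothetical group name
    rules:
      - alert: HostDown             # hypothetical alert name
        expr: up{job="node"} == 0   # illustrative condition only
        for: 2m                     # condition must hold this long before firing
        keep_firing_for: 1m         # stay firing this long after the condition clears
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.instance }} is down"
```

Leaving 'keep_firing_for' out (or setting it to 0s) gives the old behavior, where the alert ends as soon as the condition does.
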
The obvious use of 'keep_firing_for' is to avoid having your
alerts flap too much. If you set it to some non-zero value, say a
minute, then if the alert condition temporarily goes away only to
come back within a minute, you won't potentially wind up notifying
people that the alert went away and then notifying them again that it came
back. I say 'potentially', because when you can get notified about
an alert going away is normally quantized by your Alertmanager
group_interval setting. This
simple alert rule setting can replace more complex methods of
avoiding flapping alerts, and so
there are various people who will likely use it.
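
For reference, group_interval lives in Alertmanager's routing configuration, not in your Prometheus rules. A minimal sketch of where it goes, assuming a single hypothetical receiver and illustrative timings (these are not our actual settings):

```yaml
route:
  receiver: sysadmins         # hypothetical receiver name
  group_by: ['alertname']
  group_wait: 30s             # initial wait before the first notification for a new group
  group_interval: 5m          # minimum time between notifications about changes to a group
receivers:
  - name: sysadmins           # a real receiver would also need a notification config
```

With something like this, an alert that clears and comes back inside one group_interval may never generate a 'resolved' notification even without 'keep_firing_for'.
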
When 2.42.0 came out recently with this feature, I started thinking about whether we would use it. My reluctant conclusion is that we probably won't in most places, because it doesn't do quite what we want and it has some side effects that we care about (although these side effects are the same as most of the other ways of avoiding flapping alerts). The big side effect is that this doesn't delay or suppress notifications about the alert ending, it delays the alert itself ending. The delay in notification is a downstream effect of the alert itself remaining active. If you care about being able to visualize the exact time ranges of alerts in (eg) Grafana, then artificially keeping alerts firing may not be entirely appealing.
(This is especially relevant if you keep your metrics data for a
long time, as we do. Our alert rules evolve over time, so without
a reliable ALERTS metric we might have to go figure out the
historical alert rule to recover the alert end time for a long-past
alert.)
This isn't the fault of 'keep_firing_for', which is doing
exactly what it says it does and what people have asked
for. Instead it's because we care (potentially) more about
delaying and aggregating alert notifications than we do about
changing the timing of the actual alerts. What I actually
want is something rather more complicated than Alertmanager supports,
and that is a subject for another entry.