Some things on Prometheus's new feature to keep alerts firing for a while

February 5, 2023

In the past I've written about maybe avoiding flapping Prometheus alerts, which is a topic of interest to us for obvious reasons. One of the features in Prometheus 2.42.0 is a new 'keep_firing_for' setting for alert rules (documented in Alerting rules, see also the pull request). As described in the documentation, it specifies 'how long an alert will continue firing after the condition that triggered it has cleared' and defaults to being off (0 seconds).
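As a minimal sketch, an alert rule using the new setting might look something like this. The group name, alert name, expression, labels, and durations here are all made up for illustration; only 'for' and 'keep_firing_for' are the actual rule settings under discussion, and this needs Prometheus 2.42.0 or later.

```yaml
groups:
  - name: example-alerts            # hypothetical group name
    rules:
      - alert: HostDown             # hypothetical alert
        expr: up{job="node"} == 0   # illustrative condition
        for: 2m                     # condition must hold this long before firing
        keep_firing_for: 1m         # stay firing this long after the condition clears
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.instance }} is down"
```

Leaving 'keep_firing_for' out entirely is the same as setting it to 0, which gives you the old behavior.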

The obvious use of 'keep_firing_for' is to avoid having your alerts flap too much. If you set it to some non-zero value, say a minute, then if the alert condition temporarily goes away only to come back within a minute, you potentially avoid notifying people that the alert went away and then notifying them again that it came back. I say 'potentially' because when you can get notified about an alert going away is normally quantized by your Alertmanager group_interval setting. This simple alert rule setting can replace more complex methods of avoiding flapping alerts, so various people will likely use it.
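For reference, group_interval lives in the Alertmanager routing configuration, not in your Prometheus alert rules. A rough sketch of where it goes, with a made-up receiver name and illustrative timings, is:

```yaml
route:
  receiver: team-email        # hypothetical receiver
  group_by: ['alertname']
  group_wait: 30s             # delay before the first notification for a new group
  group_interval: 5m          # minimum time between further notifications for a group
  repeat_interval: 4h

receivers:
  - name: team-email          # a real setup would have email_configs or the like here
```

With something like this, changes in an already-notified group (including alerts resolving) generally go out only on group_interval boundaries, so a brief flap may already be invisible to people even without 'keep_firing_for'.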

When 2.42.0 came out recently with this feature, I started thinking about whether we would use it. My reluctant conclusion is that we probably won't in most places, because it doesn't do quite what we want and it has some side effects that we care about (although these side effects are the same as most of the other ways of avoiding flapping alerts). The big side effect is that this doesn't delay or suppress notifications about the alert ending, it delays the alert itself ending. The delay in notification is a downstream effect of the alert itself remaining active. If you care about being able to visualize the exact time ranges of alerts in (eg) Grafana, then artificially keeping alerts firing may not be entirely appealing.

(This is especially relevant if you keep your metrics data for a long time, as we do. Our alert rules evolve over time, so without a reliable ALERTS metric we might have to go figure out the historical alert rule to recover the alert end time for a long-past alert.)

This isn't the fault of 'keep_firing_for', which is doing exactly what it says it does and what people have asked for. Instead it's because we care (potentially) more about delaying and aggregating alert notifications than we do about changing the timing of the actual alerts. What I actually want is something rather more complicated than Alertmanager supports, and is for another entry.

Written on 05 February 2023.