Finding a good use for keep_firing_for in our Prometheus alerts

November 12, 2024

A while back (in 2.42.0), Prometheus introduced a feature to artificially keep alerts firing for some amount of time after their alert condition had cleared; this is 'keep_firing_for'. At the time, I said that I didn't really see a use for it for us, but I now have to change that. Not only do we have a use for it, it's one that deals with a small problem in our large scale alerts.

Our 'there is something big going on' alerts exist only to inhibit our regular alerts. They trigger when there seems to be 'too much' wrong, ideally fast enough that their inhibition effect stops the normal alerts from going out. Because normal alerts from big issues being resolved don't necessarily clear out immediately, we want our large scale alerts to linger on for some time after the number of problems we have drops below their trigger point. Among other things, this avoids a gotcha with inhibitions and resolved alerts. Because we created these alerts before v2.42.0, we implemented the effect of lingering on by using max_over_time() on the alert conditions (this was the old way of giving an alert a minimum duration).
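
A minimal sketch of this older style of rule, to make the shape concrete. Everything specific in it is an assumption for illustration and not the real rule: the alert name, the job="icmp" selector on the Blackbox exporter's probe_success metric, the threshold of 10 failures, and the ten minute lingering window.

```yaml
groups:
  - name: large-scale-sketch        # hypothetical group name
    rules:
      - alert: LargeScaleIssue      # hypothetical alert name
        # The subquery re-evaluates the failure count across the past 10
        # minutes and max_over_time() takes the highest value it saw, so
        # the condition stays true for roughly 10 minutes after the count
        # has dropped back under the (made up) threshold of 10.
        expr: max_over_time( (count(probe_success{job="icmp"} == 0))[10m:] ) > 10
```

The lingering here comes entirely from the expression itself; the alert rule has no idea that it's being artificially extended.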

The subtle problem with using max_over_time() this way is that it means you can't usefully use a 'for:' condition to de-bounce your large scale alert trigger conditions. For example, if one of the conditions is 'there are too many ICMP ping probe failures', you'd potentially like to only declare a large scale issue if this persisted for more than one round of pings; otherwise a relatively brief blip of a switch could trigger your large scale alert. But because you're using max_over_time(), no short 'for:' will help; once you briefly hit the trigger number, it's effectively latched for our large scale alert lingering time.

Switching to extending the large scale alert directly with 'keep_firing_for' fixes this issue, and also simplifies the alert rule expression. Once we're no longer using max_over_time(), we can set 'for: 1m' or another useful short number to de-bounce our large scale alert trigger conditions.
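
As a rough sketch of what the switched-over rule could look like (again, the alert name, the job="icmp" selector, the threshold of 10, and the ten minute 'keep_firing_for' are assumptions for illustration; only 'for: 1m' comes from the text above):

```yaml
groups:
  - name: large-scale-sketch        # hypothetical group name
    rules:
      - alert: LargeScaleIssue      # hypothetical alert name
        # The plain trigger condition, with no max_over_time() wrapper.
        expr: count(probe_success{job="icmp"} == 0) > 10
        # De-bounce: the condition must hold for a minute before firing.
        for: 1m
        # Linger: keep firing for this long after the condition clears
        # (requires Prometheus 2.42.0 or later).
        keep_firing_for: 10m
```

With this shape, a brief blip that pushes the count over the threshold for a single evaluation never gets past 'for:', so there is nothing for 'keep_firing_for' to extend.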

(The drawback is that now we have a single de-bounce interval for all of the alert conditions, whereas before we could possibly have a more complex and nuanced set of conditions. For us, this isn't a big deal.)

I suspect that this may be generic to most uses of max_over_time() in alert rule expressions (fortunately, this was our only use of it). Possibly there are reasonable uses for it in sub-expressions, in clever hacks, and maybe also in rules that work with times and durations.


Last modified: Tue Nov 12 23:06:05 2024