Finding a good use for keep_firing_for in our Prometheus alerts
A while back (in 2.42.0), Prometheus
introduced a feature to artificially keep alerts firing for some
amount of time after their alert condition had cleared; this is
'keep_firing_for'. At the time, I said that I didn't really
see a use for it for us, but I now
have to change that. Not only do we have a use for it, it's one
that deals with a small problem in our large scale alerts.
Our 'there is something big going on' alerts exist only to inhibit
our regular alerts. They trigger when there seems to be 'too much'
wrong, ideally fast enough that their inhibition effect stops the
normal alerts from going out. Because normal alerts from big issues
being resolved don't necessarily clear out immediately, we want our
large scale alerts to linger on for some time after the number of
problems we have drops below their trigger point. Among other things,
this avoids a gotcha with inhibitions and resolved alerts. Because we
created these alerts before v2.42.0, we implemented the effect of
lingering on by using max_over_time() on the alert conditions (this
was the old way of giving an alert a minimum duration).
The subtle problem with using max_over_time() this way is that it means you can't usefully use a 'for:' condition to de-bounce your large scale alert trigger conditions. For example, if one of the conditions is 'there are too many ICMP ping probe failures', you'd potentially like to only declare a large scale issue if this persisted for more than one round of pings; otherwise a relatively brief blip of a switch could trigger your large scale alert. But because you're using max_over_time(), no short 'for:' will help; once you briefly hit the trigger number, the condition is effectively latched for the entire lingering time of the large scale alert.
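To make the problem concrete, here is a sketch of what a max_over_time() based rule of this sort can look like. This is an illustration, not our actual rule; the alert name, the 'job' label value, the threshold of 10 failures, and the 15 minute lingering time are all made up, and I'm assuming ping probes that report the Blackbox exporter's probe_success metric.

```yaml
groups:
  - name: largescale
    rules:
      # Hypothetical alert name, label value, threshold, and durations.
      - alert: LargeScaleIssue
        # The inner expression only produces a sample while more than 10
        # ping probes are failing. The subquery plus max_over_time() makes
        # the whole expression return a result if that was true at any
        # point in the past 15 minutes, which is what makes the alert linger.
        expr: |
          max_over_time(
            (count(probe_success{job="ping"} == 0) > 10)[15m:1m]
          )
        # This 'for:' doesn't de-bounce anything. One subquery sample over
        # the threshold keeps the expression true for the next 15 minutes,
        # which easily outlasts the one minute 'for:'.
        for: 1m
```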
Switching to extending the large scale alert directly with
'keep_firing_for' fixes this issue, and also simplifies the
alert rule expression. Once we're no longer using max_over_time(),
we can set 'for: 1m' or another useful short number to de-bounce
our large scale alert trigger conditions.
(The drawback is that now we have a single de-bounce interval for all of the alert conditions, whereas before we could possibly have a more complex and nuanced set of conditions. For us, this isn't a big deal.)
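As a sketch of what the keep_firing_for version of such a rule can look like (again this is an illustration with a made-up alert name, 'job' label value, threshold, and durations, not our real rule, and it assumes the Blackbox exporter's probe_success metric):

```yaml
groups:
  - name: largescale
    rules:
      # Hypothetical alert name, label value, threshold, and durations.
      - alert: LargeScaleIssue
        # The bare trigger condition, with no max_over_time() wrapper.
        expr: count(probe_success{job="ping"} == 0) > 10
        # De-bounce: the condition has to hold for a minute before the
        # alert starts firing, so a brief blip never gets past pending.
        for: 1m
        # Lingering: once firing, the alert stays firing for 15 minutes
        # after the condition clears (needs Prometheus 2.42.0 or later).
        keep_firing_for: 15m
```

Here the 'for:' applies to the actual trigger condition, and the lingering only starts once the alert has genuinely fired.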
I suspect that this may be generic to most uses of max_over_time() in alert rule expressions (fortunately, this was our only use of it). Possibly there are reasonable uses for it in sub-expressions, clever hacks, and maybe also in working with times and durations.