Prometheus alerts and the idea of "deadbands" (or maybe hysteresis) (with an implementation)
In a comment on my entry on maybe avoiding flapping alerts, antiphase
brought up the concept of a deadband, although what I'm going to talk
about might be considered hysteresis instead. Put informally, the idea
is that you have a different threshold for turning an alert on and for
turning it off. For example, you could trigger an alert when
some_metric went over a value of 1000, but not turn the alert off
until the value fell below 900. This band where the alert won't turn
on but will stay on if it's already on de-flaps the alert by
effectively requiring much larger swings in the metric value to
re-trigger it; it can't be triggered repeatedly by a small oscillation
around 1000.
Prometheus has no native support for this. Alert rule expressions are either 'true' (ie, yielding a value) or they aren't. If they're true, the alert is firing; if they're not, the alert isn't. There's no separate alert rule expression for when to stop triggering an alert. But since Prometheus exposes a metric for whether an alert is firing, we can (in theory) write our own deadband expression.
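For illustration, while SomeAlert is firing, the ALERTS metric
contains a time series shaped like this, with a value of 1 (the series
also carries all of the alert's other labels, which is what will force
the ignoring() below):

    ALERTS{alertname="SomeAlert", alertstate="firing"} 1

(Alerts that are still pending because of a 'for' delay show up with
alertstate="pending" instead.)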
The following is untested (except for syntax), but I think it would generally be like this (using a YAML '|' block scalar to embed the multi-line PromQL expression):
    - alert: SomeAlert
      expr: |
        some_metric > 1000
          or
        ( some_metric >= 900
          and ignoring(alertname, alertstate, ...)
          ALERTS{alertname="SomeAlert", alertstate="firing"} )
If you add extra labels to your SomeAlert alert in the alert rule,
you'll need to add them to the ignoring().
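For a concrete (if made-up) example, a rule with a severity label
might look like this; severity has to go into the ignoring() because
ALERTS will carry it while some_metric doesn't:

    - alert: SomeAlert
      expr: |
        some_metric > 1000
          or
        ( some_metric >= 900
          and ignoring(alertname, alertstate, severity)
          ALERTS{alertname="SomeAlert", alertstate="firing"} )
      labels:
        severity: page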
The first simple expression is our initial trigger, that some_metric
is above 1000. The first bit of our parenthesized expression is our
setting for not turning off the alert (ie, for continuing it), which
is that some_metric is 900 or higher. Then the whole 'and
ignoring(...) ALERTS{...}' portion of the expression is the simple
condition of 'is the alert currently firing'. So the alert should be
on if either the metric is above the initial trigger level, or the
alert is currently on and the metric hasn't yet fallen below our
cut-off value.
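To spell the logic out, here's my own worked summary of the four
interesting cases (written as comments, not real Prometheus output):

    # metric 1200, alert off: first branch true    -> alert turns on
    # metric  950, alert on:  second branch true   -> alert stays on
    # metric  950, alert off: both branches false  -> stays off (the deadband)
    # metric  880, alert on:  both branches false  -> alert resolves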
This alert rule can usefully be used with a 'for' limitation, which
would make it not trigger until some_metric had been above 1000 for
however long. If you use a 'for', you probably really want to make
sure you restrict the ALERTS match to a firing alert. Otherwise what
you have is an alert that will trigger if at one point some_metric
goes above 1000 and then doesn't fall below 900 for the 'for'
duration. (Of course, you might want such an alert.)
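A sketch of that combination, with a made-up five minute duration;
here the alertstate="firing" match is what stops the rule from
latching on to its own 'pending' state while the 'for' timer is still
running:

    - alert: SomeAlert
      expr: |
        some_metric > 1000
          or
        ( some_metric >= 900
          and ignoring(alertname, alertstate)
          ALERTS{alertname="SomeAlert", alertstate="firing"} )
      for: 5m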
While this works (I think), I'm relatively sure that this is being too clever and complicated. Probably you want to try to use other approaches to de-flapping alerts, ones that are simpler and easier to understand. But if someday I absolutely have to do this, at least I've worked it out now.