Prometheus alerts and the idea of "deadbands" (or maybe hysteresis) (with an implementation)

August 13, 2021

In a comment on my entry on maybe avoiding flapping alerts, antiphase brought up the concept of a deadband, although what I'm going to talk about might be considered hysteresis instead. Put informally, the idea is that you have one threshold for turning an alert on and a different one for turning it off. For example, you could trigger an alert when some_metric went over a value of 1000, but not turn the alert off until the value fell below 900. In the band between 900 and 1000, the alert won't turn on but will stay on if it's already on. This de-flaps the alert by effectively requiring much larger swings in the metric value to re-trigger it; it can't be triggered repeatedly by a small oscillation around 1000.
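To make the effect concrete, here is a minimal Python sketch of the idea, using the 1000 and 900 thresholds from the example. The function name deadband_states and the sample values are made up for illustration; this models the concept only, not anything Prometheus does internally.

```python
def deadband_states(values, on_above=1000, off_below=900):
    """Return the alert state (True = on) after each sample."""
    firing = False
    states = []
    for v in values:
        if v > on_above:
            firing = True       # crossed the turn-on threshold
        elif v < off_below:
            firing = False      # dropped below the turn-off threshold
        # Anywhere from off_below to on_above, the previous state is kept.
        states.append(firing)
    return states


# Hypothetical samples that oscillate around 1000 before dropping away.
samples = [950, 1001, 999, 1002, 998, 950, 899, 950, 1001]

print(deadband_states(samples))
# [False, True, True, True, True, True, False, False, True]

# A single-threshold alert on the same samples flaps on every oscillation.
print([v > 1000 for v in samples])
# [False, True, False, True, False, False, False, False, True]
```

With a single threshold the alert turns on three separate times here; with the deadband it turns on twice, and only because the metric really did fall below 900 in between.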

Prometheus has no native support for this. An alert rule expression is either 'true' (i.e., it yields a value) or it isn't. If it's true, the alert is firing; if it's not, the alert isn't. There's no separate alert rule expression for when to stop triggering an alert. But since Prometheus exposes a metric for whether an alert is firing (ALERTS), we can (in theory) write our own deadband expression.

The following is untested (except for syntax), but I think it would generally be like this (the expression is written as a YAML block scalar so it can span several lines, and the '...' in the ignoring() is a placeholder, not literal PromQL):

- alert: SomeAlert
  expr: |
    some_metric > 1000 or
      ( some_metric >= 900
        and ignoring(alertname, alertstate, ...)
          ALERTS{alertname="SomeAlert", alertstate="firing"} )

If you add extra labels to your SomeAlert alert in the alert rule, you'll need to add them to the ignoring().

The first simple expression is our initial trigger, that some_metric is above 1000. The first bit of our parenthesized expression is the condition for keeping the alert on once it's on, which is that some_metric is 900 or higher. Then the whole 'and ignoring(...) ALERTS{...}' portion of the expression is the simple condition of 'is the alert currently firing'. So the alert should be on if either the metric is above the initial trigger level, or the alert is currently on and the metric hasn't yet fallen below our cut-off value.
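The logic of the rule above boils down to a small boolean function, which can be written out as a sketch in Python. The name rule_is_true and its parameters are hypothetical, and 'alert_firing' stands in for whether the ALERTS match finds a firing SomeAlert.

```python
def rule_is_true(metric, alert_firing, trigger=1000, hold=900):
    # Mirrors: some_metric > 1000
    #          or (some_metric >= 900 and <SomeAlert is currently firing>)
    return metric > trigger or (metric >= hold and alert_firing)


# Walk through the three regions: above the trigger, inside the band, below it.
for metric in (1200, 950, 850):
    for alert_firing in (False, True):
        print(metric, alert_firing, rule_is_true(metric, alert_firing))
```

Only the middle region (900 up to 1000) gives a different answer depending on whether the alert is already firing, which is exactly the deadband.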

This alert rule can usefully be combined with a 'for' limitation, which would make it not trigger until some_metric had been above 1000 for however long. If you use a 'for', you probably really want to make sure you restrict the ALERTS match to a firing alert, because an alert that is still waiting out its 'for' period also shows up in ALERTS, with alertstate="pending". Otherwise what you have is an alert that will trigger if at one point some_metric goes above 1000 and then doesn't fall below 900 for the 'for' duration. (Of course, you might want such an alert.)
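The difference between matching only firing alerts and matching any ALERTS entry can be seen in a rough simulation. This is a simplified model under my own assumptions, not how Prometheus is actually implemented: time is counted in rule evaluations, the 'for' duration is a number of evaluations, and each evaluation sees the alert state left by the previous one. The simulate function and the sample values are hypothetical.

```python
def simulate(values, for_steps, match_states, trigger=1000, hold=900):
    """Return the alert state after each rule evaluation.

    match_states is the set of alert states that the ALERTS part of the
    expression is allowed to match, e.g. {"firing"}.
    """
    state = "inactive"
    active_for = 0  # evaluations since the expression first became true
    history = []
    for v in values:
        prev_matches = state in match_states
        expr_true = v > trigger or (v >= hold and prev_matches)
        if not expr_true:
            state = "inactive"
        else:
            if state == "inactive":
                state = "pending"
                active_for = 0
            else:
                active_for += 1
            if state == "pending" and active_for >= for_steps:
                state = "firing"
        history.append(state)
    return history


# One brief spike above 1000, then the metric sits inside the band.
samples = [1001, 950, 950, 950, 950]

print(simulate(samples, for_steps=3, match_states={"firing"}))
# ['pending', 'inactive', 'inactive', 'inactive', 'inactive']

print(simulate(samples, for_steps=3, match_states={"pending", "firing"}))
# ['pending', 'pending', 'pending', 'firing', 'firing']
```

In the first run the pending alert is dropped as soon as the metric falls back under 1000. In the second, the pending alert keeps its own expression true, so a single spike followed by values in the band is enough to make it fire.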

While this works (I think), I'm relatively sure that this is being too clever and complicated. Probably you want to try to use other approaches to de-flapping alerts, ones that are simpler and easier to understand. But if someday I absolutely have to do this, at least I've worked it out now.

