An issue with Alertmanager inhibitions and resolved alerts
Prometheus Alertmanager has a feature called inhibitions, where one alert can inhibit other alerts. We use this in a number of situations, such as our special 'there is a large scale problem' alert inhibiting other alerts. Recently, due to this mailing list thread, I realized that there is a complication in how inhibitions interact with notifications about resolved alerts.
Suppose that you have an inhibition rule to the effect that alert
A ('this host is down') inhibits alert B ('this special host daemon
is down'), and you send notifications on resolved alerts. With alert
A in effect, every time Alertmanager goes to send out a notification
for the alert group that alert B is part of, Alertmanager will see
that alert B is inhibited and filter it out (as far as I can tell
this is the basic effect of Alertmanager silences, inhibitions, and
mutes). Such notifications will (potentially) happen on every
group_interval tick.
Now suppose that both alert A and alert B resolve at more or less
the same time (because the host is back up along with its special
daemon). Alertmanager doesn't immediately send notifications for
resolved alerts; instead, just like all other alert group
re-notifications, they wait for the next group_interval tick. When
this tick happens, alert B will be a
resolved alert that you should normally be notified about, and alert
A will no longer be active and so no longer inhibiting it. You'll
receive a potentially surprising notification about the now-resolved
alert B, even though it was previously inhibited while it was active
(and so you may never have received an initial notification that
it was active).
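To make the timing concrete, here is a minimal sketch of the sequence above as a toy model in Python. This is not Alertmanager's code; the Alert class, the function names, and the assumption that the alert group is flushed at exact multiples of group_interval starting from time 0 are all mine, made up for illustration. It also assumes that a resolved alert is dealt with exactly once, at the first tick after it resolves, whether or not it's inhibited at that moment.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical stand-in for an alert: it fires from start until end (seconds)."""
    name: str
    start: int
    end: int


def is_firing(alert, now):
    # An alert counts as firing from its start time up to (not including) its end time.
    return alert.start <= now < alert.end


def notifications_for_b(a, b, group_interval, until):
    """Toy model: list the (time, state) notifications sent about alert B.

    Assumes the group is flushed at t = 0, group_interval, 2 * group_interval, ...
    and that alert A inhibits alert B whenever A is firing at flush time.
    """
    sent = []
    notified_firing = False
    resolved_handled = False
    t = 0
    while t <= until:
        inhibited = is_firing(a, t)
        if is_firing(b, t):
            # Only the first uninhibited sighting of a firing B is reported here.
            if not inhibited and not notified_firing:
                notified_firing = True
                sent.append((t, "firing"))
        elif t >= b.end and not resolved_handled:
            # The resolved alert is dealt with once, at the first tick after it ends.
            resolved_handled = True
            if not inhibited:
                sent.append((t, "resolved"))
        t += group_interval
    return sent


# A and B resolve five seconds apart, between the ticks at 600 and 900.
host_down = Alert("A", start=10, end=620)
daemon_down = Alert("B", start=20, end=615)
print(notifications_for_b(host_down, daemon_down, group_interval=300, until=1200))
```

In this model the only notification ever sent about alert B is the 'resolved' one at the 900 second tick, because alert A stopped firing at 620 seconds and so is no longer inhibiting anything by then.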
(Although I described it as both alerts resolving around the same time, it doesn't have to be that way; alert A might have ended later than B, with some hand-waving and uncertainty. The necessary condition is for alert A and its inhibition to no longer be in effect when Alertmanager is about to process a notification that includes alert B's resolution.)
The consequence of this is that if you want inhibitions to reliably
suppress notification about resolved alerts, you need the inhibiting
alert to be active at least one group_interval longer than the
alerts it's inhibiting. In some cases this is easy to arrange, but in
other cases it may be troublesome and so you may want to simply live
with the extra notifications about resolved alerts.
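The 'one group_interval longer' rule can be checked with a little arithmetic. The sketch below is again a toy model and not anything from Alertmanager itself; the function names are hypothetical, and it assumes flushes happen at first_tick plus whole multiples of group_interval and that an alert stops inhibiting at the instant it ends.

```python
import math


def next_tick_at_or_after(t, first_tick, group_interval):
    """Time of the first flush tick at or after time t (all values in seconds)."""
    if t <= first_tick:
        return first_tick
    steps = math.ceil((t - first_tick) / group_interval)
    return first_tick + steps * group_interval


def resolution_suppressed(a_end, b_end, first_tick, group_interval):
    """True if alert A is still firing at the tick that handles B's resolution."""
    flush = next_tick_at_or_after(b_end, first_tick, group_interval)
    return a_end > flush


group_interval = 300
b_end = 615

# A ends five seconds after B: the tick at 900 sees no inhibition.
print(resolution_suppressed(620, b_end, 0, group_interval))

# A ends one full group_interval after B: the tick at 900 is still covered.
print(resolution_suppressed(b_end + group_interval, b_end, 0, group_interval))
```

Because the handling tick always comes less than one group_interval after alert B resolves, keeping alert A active for a full group_interval past that point is enough in this model, regardless of where the ticks happen to fall.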
(The longer your 'group_interval' setting is, the worse this gets,
but there are a number of reasons you probably want group_interval
to be relatively short, including prompt notifications about resolved
alerts under normal circumstances.)