An issue with Alertmanager inhibitions and resolved alerts

April 2, 2024

Prometheus Alertmanager has a feature called inhibitions, where one alert can inhibit other alerts. We use this in a number of situations; for example, our special 'there is a large scale problem' alert inhibits other alerts. Recently, prompted by this mailing list thread, I realized that there is a complication in how inhibitions interact with being notified about resolved alerts.

Suppose that you have an inhibition rule to the effect that alert A ('this host is down') inhibits alert B ('this special host daemon is down'), and you send notifications on resolved alerts. With alert A in effect, every time Alertmanager goes to send out a notification for the alert group that alert B is part of, Alertmanager will see that alert B is inhibited and filter it out (as far as I can tell this is the basic effect of Alertmanager silences, inhibitions, and mutes). Such notifications will (potentially) happen on every group_interval tick.

Now suppose that both alert A and alert B resolve at more or less the same time (because the host is back up along with its special daemon). Alertmanager doesn't immediately send notifications for resolved alerts; instead, just like all other alert group re-notifications, they wait for the next group_interval tick. When this tick happens, alert B will be a resolved alert that you should normally be notified about, and alert A will no longer be active and so no longer inhibiting it. You'll receive a potentially surprising notification about the now-resolved alert B, even though it was previously inhibited while it was active (and so you may never have received an initial notification that it was active).

(Although I described it as both alerts resolving around the same time, it doesn't have to be that way; alert A might have ended later than B, with some hand-waving and uncertainty. The necessary condition is for alert A and its inhibition to no longer be in effect when Alertmanager is about to process a notification that includes alert B's resolution.)

The consequence of this is that if you want inhibitions to reliably suppress notification about resolved alerts, you need the inhibiting alert to be active at least one group_interval longer than the alerts it's inhibiting. In some cases this is easy to arrange, but in other cases it may be troublesome and so you may want to simply live with the extra notifications about resolved alerts.
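To make the timing concrete, here is a minimal Python sketch of the reasoning above. It is a toy model, not Alertmanager's actual code: it assumes that group_interval ticks fall on exact multiples of group_interval counted from time 0, that alert B's resolution is processed at the first tick at or after B resolves, and that alert A inhibits at that tick only if A is still firing then. The function names are made up for illustration.

```python
def resolution_tick(resolved_at: int, group_interval: int) -> int:
    """Return the first tick at or after resolved_at (all times in seconds).

    Assumption: ticks happen at exact multiples of group_interval from time 0.
    Real alert groups tick relative to when the group was created.
    """
    # Integer ceiling division, then scale back up to a tick time.
    return -(-resolved_at // group_interval) * group_interval


def resolved_notification_sent(
    b_resolved_at: int, a_resolved_at: int, group_interval: int
) -> bool:
    """Would you be notified that (previously inhibited) alert B resolved?

    Toy model: B's resolution is processed at one tick. If A is still
    firing at that tick, B is filtered out; otherwise the notification goes out.
    """
    tick = resolution_tick(b_resolved_at, group_interval)
    a_still_firing = a_resolved_at > tick
    return not a_still_firing


if __name__ == "__main__":
    gi = 300  # a hypothetical five-minute group_interval
    # A resolves 10 seconds after B: the surprise notification is sent.
    print(resolved_notification_sent(1000, 1010, gi))  # True
    # A outlasts B by 150 seconds, still less than group_interval: sent.
    print(resolved_notification_sent(1000, 1150, gi))  # True
    # A outlasts B by a full group_interval: suppressed.
    print(resolved_notification_sent(1000, 1000 + gi, gi))  # False
```

In this model the next tick always arrives less than one group_interval after B resolves, so an inhibiting alert that lasts a full group_interval longer than B is always still firing at that tick. Anything shorter depends on where the tick happens to land.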

(The longer your group_interval setting is, the worse this gets, but there are a number of reasons you probably want group_interval to be relatively short, including prompt notifications about resolved alerts under normal circumstances.)


Last modified: Tue Apr 2 23:02:40 2024