Wandering Thoughts archives

2024-04-02

An issue with Alertmanager inhibitions and resolved alerts

Prometheus Alertmanager has a feature called inhibitions, where one alert can inhibit other alerts. We use this in a number of situations, such as having our special 'there is a large scale problem' alert inhibit other alerts. Recently, because of a mailing list thread, I realized that there is a complication in how inhibitions interact with being notified about resolved alerts.

Suppose that you have an inhibition rule to the effect that alert A ('this host is down') inhibits alert B ('this special host daemon is down'), and you send notifications on resolved alerts. With alert A in effect, every time Alertmanager goes to send out a notification for the alert group that alert B is part of, Alertmanager will see that alert B is inhibited and filter it out (as far as I can tell this is the basic effect of Alertmanager silences, inhibitions, and mutes). Such notifications will (potentially) happen on every group_interval tick.

Now suppose that both alert A and alert B resolve at more or less the same time (because the host is back up along with its special daemon). Alertmanager doesn't immediately send notifications for resolved alerts; instead, just like all other alert group re-notifications, they wait for the next group_interval tick. When this tick happens, alert B will be a resolved alert that you should normally be notified about, and alert A will no longer be active and so no longer inhibiting it. You'll receive a potentially surprising notification about the now-resolved alert B, even though it was previously inhibited while it was active (and so you may never have received an initial notification that it was active).
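To make the sequence concrete, here is a minimal sketch of it as a toy model in Python. This is not Alertmanager's actual code or API; the Alert class, notify_at_tick(), the alert names, and the five minute tick spacing are all made up for illustration, and the model ignores details such as a resolved alert only being reported once.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    name: str
    start: int            # minute the alert began firing
    end: Optional[int]    # minute it resolved, or None if still firing

    def firing(self, now: int) -> bool:
        return self.start <= now and (self.end is None or now < self.end)

    def resolved(self, now: int) -> bool:
        return self.end is not None and now >= self.end


def notify_at_tick(now: int, source: Alert, target: Alert) -> Optional[str]:
    """Toy model of what a group_interval tick at `now` sends about `target`.

    `source` inhibits `target` only while `source` itself is firing.
    """
    if source.firing(now):
        return None  # target is inhibited and filtered out of the notification
    if target.resolved(now):
        return f"RESOLVED {target.name}"
    if target.firing(now):
        return f"FIRING {target.name}"
    return None


# Times are in minutes; ticks every 5 minutes stand in for group_interval.
host_down = Alert("HostDown", start=0, end=12)      # alert A
daemon_down = Alert("DaemonDown", start=1, end=12)  # alert B

for tick in (5, 10, 15):
    print(tick, notify_at_tick(tick, host_down, daemon_down))
```

In this model the ticks at minutes 5 and 10 send nothing about alert B, because alert A is still firing, and the tick at minute 15 reports B as resolved even though B was never reported as firing.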

(Although I described it as both alerts resolving around the same time, it doesn't have to be that way; alert A might have ended later than B, with some hand-waving and uncertainty. The necessary condition is for alert A and its inhibition to no longer be in effect when Alertmanager is about to process a notification that includes alert B's resolution.)

The consequence of this is that if you want inhibitions to reliably suppress notification about resolved alerts, you need the inhibiting alert to be active at least one group_interval longer than the alerts it's inhibiting. In some cases this is easy to arrange, but in other cases it may be troublesome and so you may want to simply live with the extra notifications about resolved alerts.
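As a rough sketch of that condition, assuming the alert group's ticks are evenly spaced from some known starting time (the function names and the first_tick_base parameter are my own inventions, not anything Alertmanager exposes), you could check whether a given pair of resolution times would leak a resolved notification like this:

```python
import math


def next_tick_at_or_after(first_tick_base: float, group_interval: float, t: float) -> float:
    """First tick time >= t, where ticks are first_tick_base + k * group_interval, k >= 1."""
    k = max(1, math.ceil((t - first_tick_base) / group_interval))
    return first_tick_base + k * group_interval


def resolution_suppressed(first_tick_base: float, group_interval: float,
                          b_resolved_at: float, a_resolved_at: float) -> bool:
    """True if alert A is still firing at the tick that would report B's resolution."""
    tick = next_tick_at_or_after(first_tick_base, group_interval, b_resolved_at)
    return a_resolved_at > tick


# Minutes, with a 5 minute group_interval and alert B resolving at minute 12.
print(resolution_suppressed(0, 5, b_resolved_at=12, a_resolved_at=12))  # False
print(resolution_suppressed(0, 5, b_resolved_at=12, a_resolved_at=17))  # True
```

Keeping alert A active for a full group_interval after B resolves always gets past the next tick in this model, which is why that is the safe margin even though a shorter one sometimes happens to work.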

(The longer your 'group_interval' setting is, the worse this gets, but there are a number of reasons you probably want group_interval to be relatively short, including prompt notifications about resolved alerts under normal circumstances.)

sysadmin/AlertmanagerInhibitionsGotcha written at 23:02:40

What Prometheus Alertmanager's group_interval setting means

One of the configuration settings in Prometheus Alertmanager for 'routes' is the alert group interval, the 'group_interval' setting. The Alertmanager configuration documentation describes the setting this way:

How long to wait before sending a notification about new alerts that are added to a group of alerts for which an initial notification has already been sent.

As has come up before more than once, this is not actually accurate. The group interval is not a (minimum) delay; it is instead a timer that ticks every so often (a ticker). If you have group_interval set to five minutes, Alertmanager will potentially send another notification only at five minute intervals after the first notification (each of which I'll call a tick). If the initial notification happened at 12:10, the first re-notification might happen at 12:15, and then at 12:20, and then at 12:25, and so on.

(The timing of these ticks is based purely on when the first notification for an alert group is sent, so usually they will not be so neatly lined up with the clock.)

If a new alert (or a resolved alert) misses the group_interval tick by even a second, a notification including it won't go out until the next tick. If the initial alert group notification happened at 12:10 and then nothing changed until a new alert was raised at 12:31, Alertmanager will not send another notification until the group_interval tick at 12:35, even though it's been much more than five minutes since the last notification.
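The arithmetic here is simple enough to sketch in a few lines of Python. This is only a model of the tick behavior as I've described it, not Alertmanager's code; next_tick() and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta


def next_tick(first_notification: datetime, group_interval: timedelta,
              event: datetime) -> datetime:
    """Return the first group_interval tick at or after `event`."""
    elapsed = event - first_notification
    # Ceiling division on timedeltas; there is always at least one interval to wait.
    ticks = max(1, -(-elapsed // group_interval))
    return first_notification + ticks * group_interval


first = datetime(2024, 4, 2, 12, 10)
interval = timedelta(minutes=5)

# A new alert at 12:31 waits for the 12:35 tick.
print(next_tick(first, interval, datetime(2024, 4, 2, 12, 31)))
# Missing the 12:15 tick by one second means waiting until 12:20.
print(next_tick(first, interval, datetime(2024, 4, 2, 12, 15, 1)))
```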

This gives you an unfortunate tradeoff between prompt notification of additional alerts in an alert group (or of alerts being resolved) and not receiving a horde of notifications. If you want to receive a prompt notification, you need a short group_interval, but then you can receive a stream of notifications as alert after alert after alert pops up one by one. It would be nicer if Alertmanager didn't have this group_interval tick behavior but would instead treat it as a minimum delay between successive notifications, but I don't expect Alertmanager to change at this point.
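To show the difference between the two behaviors, here is a small sketch that compares the tick semantics with the hypothetical minimum delay semantics I'd prefer. Both functions are illustrative models with invented names; the second one describes behavior that Alertmanager does not have.

```python
def ticker_send_time(first_sent: int, interval: int, event: int) -> int:
    """Tick semantics: send at the first tick (first_sent + k * interval) at or after event."""
    ticks = max(1, -(-(event - first_sent) // interval))  # ceiling division
    return first_sent + ticks * interval


def min_delay_send_time(last_sent: int, interval: int, event: int) -> int:
    """Hypothetical minimum delay semantics: send once `interval` has passed since last_sent."""
    return max(event, last_sent + interval)


# Times are minutes past 12:00, with a notification at 12:10 and a 5 minute interval.
for event in (12, 31):
    print(event, ticker_send_time(10, 5, event), min_delay_send_time(10, 5, event))
```

For a new alert at 12:12 both models send at 12:15, but for a new alert at 12:31 the tick model waits until 12:35 while the minimum delay model would send right away at 12:31.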

(I've written all of this down before in various entries, so this is mostly to have a single entry I can link to in the future when group_interval comes up.)

sysadmin/AlertmanagerGroupInterval written at 20:43:46


