Wandering Thoughts archives

2024-03-28

The effects of silences (et al) in Prometheus Alertmanager

Prometheus Alertmanager has various features that make it 'silence' alerts. Alerts can be inhibited by other alerts, they can be explicitly silenced, and a route can be muted at certain times or only active at certain times. The Alertmanager documentation generally describes all of these as "suppressing notifications" or causing a route to "not send any notifications". However, I would call this description under-specified, because it leaves open exactly what happens when you 'silence' alerts. As of Alertmanager 0.27.0, its actual behavior is somewhat complex and definitely hard to understand.

There are two pieces of behavior that seem straightforward:

  • if an alert starts within the silence and is still in effect at the end, you will get a new notification for its alert group at the group's next group_interval point; this notification will include the new alert (or alerts).

  • if an alert group (of one or more alerts) is created within the silence and all of its alerts end sufficiently before the end of the silence, you will get no notification about the alert group.

The area with big question marks is notifications about resolved alerts (if you have Alertmanager set to send notifications on them at all). If the alert resolves sufficiently early, well before the end of the silence, you appear to get no notification for it. If the alert resolves close enough to the end of the silence and its alert group still has active alerts, you will sometimes get an alert group notification that includes the resolved alert. Sometimes this notification will come immediately, and sometimes it seems to only come if the alert group experiences another change in alert status sufficiently soon after the silence has ended.

(There are a lot of variables here and I haven't experimented extensively. Generally I think the sooner that Alertmanager has some reason to send a notification for the alert group, the higher your chances of hearing about resolved alerts are. One source of such a notification is if there are active alerts that started within the silence.)

What I believe is happening is that Alertmanager is keeping track of what alerts have had notifications delivered about them (through a specific receiver), so that Alertmanager can tell if there are new alerts in an alert group that would cause it to send a notification at the next group_interval point. When a silence, mute, or inhibition is in effect, no affected alerts are marked as 'delivered (to receiver X)'. When the silence ends, any such unmarked alerts that still exist are (once again) considered to be undelivered new alerts and will prompt an alert group notification at the alert group's next group_interval point, just as if they had suddenly shown up after the silence ended.
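
To make this guess concrete, here is a minimal toy model of it in Python. It is only a sketch of the hypothesis above, not Alertmanager's actual code; the ToyGroup class, its fields, and its flush() method are all made up for illustration, and a single flush() call stands in for one group_interval point.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGroup:
    """Hypothetical stand-in for one alert group going to one receiver."""
    active: set = field(default_factory=set)     # alerts currently firing
    delivered: set = field(default_factory=set)  # alerts already notified about

    def flush(self, suppressed: set) -> list:
        """Model one group_interval point; return the alerts notified about."""
        # Silenced, muted or inhibited alerts are not sent and, in this
        # model, are also never marked as delivered.
        visible = self.active - suppressed
        new = visible - self.delivered
        if not new:
            return []
        # Any undelivered alert prompts a notification for the group.
        self.delivered |= visible
        return sorted(visible)


group = ToyGroup()
group.active.add("DiskFull")
during = group.flush(suppressed={"DiskFull"})  # inside the silence
after = group.flush(suppressed=set())          # first point after it ends
again = group.flush(suppressed=set())          # nothing new this time
print(during, after, again)
```

In this model the alert that started inside the silence looks exactly like a brand new alert at the first group_interval point after the silence ends, which matches the first piece of straightforward behavior.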

The complication is resolved alerts, because I believe that resolved alerts only linger in Alertmanager for a certain amount of time. After that time they are quietly removed. If an alert is resolved sufficiently early before the end of the silence, this linger time will end before the silence does and the resolved alert will disappear before its new status could trigger any notifications. If the alert is resolved sufficiently close to the end of the silence, it will still be in Alertmanager when notifications start happening again. I'm pretty sure this explanation is incomplete, but it at least gives me a starting point.
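
The timing part of this guess can be written down as a tiny Python sketch. Everything here is an assumption for illustration: the function name is made up, and the linger time of 300 seconds is a placeholder, not a value taken from Alertmanager. It also only says whether the resolved alert would still exist when the silence ends, not whether a notification is actually sent.

```python
def still_present_at_silence_end(resolved_at: float,
                                 silence_end: float,
                                 linger: float) -> bool:
    """True if a resolved alert has not yet been removed when the silence ends.

    All times are in seconds on the same clock. 'linger' is how long a
    resolved alert is assumed to stay around before being quietly removed.
    """
    return resolved_at + linger > silence_end


LINGER = 300.0        # assumed placeholder value, not Alertmanager's
SILENCE_END = 3600.0  # the silence ends one hour in

early = still_present_at_silence_end(600.0, SILENCE_END, LINGER)
late = still_present_at_silence_end(3500.0, SILENCE_END, LINGER)
print(early, late)
```

Under these assumed numbers, the alert that resolves at 600 seconds is gone long before the silence ends, while the one that resolves at 3500 seconds is still around to be included in a later alert group notification.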

PS: Since all of this is under-documented, Alertmanager's behavior could change in the future, either deliberately or accidentally.

(This somewhat elaborates on some things I said on the Fediverse.)

sysadmin/AlertmanagerSilencesEffects written at 23:11:30



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.