2023-04-08
A Prometheus Alertmanager alert grouping conundrum
We have various host-related alerts in our Prometheus and Alertmanager setup. Some of those are about things on the host not being right (full disks, for example, or network interfaces not being at the right speed), but some of them are alerts that fire if the host is down; for example, there are alerts on ping failures, SSH connection failures, and the Prometheus host agent not responding. Unsurprisingly, we reboot our machines every so often and we don't like to get spammed with spurious alerts, so in our Alertmanager configuration we delay those alerts a bit so that they won't send us an alert if the machine is just rebooting. This looks like:
  - match_re:
      alertname: 'NoPing|NoSSH|DownAgent'
    group_wait: 6m
    group_interval: 3m
We don't want to delay when these alerts start firing in Prometheus by giving them a long 'for:' delay; we want the Prometheus version of their state to reflect reality as we consider it. We also have some machines that are sufficiently important and sufficiently rarely rebooted that we don't wait the six minutes for them but instead alert almost immediately, such as our ZFS fileservers.
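(The 'alert almost immediately' exception can be expressed as a more specific route placed in front of the delayed one, since Alertmanager uses the first matching route. Here's a rough sketch, where the 'cshost' label name and the host regexp are invented for illustration and aren't our actual configuration:

    # Sketch only; the 'cshost' label and host pattern are hypothetical.
    # This more specific route must come before the general delayed one.
    - match_re:
        alertname: 'NoPing|NoSSH|DownAgent'
        cshost: 'fs[0-9]+'
      group_wait: 15s
      group_interval: 3m
)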
However, this creates a little conundrum where that alert matching up there is actually a lie. We have a number of other alerts that will fire on some hosts if the host is down, for example if the host runs an additional agent. If we don't put these alerts in the alert matching, Alertmanager groups them separately and we get two separate alerts if the host genuinely goes down, one from this grouping of 'host has probably rebooted' alerts and one from the default grouping of other per-host alerts. This is an easy thing to overlook when creating new alerts; generally I find such alerts when such a host goes down and we get more alert messages than we should.
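(Concretely, the fix is to widen the alertname regexp in the route above; 'ExtraAgentDown' here is a stand-in name, not one of our real alerts:

    - match_re:
        alertname: 'NoPing|NoSSH|DownAgent|ExtraAgentDown'
      group_wait: 6m
      group_interval: 3m
)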
(Once we have more than a few of these 'host could be rebooting' alerts, it might be better to set a special label on all of them and then match on the label in Alertmanager. However, it becomes less immediately visible what all of the alerts are.)
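(A sketch of that label-based version, using a hypothetical 'host_down_alert' label; the Prometheus expression is also just illustrative. Every such alert rule would have to remember to set the label:

    # Prometheus alert rule (label name and expression are examples):
    - alert: NoPing
      expr: probe_success{probe="ping"} == 0
      for: 1m
      labels:
        host_down_alert: "yes"

    # Alertmanager route, matching on the label instead of alert names:
    - match:
        host_down_alert: "yes"
      group_wait: 6m
      group_interval: 3m
)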
However, just adding these extra alerts to the alertname match has a more subtle trap that can still cause us to get extra alerts, and that is alert activation time. If an additional alert is sufficiently slow to trigger (which isn't uncommon for alerts such as ones about additional agents being down), it will miss the six-minute group wait interval, not be included in the initial alert sent to us about the host being down, and will be added in the next cycle of alert notices, giving us two alert notifications when a host is down. This too is easy to overlook, although once I realized it I added a comment to the Alertmanager stanza above about it, so I have a better chance of avoiding it in the future.
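(As a made-up illustration of the timing: the six-minute clock starts when the fast 'host is down' alerts reach Alertmanager, so an alert whose 'for:' delay is longer than the group wait can never make the initial notification.

    # Hypothetical rule; the name, expression, and timing are examples.
    - alert: ExtraAgentDown
      expr: up{job="extra_agent"} == 0
      for: 7m   # longer than the 6m group_wait, so it lands in a later notification
)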
I could switch to having Alertmanager inhibit various extra alerts if a host is down, but I'm not sure that's the right approach. We do sort of want to know what other things we're missing if a host goes down, although at the same time some things are irrelevant (eg, that additional host-specific exporters are down). One tricky bit about this is that you can't make inhibitions depend on multiple alerts all being active, so I'd probably need to have Prometheus trigger a synthetic 'host is down' alert if all of the conditions are true.
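(A sketch of what that might look like, with invented names and assuming the relevant metrics all share a 'host' label; the synthetic alert fires only when everything agrees the host is down, and is then used as the source of an inhibition:

    # Prometheus: a synthetic 'host is down' alert (illustrative expression).
    - alert: HostDown
      expr: |
          (probe_success{probe="ping"} == 0)
            and on (host) (probe_success{probe="ssh"} == 0)
            and on (host) (up{job="node"} == 0)
      for: 5m

    # Alertmanager: suppress other per-host alerts while HostDown is firing.
    inhibit_rules:
      - source_match:
          alertname: 'HostDown'
        target_match_re:
          alertname: 'ExtraAgentDown|SomeOtherPerHostAlert'
        equal: ['host']
)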
One way to look at this situation is that it's happening because you can't have Alertmanager conditionally group alerts together; alert groupings are static things. This makes perfect sense and is a lot easier to implement (and it avoids all sorts of corner cases), but sometimes it means that alert grouping gets in the way of alert aggregation.