Why Prometheus turns out not be our ideal alerting system
What we want out of an alert system is relatively straightforward (and was probably once typically for sysadmins who ran machines). We would like to get notified once and only once for any new alert that shows up (and for some of them to get notified again when they go away), and we'd also like these alerts to be aggregated together to some degree so we aren't spammed to death if a lot of things go wrong at once.
(It would be ideal if the degree of aggregation was something we could control on the fly. If only a few machines have problems we probably want to get separate emails about each machine, but if a whole bunch of machines all suddenly have problems, please, just send us one email with everything.)
Unfortunately Prometheus doesn't do this, because its Alertmanager has a fundamentally different model of how alert notification should work. Alertmanager's core model is that instead of sending you new alerts, it will send you the entire current state of alerts any time that state changes. So, if you group alerts together and initially there are two alerts in a group and then a third shows up later, Alertmanager will first notify you about the initial two alerts and then later re-notify you with all three alerts. If one of the three alerts clears and you've asked to be notified about cleared alerts, you'll get another notification that lists the now-cleared alert and the two alerts that are still active. And so on.
(One way to put this is to say that Alertmanager is sort of level triggered instead of edge triggered.)
This is not a silly or stupid thing for Alertmanager to do, and it has some advantages; for instance, it means that you only need to read the most recent notification to get a full picture of everything that's currently wrong. But it also means that if you have an escalating situation, you may need to carefully read all of the alerts in each new notification to realize this, and in general you risk alert fatigue if you have a lot of alerts that are grouped together; sooner or later the long list of alerts is just going to blur together. Unfortunately this describes our situation, especially if we try to group things together broadly.
(Alertmanager also sort of assumes other things, for example that you have a 24/7 operations team who deal with issues immediately. If you always deal with issues when they come up, you don't need to hear about an alert clearing because you almost certainly caused that and if you didn't, you can see the new state on your dashboards. We're not on call 24/7 and even when we're around we don't necessarily react immediately, so it's quite possible for things to happen and then clear up without us even looking at anything. Hence our desire to hear about cleared alerts, which is not the Alertmanager default.)
I consider this an unfortunate limitation in Alertmanager. Alertmanager internally knows what alerts are new and changed (since that's part of what drives it to send new notifications), but it doesn't expose this anywhere that you can get at it, even in templating. However I suspect that the Prometheus people wouldn't be interested in changing this, since I expect that distinguishing between new and old alerts doesn't fit their model of how alerting should be done.
On a broader level, we're trying to push a round solution into a square hole and this is one of the resulting problems. Prometheus's documentation is explicit about the philosophy of alerting that it assumes; basically it wants you to have only a few alerts, based on user-visible symptoms. Because we look after physical hosts instead of services (and to the extent that we have services we have a fair amount of them), we have a lot of potential alerts about a lot of potential situations.
(Many of these situations are user visible, simply because users can see into a lot of our environment. Users will notice if any particular general access login or compute server goes down, for example, so we have to know about it too.)
Our current solution is to make do. By grouping alerts only on a per-host basis, we hope to keep the 'repeated alerts in new notifications' problem down to a level where we probably won't miss significant new problems, and we have some hacks to create one time notifications (basically, we make sure that some alerts just can't group together with anything else, which is more work than you'd think).
(It's my view that using Alertmanager to inhibit 'less severe' alerts in favour of more severe ones is not a useful answer for us for various reasons beyond the scope of this entry. Part of it is that I think maintaining suitable inhibition rules would take a significant amount of care in both the Alertmanager configuration and the Prometheus alert generation, because Alertmanager doesn't give you very much power for specifying what inhibits what.)
Sidebar: Why we're using Prometheus for alerting despite this
Basically, we don't want to run a second system just for alerting unless we really have to, especially since a certain number of alerts are naturally driven from information that Prometheus is collecting for metrics purposes. If we can make Prometheus work for alerting and it's not too bad, we're willing to live with the issues (at least so far).