A lesson of (alert) scale we learned from a power failure
Starting last November, we moved over to a new metrics, monitoring, and alerting system based around Prometheus. Prometheus's Alertmanager allows you to group alerts together in various ways, but what it supports is not ideal for us and once the dust settled we decided that the best we could do was to group our alerts by host. In practice, hosts are both what we maintain and usually what breaks. And usually their problems are independent of each other.
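To make the grouping concrete, here is a minimal sketch of what "group by host" amounts to. It is plain Python that models the idea, not Alertmanager itself (where grouping is set with `group_by` on a route). The label name `host`, the alert names, and the host names are all made up for illustration.

```python
from collections import defaultdict

def group_alerts(alerts, group_by=("host",)):
    """Bucket alerts by the values of the group_by labels.

    Each bucket stands for one notification (one email) that would be sent.
    """
    groups = defaultdict(list)
    for labels in alerts:
        # Alerts that share the same values for the grouping labels
        # end up in the same notification.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels["alertname"])
    return dict(groups)

# Hypothetical firing alerts; names are illustrative only.
alerts = [
    {"alertname": "HostDown", "host": "fs1"},
    {"alertname": "SSHCheckFailed", "host": "fs1"},
    {"alertname": "HostDown", "host": "mail1"},
]

by_host = group_alerts(alerts)
for key, names in by_host.items():
    print(key, names)
```

With per-host grouping, the number of notifications is at least the number of distinct hosts that have something wrong, which is fine when only one or two hosts are broken.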
Then we had a power failure, and our DNS servers failed to come back into service. All of our Prometheus scraping and monitoring was done by host name, and 'I cannot resolve this host name' causes Prometheus to consider that the scrape or check has failed. Pretty much the moment the Prometheus server host rebooted, essentially all of our checks started failing and triggering alerts. Once we started to get the DNS servers back up, the resulting email could actually be delivered.
When the dust settled, we had received an impressive amount of email from Alertmanager (and a bunch of other system email, too, reporting things like cron job failures); my mail logs say we got over 700 messages all told. Needless to say, this much email is not useful; in fact, it's harmful. Instead of alert email pointing out problems, it was drowning us in noise; we had to ignore it and mass-delete it just to control our mailboxes.
I'd always known that this was a potential problem in our setup, but I didn't expect it to be that much of a problem (or to come up that soon). In the aftermath of the power failure, it was clear that we needed to control alert volume during anything larger than a small scale outage. Even if we'd only received one email message per host we monitored, it could still rapidly escalate to too many. By the time we're getting ten or fifteen email messages all of a sudden, they're pretty much noise. We have a problem and we know it; the exhaustive details are no longer entirely useful, especially if delivered in bits and pieces.
I took two lessons from this experience. The first is the obvious one, which is that you should consider what happens to your monitoring and alerting system if a lot of things go wrong, and think about how to deal with that. It's not an easy problem, because what you want when there's only a few things wrong is different from what you want when there's a lot of them, and how your alerting system is going to behave when things go very wrong is not necessarily easy to predict.
(I'm not sure if our alerts flapped or some of them failed to group together the way I expected them to, or both. Either way we got a lot more email than I'd have predicted.)
The second lesson is that large scale failures are perhaps more likely and less conveniently timed than you'd like, so it's worth taking at least some precautions to deal with them before you think you really need to. One reason to act ahead of time here is that a screaming alert system can easily make a bad situation worse. You may also want to err on the side of silence. In some ways it's better to get no alerts during a large scale failure than too many, since you probably already know that you have a big problem.
(This sort of elaborates on a toot of mine.)
Sidebar: How we now deal with this
Nowadays we have a special 'there is a large scale problem' alert that shuts everything else up for the duration. To go with it, we have a 'large scale outages' Grafana dashboard that is mostly text tables listing machines that are down, active alerts, failing checks, other problems, and so on.
(We built a dedicated dashboard for this because our normal overview dashboard isn't really designed to deal with a lot of things being down; it's more focused on the routine situation where nothing or almost nothing is down and you want an overview of how things are going. So, for example, it doesn't bother devoting much space to listing down hosts and active alerts, because most of the time that would be empty, wasted space.)
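As a rough illustration of the 'large scale problem' alert shutting everything else up, here is a small Python sketch of the suppression logic. This is a model of the idea only. In Alertmanager, one way to get this effect is with inhibit rules, but the alert name `LargeScaleProblem` and the sample alerts are invented, and this does not claim to be how our actual configuration is written.

```python
def apply_inhibition(firing, source_name="LargeScaleProblem"):
    """Return the alerts that would still produce notifications.

    If the (hypothetical) large scale alert is firing, it suppresses
    every other alert; otherwise everything goes through as usual.
    """
    if any(a["alertname"] == source_name for a in firing):
        return [a for a in firing if a["alertname"] == source_name]
    return list(firing)

# Hypothetical state during a big outage: many hosts down at once.
outage = [{"alertname": "HostDown", "host": f"host{i}"} for i in range(50)]
outage.append({"alertname": "LargeScaleProblem"})

# Hypothetical state on a normal day: one machine has a problem.
routine = [{"alertname": "HostDown", "host": "fs1"}]

notified_outage = apply_inhibition(outage)
notified_routine = apply_inhibition(routine)
print(len(notified_outage), len(notified_routine))
```

The point of the design is that the two situations want different behavior: on a routine day each problem gets its own notification, while during a large outage you get a single message and go to the dashboard for the details.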