A lesson of (alert) scale we learned from a power failure

August 26, 2019

Starting last November, we moved over to a new metrics, monitoring, and alerting system based around Prometheus. Prometheus's Alertmanager allows you to group alerts together in various ways, but what it supports isn't ideal for us, and once the dust settled we decided that the best we could do was to group our alerts by host. In practice, hosts are both what we maintain and usually what breaks, and usually their problems are independent of each other.
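As a rough sketch, grouping by host is expressed in Alertmanager's route configuration; the 'host' label name and the receiver below are illustrative assumptions, not our literal setup:

    # Hypothetical fragment of alertmanager.yml: batch together all alerts
    # that share the same 'host' label into a single notification.
    route:
      group_by: ['host']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: 'sysadmins-email'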

Then we had a power failure and our DNS servers failed to come back into service. All of our Prometheus scraping and monitoring was done by host name, and 'I cannot resolve this host name' causes Prometheus to consider the scrape or check to have failed. Pretty much the moment the Prometheus server host rebooted, essentially all of our checks started failing and triggering alerts, and eventually, as we started to get the DNS servers up, the resulting email could actually be delivered.
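To sketch why this cascades, Prometheus records any failed scrape (including one where the target's host name wouldn't resolve) as up == 0 for that target, so a typical per-host 'down' alert rule fires for every target at once. The rule below is an illustration with assumed names and timings, not our actual rule:

    # Hypothetical Prometheus alerting rule: one 'host down' alert per scrape
    # target whose scrapes are failing for any reason, DNS included.
    groups:
      - name: host-alerts
        rules:
          - alert: HostDown
            expr: up == 0
            for: 5m
            labels:
              severity: page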

When the dust settled, we had received an impressive amount of email from Alertmanager (and a bunch of other system email, too, reporting things like cron job failures); my mail logs say we got over 700 messages all told. Needless to say, this much email is not useful; in fact, it's harmful. Instead of alert email pointing out problems, it was drowning us in noise; we had to ignore it and mass-delete it just to control our mailboxes.

I'd always known that this was a potential problem in our setup, but I didn't expect it to be that much of a problem (or to come up that soon). In the aftermath of the power failure, it was clear that we needed to control alert volume during anything larger than a small scale outage. Even if we'd only received one email message per host we monitored, it could still rapidly escalate to too many. By the time we're getting ten or fifteen email messages all of a sudden, they're pretty much noise. We have a problem and we know it; the exhaustive details are no longer entirely useful, especially if delivered in bits and pieces.

I took two lessons from this experience. The first is the obvious one, which is that you should consider what happens to your monitoring and alerting system if a lot of things go wrong, and think about how to deal with that. It's not an easy problem, because what you want when there's only a few things wrong is different from what you want when there's a lot of them, and how your alerting system is going to behave when things go very wrong is not necessarily easy to predict.

(I'm not sure if our alerts flapped or some of them failed to group together the way I expected them to, or both. Either way we got a lot more email than I'd have predicted.)

The second lesson is that large scale failures are perhaps more likely and less conveniently timed than you'd like, so it's worth taking at least some precautions to deal with them before you think you really need to. One reason to act ahead of time here is that a screaming alert system can easily make a bad situation worse. You may also want to err on the side of silence. In some ways it's better to get no alerts during a large scale failure than too many, since you probably already know that you have a big problem.

(This sort of elaborates on a toot of mine.)

Sidebar: How we now deal with this

Nowadays we have a special 'there is a large scale problem' alert that shuts everything else up for the duration, and to go with it a 'large scale outages' Grafana dashboard that is mostly text tables listing machines that are down, active alerts, failing checks, other problems, and so on.
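In outline this is a counting alert rule plus an Alertmanager inhibit rule; the names, the threshold, and the labels below are assumptions for illustration, not our literal configuration:

    # Hypothetical Prometheus rule (it lives in a rule group like the earlier
    # sketch): fire when 'too many' scrape targets are failing at once.
    - alert: LargeScaleOutage
      expr: count(up == 0) > 10
      for: 2m
      labels:
        severity: outage

    # Hypothetical alertmanager.yml inhibit rule: while the large scale alert
    # is firing, suppress the ordinary per-host 'page' level alerts.
    inhibit_rules:
      - source_match:
          alertname: 'LargeScaleOutage'
        target_match:
          severity: 'page'

Because the large scale alert carries a different severity label than the per-host alerts, it doesn't match the inhibit rule's target side and so doesn't silence itself.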

(We built a dedicated dashboard for this because our normal overview dashboard isn't really designed to deal with a lot of things being down; it's more focused on the routine situation where nothing or almost nothing is down and you want an overview of how things are going. So, for example, it doesn't devote much space to listing hosts that are down and active alerts, because most of the time that would just be empty, wasted space.)


Comments on this page:

From 69.165.143.252 at 2019-08-27 17:46:12:

Our alerting system actually runs a DNS slave of all of our internal domains: resolv.conf points first to 127.0.0.1, and then to the internal DNS servers that all other hosts use (in case we do a local restart/upgrade).

We do a service check on our DNS servers of course.

By Perry Lorier at 2019-08-28 04:12:01:

A variation of this theme is "my monitoring stack is broken, and thus it says everything is broken". In a previous team, we had the "master connectivity ping". The monitoring box would probe 3 or 4 things that should always be up; if all of them are down, it fires the "master connectivity alert" and inhibits all the other alerts generated from that box. If any of them are up, then we assume the monitoring is fine and continue from there.

Your version seems to be isomorphic (just using all of its normal probes as the master connectivity pings).

By cks at 2019-08-28 09:34:42:

Our large scale problems alert isn't quite isomorphic to your master connectivity ping, because it (deliberately) triggers for any sufficiently large set of problems, even if not everything is affected. It's basically a 'this would be too much email' alert. As a result of our experiences, our current thresholds for it are set well below the level of 'everything seems to be broken'.

(More than a handful of machines being down or having problems is too much email for per-host alert emails, at least for us.)
