Wandering Thoughts archives

2019-08-26

A lesson of (alert) scale we learned from a power failure

Starting last November, we moved over to a new metrics, monitoring, and alerting system based around Prometheus. Prometheus's Alertmanager allows you to group alerts together in various ways, but what it supports is not ideal for us, and once the dust settled we decided that the best we could do was to group our alerts by host. In practice, hosts are both what we maintain and usually what breaks. And usually their problems are independent of each other.
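
As a rough sketch of what grouping by host looks like (our actual labels, timings, and destinations differ, and 'host' here is just an assumed label name), the relevant bit of an Alertmanager configuration is something like:

route:
  # Bundle alerts that share the same 'host' label into one notification.
  group_by: ['host']
  group_wait: 30s
  group_interval: 5m
  receiver: 'sysadmin-email'

receivers:
  - name: 'sysadmin-email'
    email_configs:
      - to: 'sysadmins@example.org'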

Then we had a power failure and our DNS servers failed to come back into service. All of our Prometheus scraping and monitoring was done by host name, and 'I cannot resolve this host name' causes Prometheus to consider that the scrape or check has failed. Pretty much the moment the Prometheus server host rebooted, essentially all of our checks started failing and triggering alerts, and once we started to get the DNS servers back up, the resulting alert email could actually be delivered.
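
To put some mechanics behind that: when a scrape fails for any reason, including the target's host name not resolving, Prometheus sets that target's 'up' metric to 0, and per-host alert rules fire from that. A minimal sketch of such a rule (not our actual rules):

groups:
  - name: host-alerts
    rules:
      - alert: HostDown
        # 'up' is 0 whenever a scrape fails, which includes the case
        # where the target's host name can't be resolved at all.
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Scrape of {{ $labels.instance }} is failing"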

When the dust settled, we had received an impressive amount of email from Alertmanager (and a bunch of other system email, too, reporting things like cron job failures); my mail logs say we got over 700 messages all told. Needless to say, this much email is not useful; in fact, it's harmful. Instead of alert email pointing out problems, it was drowning us in noise; we had to ignore it and mass-delete it just to control our mailboxes.

I'd always known that this was a potential problem in our setup, but I didn't expect it to be that much of a problem (or to come up that soon). In the aftermath of the power failure, it was clear that we needed to control alert volume during anything larger than a small scale outage. Even if we'd only received one email message per host we monitored, it could still rapidly escalate to too many. By the time we're getting ten or fifteen email messages all of a sudden, they're pretty much noise. We have a problem and we know it; the exhaustive details are no longer entirely useful, especially if delivered in bits and pieces.

I took two lessons from this experience. The first is the obvious one, which is that you should consider what happens to your monitoring and alerting system if a lot of things go wrong, and think about how to deal with that. It's not an easy problem, because what you want when there's only a few things wrong is different from what you want when there's a lot of them, and how your alerting system is going to behave when things go very wrong is not necessarily easy to predict.

(I'm not sure if our alerts flapped or some of them failed to group together the way I expected them to, or both. Either way we got a lot more email than I'd have predicted.)

The second lesson is that large scale failures are perhaps more likely and less conveniently timed than you'd like, so it's worth taking at least some precautions to deal with them before you think you really need to. One reason to act ahead of time here is that a screaming alert system can easily make a bad situation worse. You may also want to err on the side of silence. In some ways it's better to get no alerts during a large scale failure than too many, since you probably already know that you have a big problem.

(This sort of elaborates on a toot of mine.)

Sidebar: How we now deal with this

Nowadays we have a special 'there is a large scale problem' alert that shuts everything else up for the duration, and to go with it a 'large scale outages' Grafana dashboard that is mostly text tables listing machines that are down, active alerts, failing checks, other problems, and so on.
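
One way to implement this sort of 'everything else shut up' alert is an Alertmanager inhibition rule driven by an alert that fires when many scrapes fail at once. This is only a sketch with made-up names and thresholds, not our actual setup:

# Prometheus alert rule (it goes in a rule group): fire when lots of
# hosts stop answering at the same time. The threshold is invented.
- alert: LargeScaleOutage
  expr: count(up == 0) > 10
  labels:
    severity: outage
  annotations:
    summary: "Many hosts are down at once"

# Alertmanager: while LargeScaleOutage is firing, suppress ordinary
# alerts (assumed here to be labeled with 'severity: page').
inhibit_rules:
  - source_match:
      alertname: 'LargeScaleOutage'
    target_match:
      severity: 'page'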

(We built a dedicated dashboard for this because our normal overview dashboard isn't really designed to deal with a lot of things being down; it's more focused on the routine situation where nothing or almost nothing is down and you want an overview of how things are going. So, for example, it doesn't devote a lot of space to listing hosts that are down and active alerts, because most of the time that would just be empty, wasted space.)
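
The queries behind such tables can be quite simple. As an illustration (not our actual dashboard), a Grafana table panel doing instant PromQL queries along these lines covers down hosts and active alerts:

# Scrape targets that are currently failing, ie hosts that are probably down.
up == 0

# Alerts that are currently firing, via Prometheus's built-in ALERTS metric.
ALERTS{alertstate="firing"}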

AlertExplosionLessonLearned written at 21:58:58

2019-08-09

Turning off DNSSEC in my Unbound instances

I tweeted:

It has been '0' days since DNSSEC caused DNS resolution for perfectly good DNS names to fail on my machine. Time to turn DNSSEC validation off, which I should have done long ago.

I use Unbound on my machines, from the Fedora package, so this is not some questionable local resolver implementation getting things wrong; this is a genuine DNSSEC issue. In my case, it was for www.linuxjournal.com, which is in my news sources because it's shutting down. When I tried to visit it from my home machine, I couldn't get an answer for its IP address. Turning on verbose Unbound logging gave me a great deal of noise, in which I could barely make out that Unbound was able to obtain A and AAAA records but was then going on to do DNSSEC validation, where something was clearly going wrong. Turning off DNSSEC fixed it, once I did it in the right way.
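
One way to pin the blame on DNSSEC validation specifically (as opposed to the domain's DNS just being broken) is to repeat the query with DNSSEC checking disabled. I'm not claiming this is exactly what I did at the time; it's just a sketch:

# A normal query through the local Unbound; this fails if validation
# is what's going wrong.
dig @127.0.0.1 www.linuxjournal.com A

# The same query with the 'checking disabled' (CD) bit set, which asks
# the resolver to hand back answers without insisting on validation.
# If this one gets an answer, validation is the problem.
dig @127.0.0.1 +cd www.linuxjournal.com A

# Temporarily turn up Unbound's logging to see what it's doing.
unbound-control verbosity 3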

NLNet Labs has a Howto on turning off DNSSEC in Unbound that provides a variety of ways to do this, starting from setting 'val-permissive-mode: yes' all the way up to disabling the validator module. My configuration has had permissive mode set to yes for years, but that was apparently not good enough to deal with this situation, so I have now removed the validator module from my Unbound module configuration. In fact I have minimized it compared to the Fedora version.
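
For the record, permissive mode is a single server option, which looks like this in the configuration file (again, this was already set in mine and didn't help here):

server:
    val-permissive-mode: yes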

The Fedora 29 default configuration for Unbound modules is:

module-config: "ipsecmod validator iterator"

I had never heard of 'ipsecmod' before, but it turns out to be 'opportunistic IPSec support', as described in the current documentation for unbound.conf; I will let you read the details there. Although configured as a module in the Fedora version, it is not enabled ('ipsecmod-enabled' is set off); however, I have a low enough opinion of unprompted IPSec to random strangers that I removed the module entirely, just in case. So my new module config is just:

module-config: "iterator"

(Possibly I could take that out too and get better performance.)

In the Fedora Unbound configuration, this can go in a new file in /etc/unbound/local.d. I called my new file 'no-dnssec.conf'.
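
Concretely, I believe Fedora's stock unbound.conf pulls local.d files in under its server: section, so the whole file can be as small as this (a sketch of mine, not a verbatim copy):

# /etc/unbound/local.d/no-dnssec.conf
# Drop the validator (and ipsecmod) modules entirely.
module-config: "iterator"

After that, restarting Unbound (for example with 'systemctl restart unbound') picks up the change.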

(There were a variety of frustrating aspects to this experience and I have some opinions on DNSSEC as a whole, but those are for another entry.)

UnboundNoDNSSEC written at 20:56:17

2019-08-01

How not to set up your DNS (part 24)

I'll start with the traditional illustration of DNS results:

; dig +short mx officedepot.se.
50 odmailgate.officedepot.com.
40 officedepot.com.s10b2.psmtp.com.
30 officedepot.com.s10b1.psmtp.com.
20 officedepot.com.s10a2.psmtp.com.
10 officedepot.com.s10a1.psmtp.com.

What I can't easily illustrate is that none of the hostnames under psmtp.com exist. Instead, it seems that officedepot.com has shifted its current mail handling to outlook.com, based on their current MX. While odmailgate.officedepot.com resolves to an IP address, 205.157.110.104, that IP address does not respond on port 25 and may not even be in service any more.
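
If you want to verify this sort of thing yourself, the checks are straightforward (I'm not reproducing the output here, since it will drift over time anyway):

# The psmtp.com MX targets should resolve to addresses if they still exist.
dig +short a officedepot.com.s10a1.psmtp.com.

# What officedepot.com itself currently lists for mail handling.
dig +short mx officedepot.com.

# Whether the old mail gateway's IP answers on port 25 at all.
nc -w 5 205.157.110.104 25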

(It is not Office Depot's problem that we're trying to mail officedepot.se, of course; it is due to a prolific spammer hosted out of scaleway.com who is forging the envelope sender of their spam email as 'bounce@<various domains>', including officedepot.se and 'cloud.scaleway.com'.)

This does point out an interesting risk factor in shifting your mail system handling when you have a lot of domains, possibly handled by different groups of people. In an ideal world you would remember all of the domains that you accept mail for and get in touch with the people who handle their DNS to change everything, but in this world things can fall through the cracks. I suspect it's especially likely to happen at places that have enough domains that adding and removing them has been automated.

(It's been a while since the last installment; for various reasons I don't notice other people's DNS issues very often these days. I actually ran across a DNS issue in 2017 that I was going to post, but I ran into this issue and never finished the entry.)

HowNotToDoDNSXXIV written at 12:16:04


