How I managed to shoot myself in the foot with my local DNS resolver

October 16, 2016

I have my home machine's Twitter client configured so that it opens links in my always-running Firefox, and in fact there's a whole complicated lashup of shell scripting surrounding this in an attempt to do the right thing with various sorts of links. For the past little while, clicking on some of those links has often (although not always) been very slow to take effect; I'd click a link and it'd be several seconds before I got my new browser window. In the beginning I wrote this off as just Twitter being slow (which it sometimes is) and didn't think too much about it. Today this got irritating enough that I decided to investigate a bit, so I ran Dave Cheney's httpstat against twitter.com, expecting to see that all the delay was in either connecting to Twitter or in getting content back.
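(The check itself was nothing fancy; assuming the httpstat binary is installed somewhere on your $PATH, it's just:

    httpstat https://twitter.com/

httpstat splits the total request time into DNS lookup, TCP connection, TLS handshake, server processing, and content transfer, which is exactly the breakdown you want when you're trying to pin down where a delay comes from.)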

(To be honest, I expected that this was something to do with IPv6, as has happened before. My home IPv6 routing periodically breaks or malfunctions even when my IPv4 routing is fine.)

To my surprise, httpstat reported that it'd spent just over 5000 milliseconds in DNS lookup. So much for blaming anyone else; DNS lookup delays are pretty much all my fault, since I run a local caching resolver. I promptly started looking at my configuration and soon found the problem, which comes in two parts.

First, I had (and have) my /etc/resolv.conf configured with an ndots setting above the default and several search (sub)domains. This is for good historical reasons, since it lets me do things like 'ssh apps0.cs' instead of having to always specify the long fully qualified domain name. However, this means that every reasonably short website name, like twitter.com, was being checked to see if it was actually a university host like twitter.com.utoronto.ca. Of course it isn't, but that means that I was querying our DNS servers quite a lot, even for lookups that I conceptually thought of as having nothing to do with the university.
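As an illustrative sketch (the specific details here are simplified, not my exact configuration), the relevant bits of resolv.conf look something like this:

    search toronto.edu utoronto.ca
    options ndots:2
    nameserver 127.0.0.1

With ndots:2, any name with fewer than two dots, which includes both 'apps0.cs' and 'twitter.com', has the search domains appended and tried first before the name is looked up as-is.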

Second, my home Unbound setup is basically a copy of my work Unbound setup, and when I set it up (and copied it) I deliberately configured explicit Unbound stub zones for the university's top level domain that pointed to our nameservers. At work, the intent of this was to be able to resolve in-university hostnames even if our Internet link went down. At home, well, I was copying the work configuration because that was easy and what was the harm in short-cutting lookups this way?
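What this looks like in unbound.conf is a stub-zone block, roughly like this (the IPs here are placeholders from the documentation address range, not the university's real nameservers):

    stub-zone:
        name: "utoronto.ca"
        stub-addr: 192.0.2.10
        stub-addr: 192.0.2.11

A stub zone tells Unbound to send queries for anything at or under that name directly to the listed servers, instead of resolving it normally from the root downward.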

In case you are ever tempted to do this, the answer is that you have to be careful to keep your list of stub zone nameservers up to date, and of course I hadn't. As long as my configuration didn't break spectacularly I didn't give it any thought, and it turned out that one of the IP addresses I had listed as a stub-addr server doesn't respond to me at all any more (and some of the others may not have been entirely happy with me). If Unbound decided to send a query for twitter.com.utoronto.ca to that IP, well, it was going to be waiting for a timeout. No wonder I periodically saw odd delays like this (and stalls when I was trying to pull from or check github.com, and so on).
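(A quick spot-check would have caught this; again, the IP here is a placeholder rather than a real university nameserver:

    dig @192.0.2.10 utoronto.ca SOA +time=2 +tries=1

An unresponsive server shows up right away as ';; connection timed out; no servers could be reached' instead of an answer.)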

(Twitter makes this much more likely by having an extremely short TTL on their A records, so they fell out of Unbound's cache on a regular basis and had to be re-queried.)
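(You can check this sort of thing with dig; the second field of each answer line is the remaining TTL in seconds:

    dig +noall +answer twitter.com A

A very short authoritative TTL means a caching resolver has to re-query the upstream servers frequently, so any slow or dead path to them gets exercised over and over.)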

I don't know if short-cut stub zones for the university's forward and reverse DNS are still a sensible configuration for my office workstation's Unbound, but they definitely aren't for home usage. If the university's Internet link is down, well, I'm outside it at home; I'm not reaching any internal servers for either DNS lookups or connections. So I've wound up taking them out of my home configuration and looking utoronto.ca names up just like any other domain.

(This elaborates on a Tweet of mine.)

Sidebar: The situation gets more mysterious

It's possible that this is actually a symptom of more than me just setting up a questionable caching DNS configuration and then failing to maintain and update it. In the process of writing this entry I decided to take another look at various university DNS data, and it turns out that the non-responding IP address I had in my Unbound configuration is listed as an official NS record for various university subdomains (including some that should be well maintained). So it's possible that something in the university's DNS infrastructure has fallen over or become incorrect without having been noticed.
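(If you want to check this sort of thing yourself, it's straightforward with dig; first list the NS records and then query each listed server directly, as in the spot-check earlier:

    dig +noall +answer utoronto.ca NS

Any official nameserver that then fails to answer a direct query is a problem, whether or not most people have noticed it yet.)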

(I wouldn't say that my Unbound DNS configuration was 'right', at least at home, but it does mean that my configuration might have kept working smoothly if not for this broader issue.)
