It's the indirect failure modes that will get you

July 31, 2010

The University of Toronto's Internet link went down recently (well, became really slow and lossy, so we may just be being DDoS'd or something). I'm at home, so when I noticed the link problems I shrugged and carried on; it's not as if my home machine depends on stuff from work, so I didn't expect anything beyond the annoyance of not being able to get to work networks.

(Although the network being unreachable was going to be somewhat inconvenient, since I had a WanderingThoughts entry to write.)

Except that all of my web browsing was achingly slow. Epically, totally slow. Pages would only come up very slowly, or come up but the browser would say they were still loading. This was quite puzzling; my network link wasn't busy and it's not as if I proxy my web traffic through work. A check of my DNS setup confirmed that I was using my local caching DNS server and that server wasn't bouncing everything through work.

And then I looked at my DNS server's query logs:

[...] query [...] www.flickr.com.cs.toronto.edu.
[...] query [...] www.flickr.com.toronto.edu.
[...] query [...] www.flickr.com.

An uncomfortable light dawned. I had work's domains configured as my search domain list in /etc/resolv.conf and I had the ndots option set very high (for bad reasons), so every hostname resolution attempt was trying several university domains first. Normally I don't notice these because I promptly get negative answers from work's nameservers, but with the university's Internet link down those queries instead had to time out before the lookup could move on to trying the real name.
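To make the lookup order concrete, here is a minimal sketch of how a stub resolver expands a name using the search list and ndots. The function name is made up for illustration, the search list is inferred from the query log above, and the ndots value of 10 is a hypothetical stand-in for "very high". A real resolver stops at the first name that resolves; this only lists the order of attempts.

```python
def candidate_names(name, search, ndots):
    """List the fully qualified names a stub resolver would try, in order.

    Illustrative model only: a real resolver stops at the first success.
    """
    if name.endswith("."):
        # An explicitly rooted name is never run through the search list.
        return [name]
    absolute = name + "."
    searched = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [absolute] + searched
    # Too few dots: run through the search list first, absolute name last.
    return searched + [absolute]


# Search list inferred from the log; ndots=10 is a hypothetical "very high" value.
search = ["cs.toronto.edu", "toronto.edu"]
for fqdn in candidate_names("www.flickr.com", search, ndots=10):
    print(fqdn)
```

With these assumed settings the printed order matches the three queries in the log, with the real name tried last.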

It turns out that modern web pages use a lot of different things from a lot of different domains. When each of these domains takes several seconds to resolve, loading pages gets really slow: slow on the initial load (as the browser resolves the website's IP address) and then slow to finish, as the browser tries to fetch additional resource after additional resource.
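A rough back-of-the-envelope sketch of why this adds up so fast. All of the numbers here are hypothetical (I did not measure the timeouts or count the hostnames); the point is only that the delay multiplies across hostnames and dead search domains.

```python
import math


def worst_case_dns_delay(hostnames, dead_search_domains, timeout_s, parallel=1):
    """Rough upper bound on the extra DNS wait for one page, in seconds.

    Assumes every distinct hostname must time out once per dead search
    domain before the real name is tried, and that the browser resolves
    `parallel` hostnames at a time. Ignores caching entirely.
    """
    rounds = math.ceil(hostnames / parallel)
    return rounds * dead_search_domains * timeout_s


# Hypothetical page: 15 distinct hostnames, 2 dead search domains, 3 s per timeout.
print(worst_case_dns_delay(15, 2, 3.0))              # fully serial lookups
print(worst_case_dns_delay(15, 2, 3.0, parallel=6))  # six lookups at a time
```

Even with generous parallelism, every new hostname on the page pays for each dead search domain before its real lookup starts.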

This isn't a direct failure mode, where I was routing traffic through work; instead it was an indirect failure mode, where a couple of configuration options had a non-obvious effect that was itself relatively invisible in normal operation. Direct failure modes are easy to see and relatively easy to remember; you can, for example, see that all of your traffic goes over your VPN to work, a VPN that is not working. Indirect failures are much less obvious, and so are both much more interesting (in the sense of causing excitement) and much harder to notice in advance.

Sidebar: my ndots mistake

Many years ago when I first ran into the ndots option in resolv.conf, either it behaved differently than it does today or I just wound up with a mistaken impression about how it works. Back then, I believed that queries for names with at least ndots dots in them entirely ignored the resolv.conf search path and only ever looked up the absolute hostname. Since we love using abbreviated hostnames around here and local subdomains can have any number of dots in them, this implied that essentially no small value of ndots was safe. Thus I set a very large one and grumbled, and carried all of this forward when I configured my home machine.

This is not how ndots works today. Today, ndots just sets the threshold at which the resolver tries the absolute hostname before trying your search path, instead of trying the absolute hostname only after running all the way through the search path. This is safe, and implies that an ndots of 2 is generally what I want (since I make frequent use of '<host>.<subdomain>' to refer to various machines at work).
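The difference between my old belief and the actual behaviour can be sketched as a single flag. This is an illustrative model, not the resolver's real code; the function name, the `search_after_absolute` flag, and the `host.cs` name are all made up for the example, and the search list is the one implied by my query logs.

```python
def lookup_order(name, search, ndots, search_after_absolute=True):
    """Order of names tried for `name` under a simple model of ndots.

    search_after_absolute=True models today's behaviour: the search path
    is still used as a fallback after the absolute name.
    search_after_absolute=False models my old, mistaken belief that names
    with at least ndots dots skip the search path entirely.
    """
    absolute = [name + "."]
    searched = [f"{name}.{domain}." for domain in search]
    if name.count(".") < ndots:
        # Too few dots: search path first, absolute name last.
        return searched + absolute
    return absolute + (searched if search_after_absolute else [])


search = ["cs.toronto.edu", "toronto.edu"]

# ndots 2: outside names go straight to the absolute lookup...
print(lookup_order("www.flickr.com", search, ndots=2))
# ...while '<host>.<subdomain>' names still go through the search path first.
print(lookup_order("host.cs", search, ndots=2))
# Under the old belief, a small ndots would have lost the search path here.
print(lookup_order("host.cs", search, ndots=1, search_after_absolute=False))
```

The last call shows what I was afraid of: with the mistaken model, an abbreviated name with a dot in it would never have been expanded.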

Written on 31 July 2010.


Last modified: Sat Jul 31 02:53:15 2010