It's the indirect failure modes that will get you
The University of Toronto's Internet link went down recently (well, became really slow and lossy, so we may just be being DDoS'd or something). I'm at home, so when I noticed the link problems I shrugged and carried on; it's not as if my home machine depends on stuff from work, so I didn't expect anything beyond the annoyance of not being able to get to work networks.
(Although the network being unreachable was going to be somewhat inconvenient, since I had a WanderingThoughts entry to write.)
Except that all of my web browsing was achingly slow. Epically, totally slow. Pages would only come up very slowly, or come up but the browser would say they were still loading. This was quite puzzling; my network link wasn't busy and it's not as if I proxy my web traffic through work. A check of my DNS setup confirmed that I was using my local caching DNS server and that server wasn't bouncing everything through work.
And then I looked at my DNS server's query logs:
[...] query [...] www.flickr.com.cs.toronto.edu.
[...] query [...] www.flickr.com.toronto.edu.
[...] query [...] www.flickr.com.
An uncomfortable light dawned. I had work's domains configured as my search domain list in /etc/resolv.conf and I had the ndots option set very high (for bad reasons), so every hostname resolution attempt was trying several university domains first. Normally I don't notice these because I promptly get negative answers from work's nameservers, but with the university's Internet link down those queries instead had to time out before the lookup could move on to trying the real name.
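To make the lookup order concrete, here is a rough sketch of how a stub resolver picks the names to try, based on the behaviour described in resolv.conf(5). It is not the resolver's actual code: the function name candidate_queries is made up for illustration, and ndots=10 is an assumed stand-in for "very high". The search domains are the ones visible in the query log.

```python
def candidate_queries(name, search, ndots):
    """Return the fully qualified names a stub resolver would try, in order.

    A simplified model of the resolv.conf(5) rules, not real resolver code.
    """
    if name.endswith("."):
        # A trailing dot marks the name as already absolute: no search path.
        return [name]
    absolute = name + "."
    via_search = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search path.
        return [absolute] + via_search
    # Too few dots: walk the whole search path before trying the name as-is.
    return via_search + [absolute]


SEARCH = ["cs.toronto.edu", "toronto.edu"]  # taken from the query log
for query in candidate_queries("www.flickr.com", SEARCH, ndots=10):
    print(query)
```

With a high ndots the real name comes last, which matches the order in the log. The resolver stops at the first name that resolves, so in normal operation the quick negative answers hide the extra queries.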
It turns out that modern web pages use a lot of different things from a lot of different domains. When each of these domains takes plural seconds to resolve, loading pages gets really slow. Slow on the initial load (as the browser resolves the actual website IP address) and then slow to finish, as the browser tries to fetch additional resource after additional resource.
This wasn't a direct failure mode, such as routing my traffic through work; instead it was an indirect failure mode, where a couple of configuration options had a non-obvious effect that was itself relatively invisible in normal operation. Direct failure modes are easy to see and relatively easy to remember; you can, for example, see that all of your traffic goes over your VPN to work, a VPN that is not working. Indirect failure modes are much less obvious, which makes them much more interesting (in the sense of causing excitement) and much harder to notice in advance.
Many years ago when I first ran into the ndots option in resolv.conf, either it behaved differently than it does today or I just wound up with a mistaken impression about how it works. Back then, I believed that queries for names with at least ndots dots in them entirely ignored the resolv.conf search path and only ever looked up the absolute hostname. Since we love using abbreviated hostnames around here and local subdomains can have any number of dots in them, this implied that essentially no small value of ndots was safe. Thus I set a very large one and grumbled, and carried all of this forward when I configured my home machine.
This is not how ndots works today; today, ndots just sets the point at which the resolver will try the absolute hostname before trying your search path, instead of trying the absolute hostname only after running all the way through the search path. This is safe, and implies that an ndots of 2 is generally what I want (since I make frequent use of '<host>.<subdomain>' to refer to various machines at work).
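As a rough illustration of what an ndots of 2 means in practice, the sketch below parses a hypothetical resolv.conf and reports which kind of lookup comes first for two names. The file contents are assumptions apart from the search domains (seen in the query log) and ndots:2; the nameserver address and the parse_resolv_conf helper are made up for the example, and 'host.subdomain' is a stand-in for a real abbreviated work name.

```python
RESOLV_CONF = """\
# Hypothetical resolv.conf; the nameserver address is an assumption.
nameserver 127.0.0.1
search cs.toronto.edu toronto.edu
options ndots:2
"""


def parse_resolv_conf(text):
    """Extract the search list and the ndots value (default 1) from resolv.conf text."""
    search, ndots = [], 1
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith(("#", ";")):
            continue  # skip blank lines and comments
        if fields[0] == "search":
            search = fields[1:]
        elif fields[0] == "options":
            for opt in fields[1:]:
                if opt.startswith("ndots:"):
                    ndots = int(opt.split(":", 1)[1])
    return search, ndots


search, ndots = parse_resolv_conf(RESOLV_CONF)
for name in ("www.flickr.com", "host.subdomain"):
    # Names with at least ndots dots are tried as absolute names first.
    first = "absolute name" if name.count(".") >= ndots else "search path"
    print(f"{name}: {first} first")
```

Under this model www.flickr.com (two dots) goes straight to the real name, while a one-dot abbreviated name still goes through the search path first.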