== It's the indirect failure modes that will get you

The University of Toronto's Internet link went down recently (well, became really slow and lossy, so we may just be being DDoS'd or something). I'm at home, so when I noticed the link problems I shrugged and carried on; it's not as if my home machine depends on stuff from work, so I didn't expect anything beyond the annoyance of not being able to get to work networks. (Although the network being unreachable was going to be somewhat inconvenient, since I had a WanderingThoughts entry to write.)

Except that all of my web browsing was achingly slow. Epically, totally slow. Pages would only come up very slowly, or come up but the browser would say they were still loading. This was quite puzzling; my network link wasn't busy and it's not as if I proxy my web traffic through work. A check of my DNS setup confirmed that I was using my local caching DNS server and that server wasn't bouncing everything through work.

And then I looked at my DNS server's query logs:

> _[...] query [...] www.flickr.com.cs.toronto.edu._ \\
> _[...] query [...] www.flickr.com.toronto.edu._ \\
> _[...] query [...] www.flickr.com._

An uncomfortable light dawned. I had work's domains configured as my search domain list in _/etc/resolv.conf_ and I had the _ndots_ option set very high (for bad reasons), so every hostname resolution attempt was trying several university domains first. Normally I don't notice these because I promptly get negative answers from work's nameservers, but with the university's Internet link down those queries instead had to time out before the lookup could move on to trying the real name.

It turns out that modern web pages use a lot of different things from a lot of different domains. When each of these domains takes plural seconds to resolve, loading pages gets really slow.
Slow on the initial load (as the browser resolves the actual website IP address) and then slow to finish, as the browser tries to fetch additional resource after additional resource.

This isn't a direct failure mode, where I was routing traffic through work; instead it was an indirect failure mode, where a couple of configuration options had an inobvious effect that was itself relatively invisible in normal operation. Direct failure modes are easy to see and relatively easy to remember; you can, for example, see that all of your traffic goes over your VPN to work, a VPN that is not working. Indirect failures are much less obvious and so are much more interesting (in the sense of causing excitement) and hard to notice in advance.

=== Sidebar: my _ndots_ mistake

Many years ago when I first ran into the _ndots_ option in resolv.conf, either it behaved differently than it does today or I just wound up with a mistaken impression about how it works. Back then, I believed that queries for names with at least _ndots_ dots in them entirely ignored the resolv.conf search path and only ever looked up the absolute hostname. Since we love using abbreviated hostnames around here and local subdomains can have any number of dots in them, this implied that essentially no small value of _ndots_ was safe. Thus I set a very large one and grumbled, and carried all of this forward when I configured my home machine.

This is not how _ndots_ works today; today, _ndots_ just sets the point at which the resolver will try the absolute hostname before trying your search path, instead of trying the absolute hostname only after running all the way through the search path. This is safe, and implies that an _ndots_ of 2 is generally what I want (since I make frequent use of names with a single '.' in them to refer to various machines at work).
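
As a rough illustration of the lookup order described above, here is a minimal Python sketch of how a stub resolver might expand a name using the search list and _ndots_. It only models the ordering as this entry describes it; it does no actual DNS queries, and the function name _candidate_queries_ and the value 15 for "very high" are made up for the example. A real resolver stops at the first name that gets a positive answer, which is why the failed (or timed out) search-path names have to be worked through before the real name is tried.

```python
def candidate_queries(name, search, ndots=1):
    """Return the fully qualified names to try, in order.

    Hypothetical helper: a simplified model of search-list expansion,
    not the actual resolver implementation.
    """
    if name.endswith("."):
        # A trailing dot marks the name as already absolute,
        # so the search list is not applied at all.
        return [name]

    absolute = name + "."
    searched = [f"{name}.{domain}." for domain in search]

    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [absolute] + searched
    # Too few dots: run through the search list first, then try it as given.
    return searched + [absolute]


# The search domains seen in the query logs above.
search = ["cs.toronto.edu", "toronto.edu"]

# Assumed "very high" ndots value versus a small one.
for ndots in (15, 2):
    print(f"ndots={ndots}:")
    for query in candidate_queries("www.flickr.com", search, ndots):
        print("   ", query)
```

With the high _ndots_ value the order comes out the same as in the logged queries, with both university names tried before _www.flickr.com_ itself. With an _ndots_ of 2, _www.flickr.com_ is tried first and the search path is only a fallback, while a name with fewer than two dots still goes through the search path first.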