Wandering Thoughts archives

2016-10-16

Why we care about long uptimes

Here's a question: why should we care about long uptimes, especially if we have to get them through somewhat artificial measures like not applying updates?

(I mean, sysadmins like boasting about long uptimes, but this is just boasting. And we shouldn't make long uptimes a fetish.)

One answer is certainly 'keeping your system up avoids disrupting users'. Of course there are many other ways to achieve this, such as redundancy and failure-resistant environments. The whole pets versus cattle movement is partly about making single-machine uptime unimportant; you achieve your user-visible uptime through a resilient environment that can deal with all sorts of failures, instead of through heroic (and artificial) efforts to keep single machines from rebooting or single services from restarting.

(Note that not all environments can work this way, although ours may be an extreme case.)

My answer is that long uptimes demonstrate that our systems are fundamentally stable. If you can keep a system up and stable for a long time, you've shown that (in your usage) it doesn't have issues like memory leaks, fragmentation, lurking counter rollover problems, and so on. Even very small issues here can destabilize a system over a span of months or years, so a multi-year uptime is a fairly strong demonstration that you don't have these problems. This matters because it means that any instability in the environment is introduced by us, which in turn means we can control it, schedule it, and so on.

A system that lacks this stability is one where, at a minimum, you're forced to schedule regular service restarts (or system reboots) in order to avoid unplanned or unpleasant outages when the accumulated slow problems grow too big. At worst, you have unplanned outages or service/system restarts when the system runs itself into the ground. You can certainly cope with this through things like auto-restarted programs and services, deadman timers that force automated reboots, and so on, but it's less than ideal. We'd like fundamentally stable systems because they provide a strong base to build on top of.
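(As a concrete sketch of the kind of workaround I mean, here's roughly what papering over an unstable service might look like on a systemd-based machine. The service name is made up for illustration; the directives themselves are standard systemd ones.)

    # /etc/systemd/system/flaky-daemon.service -- hypothetical unstable service
    [Service]
    ExecStart=/usr/local/sbin/flaky-daemon
    # restart the service whenever it dies
    Restart=on-failure
    RestartSec=5

    # /etc/systemd/system.conf -- a deadman timer via the hardware watchdog,
    # so the machine force-reboots if it stops responding
    [Manager]
    RuntimeWatchdogSec=2min

None of this makes the underlying system any more stable; it just limits how much of the instability you and your users actually see.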

So when I say 'our iSCSI backends have been up for almost two years', what I'm really saying is 'we've clearly managed to build an extremely stable base for our fileserver environment'. And that's a good thing (and not always the case).

sysadmin/LongUptimesImportance written at 23:55:29

How I managed to shoot myself in the foot with my local DNS resolver

I have my home machine's Twitter client configured so that it opens links in my always-running Firefox, and in fact there's a whole complicated lashup of shell scripting surrounding this in an attempt to do the right thing with various sorts of links. For the past little while, clicking on some of those links has often (although not always) been very slow to take effect; I'd click a link and it'd be several seconds before I got my new browser window. In the beginning I wrote this off as just Twitter being slow (which it sometimes is) and didn't think too much about it. Today this got irritating enough that I decided to investigate a bit, so I ran Dave Cheney's httpstat against twitter.com, expecting to see that all the delay was either in connecting to Twitter or in getting content back.
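For the record, this sort of spot check is simple to reproduce; assuming you have httpstat somewhere on your $PATH, it's just:

    httpstat https://twitter.com/

It prints a breakdown of where the time went, with the DNS lookup reported as its own phase.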

(To be honest, I expected that this was something to do with IPv6, as has happened before. My home IPv6 routing periodically breaks or malfunctions even when my IPv4 routing is fine.)

To my surprise, httpstat reported that it'd spent just over 5000 milliseconds in DNS lookup. So much for blaming anyone else; DNS lookup delays are pretty much all my fault, since I run a local caching resolver. I promptly started looking at my configuration and soon found the problem, which comes in two parts.
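Another quick way to see how long an individual lookup takes is to point dig directly at the local resolver and look at the query time it reports; something like this, assuming the resolver is listening on 127.0.0.1:

    dig twitter.com @127.0.0.1

The ';; Query time:' line near the bottom of dig's output is how long that one query took, and running it twice shows you the cached versus uncached difference. (Note that dig ignores your resolv.conf search list unless you give it +search, which turns out to matter here.)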

First, I had (and have) my /etc/resolv.conf configured with a non-zero ndots setting and several search (sub)domains. This is for good historical reasons, since it lets me do things like 'ssh apps0.cs' instead of always having to specify the long fully qualified name. However, it also means that every reasonably short website name, like twitter.com, gets checked to see if it's actually a university host like twitter.com.utoronto.ca. Of course it isn't, but it does mean that I was querying our DNS servers quite a lot, even for lookups that I conceptually thought of as having nothing to do with the university.
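To make the mechanics concrete, a resolv.conf along these lines will do it. The ndots value here is illustrative and my real search list has several entries, but utoronto.ca is the one that matters for this story:

    # /etc/resolv.conf (simplified illustration)
    nameserver 127.0.0.1
    search utoronto.ca
    options ndots:2

With ndots:2, any name with fewer than two dots in it, such as twitter.com, gets the search suffixes tried first, so the resolver asks for twitter.com.utoronto.ca and only falls back to plain twitter.com when that fails.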

Second, my home Unbound setup is basically a copy of my work Unbound setup, and when I set it up (and copied it) I deliberately configured explicit Unbound stub zones for the university's top level domain that pointed to our nameservers. At work, the intent of this was to be able to resolve in-university hostnames even if our Internet link went down. At home, well, I was copying the work configuration because that was easy and what was the harm in short-cutting lookups this way?
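Concretely, the relevant bit of the Unbound configuration looked roughly like this; the IP addresses are placeholders from the documentation range, not the university's real nameservers:

    # unbound.conf: send all utoronto.ca queries straight to the university's
    # nameservers instead of resolving them from the roots
    stub-zone:
        name: "utoronto.ca"
        stub-addr: 192.0.2.1
        stub-addr: 192.0.2.2

(There was a similar stub-zone for the university's reverse DNS.)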

In case you are ever tempted to do this, the answer is that you have to be careful to keep your list of stub zone nameservers up to date, and of course I hadn't. As long as my configuration didn't break spectacularly I didn't give it any thought, and it turned out that one of the IP addresses I had listed as a stub-addr server no longer responds to me at all (and some of the others may not have been entirely happy with me). If Unbound decided to send a query for twitter.com.utoronto.ca to that IP, well, it was going to be waiting for a timeout. No wonder I periodically saw odd delays like this (and stalls when I was trying to pull from or check github.com, and so on).
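A quick way to spot this kind of decay is to ask each stub-addr IP a question directly and see whether it answers promptly; something like the following, with 192.0.2.1 standing in for whichever nameserver IP you're checking:

    dig +time=2 +tries=1 www.utoronto.ca @192.0.2.1

A dead or unreachable server shows up as ';; connection timed out; no servers could be reached' instead of an answer.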

(Twitter makes this much more likely by having an extremely short TTL on their A records, so they fell out of Unbound's cache on a regular basis and had to be re-queried.)
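If you want to see this for yourself, a plain dig query will show you the TTLs involved; the second field of each answer line is the TTL in seconds, and when you ask a caching resolver it's the remaining TTL, counting down between repeated queries:

    dig +noall +answer twitter.com @127.0.0.1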

I don't know if short-cut stub zones for the university's forward and reverse DNS are still a sensible configuration for my office workstation's Unbound, but they definitely aren't for home usage. If the university's Internet link is down, well, I'm outside it at home; I'm not reaching any internal servers for either DNS lookups or connections. So I've wound up taking them out of my home configuration and looking up utoronto.ca names just like those of any other domain.

(This elaborates on a Tweet of mine.)

Sidebar: The situation gets more mysterious

It's possible that this is actually a symptom of more than just me setting up a questionable caching DNS configuration and then failing to maintain and update it. In the process of writing this entry I decided to take another look at various university DNS data, and it turns out that the non-responding IP address I had in my Unbound configuration is listed as an official NS record for various university subdomains (including some that should be well maintained). So it's possible that something in the university's DNS infrastructure has fallen over or become incorrect without having been noticed.

(I wouldn't say that my Unbound DNS configuration was 'right', at least at home, but it does mean that my configuration might have kept working smoothly if not for this broader issue.)

sysadmin/LocalDNSConfigurationFumble written at 02:17:43

