Why we care about long uptimes
Here's a question: why should we care about long uptimes, especially if we have to get these long uptimes in somewhat artificial situations like not applying updates?
(I mean, sysadmins like boasting about long uptimes, but this is just boasting. And we shouldn't make long uptimes a fetish.)
One answer is certainly 'keeping your system up avoids disrupting users'. Of course there are many other ways to achieve this, such as redundancy and failure-resistant environments. The whole pets versus cattle movement is in part about making single machine uptime unimportant; you achieve your user visible uptime by a resilient environment that can deal with all sorts of failures, instead of heroic (and artificial) efforts to keep single machines from rebooting or single services from restarting.
My answer is that long uptimes demonstrate that our systems are fundamentally stable. If you can keep a system up and stable for a long time, you've shown that (in your usage) it doesn't have issues like memory leaks, fragmentation, lurking counter rollover problems, and so on. Even very small issues here can destabilize your system over a span of months or years, so a multi-year uptime is a fairly strong demonstration that you don't have these problems. And this matters because it means that any instability problems in the environment are introduced by us, and that means we can control them and schedule them and so on.
A system that lacks this stability is one where at a minimum you're forced to schedule regular service restarts (or system reboots) in order to avoid unplanned or unpleasant outages when the accumulated slow problems grow too big. At the worst, you have unplanned outages or service/system restarts when the system runs itself into the ground. You can certainly deal with this with things like auto-restarted programs and services, deadman timers to force automated reboots, and so on, but it's less than ideal. We'd like fundamentally stable systems because they provide a strong base to build on top of.
So when I say 'our iSCSI backends have been up for almost two years', what I'm really saying is 'we've clearly managed to build an extremely stable base for our fileserver environment'. And that's a good thing (and not always the case).
How I managed to shoot myself in the foot with my local DNS resolver
I have my home machine's Twitter client configured so that it opens links in my always-running Firefox, and in fact there's a whole complicated lashup of shell scripting surrounding this in an attempt to do the right thing with various sorts of links. For the past little while, clicking on some of those links has often (although not always) been very slow to take effect; I'd click a link and it'd be several seconds before I got my new browser window. In the beginning I wrote this off as just Twitter being slow (which it sometimes is) and didn't think too much about it. Today this got irritating enough that I decided to investigate a bit, so I ran httpstat against twitter.com, expecting to see that all the delay was in either connecting to Twitter or in getting content back.
(To be honest, I expected that this was something to do with IPv6, as has happened before. My home IPv6 routing periodically breaks or malfunctions even when my IPv4 routing is fine.)
To my surprise, httpstat reported that it'd spent just over 5000 milliseconds in DNS lookup. So much for blaming anyone else; DNS lookup delays are pretty much all my fault, since I run a local caching resolver. I promptly started looking at my configuration and soon found the problem, which comes in two parts.
First, I had (and have) my /etc/resolv.conf configured with a ndots setting and several search (sub)domains. This is for good historical reasons, since it lets me do things like 'apps0.cs' instead of having to always specify the long fully qualified domain. However, this means that every reasonably short website name, like twitter.com, was being checked to see if it was actually a university host like twitter.com.utoronto.ca. Of course it isn't, but that means that I was querying our DNS servers quite a lot, even for lookups that I conceptually thought of as having nothing to do with the university.
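As an illustration, a resolv.conf along these general lines produces exactly this behaviour (the specific ndots value and search domains here are made up for the example, not my actual configuration):

```
# /etc/resolv.conf -- illustrative values only
nameserver 127.0.0.1       # the local caching resolver (Unbound)
options ndots:2            # names with fewer than 2 dots try the search list first
search cs.utoronto.ca utoronto.ca
```

With a setup like this, 'twitter.com' has only one dot, so the resolver walks the search list first: it tries twitter.com.cs.utoronto.ca and twitter.com.utoronto.ca before finally looking up plain twitter.com.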
Second, my home Unbound setup is basically a copy of my work Unbound setup, and when I set it up (and copied it) I deliberately configured explicit Unbound stub zones for the university's top level domain that pointed to our nameservers. At work, the intent of this was to be able to resolve in-university hostnames even if our Internet link went down. At home, well, I was copying the work configuration because that was easy and what was the harm in short-cutting lookups this way?
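The sort of configuration involved is a stub-zone block in unbound.conf; a minimal sketch looks something like this (the IP addresses here are invented stand-ins, not the university's real nameservers):

```
# unbound.conf fragment -- stub-addr IPs are hypothetical
stub-zone:
    name: "utoronto.ca"
    stub-addr: 128.100.100.128
    stub-addr: 128.100.100.129
```

With a stub zone in place, Unbound sends queries for anything under that name directly to the listed stub-addr servers instead of resolving through the normal delegation chain, which is why a dead address in the list can turn into timeouts.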
In case you are ever tempted to do this, the answer is that you have
to be careful to keep your list of stub zone nameservers up to date,
and of course I hadn't. As long as my configuration didn't break
spectacularly I didn't give it any thought, and it turned out that
one of the IP addresses I had listed as a
stub-addr server doesn't
respond to me at all any more (and some of the others may not have
been entirely happy with me). If Unbound decided to send a query
for twitter.com.utoronto.ca to that IP, well, it was going to be
waiting for a timeout. No wonder I periodically saw odd delays like
this (and stalls when I was trying to pull from or check things,
and so on).
(Twitter makes this much more likely by having an extremely short TTL on their A records, so they fell out of Unbound's cache on a regular basis and had to be re-queried.)
I don't know if short-cut stub zones for the university's forward and reverse DNS are still a sensible configuration for my office workstation's Unbound, but they definitely aren't for home usage. If the university's Internet link is down, well, I'm outside it at home; I'm not reaching any internal servers for either DNS lookups or connections. So I've wound up taking them out of my home configuration, and Unbound now looks utoronto.ca names up just like any other domain.
(This elaborates on a Tweet of mine.)
Sidebar: The situation gets more mysterious
It's possible that this is actually a symptom of more than me just setting up a questionable caching DNS configuration and then failing to maintain and update it. In the process of writing this entry I decided to take another look at various university DNS data, and it turns out that the non-responding IP address I had in my Unbound configuration is listed as an official NS record for various university subdomains (including some that should be well maintained). So it's possible that something in the university's DNS infrastructure has fallen over or become incorrect without having been noticed.
(I wouldn't say that my Unbound DNS configuration was 'right', at least at home, but it does mean that my configuration might have kept working smoothly if not for this broader issue.)