Sometimes why we have singleton machines is that failover is hard

February 28, 2015

One of our weaknesses around here is that we have any number of singleton machines that are single points of failure for important services, for example DHCP for some of our most important user networks. We build such machines with amenities like mirrored system disks, and we can put together a new instance in an hour or so (most of which just goes to copying things to the local disk), but that still means some amount of downtime in the event of a total failure. So why don't we build redundant systems for these things?

One reason is that there are a lot of services where failover and what I'll call 'cohabitation' are not easy. On the really easy side is something like caching DNS servers; it's easy to have two on the network at once and most clients can be configured to talk to both of them. If the first one goes down there will be some amount of inconvenience, but most everyone will wind up talking to the second one without anyone having to do anything. On the difficult side is something like a DHCP server with continually updated DHCP registration. You can't really have two active DHCP servers on the network at once, plus the backup one needs to be continually updated from the master. Switching from one DHCP server to the other requires doing something active, either by hand or through automation (and automation has its own hazards, like accidental or incomplete failover).
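
To make the easy case concrete, here is a minimal sketch of the client side of redundant caching DNS; both resolvers are simply listed in each client's resolver configuration (the addresses are made-up documentation addresses, not our actual servers):

    # /etc/resolv.conf on a client machine (hypothetical addresses)
    nameserver 192.0.2.10
    nameserver 192.0.2.11

If the first server stops answering, the stub resolver times out and moves on to the second one on its own; nothing on the servers has to change and nobody has to flip anything over.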

(In the specific case of DHCP you can make this easier with more automation, but then you have custom automation. Other services, like IMAP, are much less tractable for various reasons, although in some ways they're very easy if you're willing to tell users 'in an emergency change the IMAP server name to imap2.cs'.)
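
As an illustration of why the DHCP side takes real work: ISC dhcpd does ship a failover protocol, but it only synchronizes lease state for dynamic pools, so continually updated registration data (host declarations and the like) still has to be replicated to the second server by your own automation. A rough sketch of the primary server's side of such a setup (all names and addresses hypothetical, not our configuration):

    # dhcpd.conf on the (hypothetical) primary server
    failover peer "dhcp-failover" {
        primary;
        address 192.0.2.20;        # this server
        port 647;
        peer address 192.0.2.21;   # the secondary
        peer port 647;
        max-response-delay 60;
        max-unacked-updates 10;
        mclt 3600;                 # these two are set only on the primary
        split 128;
    }

    subnet 192.0.2.0 netmask 255.255.255.0 {
        pool {
            failover peer "dhcp-failover";
            range 192.0.2.100 192.0.2.200;
        }
    }

Even with this in place, keeping the registration data itself in sync on both servers is still on you, which is where the 'custom automation' comes in.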

Of course this is kind of an excuse. Having a prebuilt second server for many of these things would speed up bringing the service back if the worst came to the worst, even if the switchover took manual intervention. But there's a tradeoff here: prebuilding second servers would require more hardware and would at least partially complicate how we administer things. It's simpler if we don't wrestle with this, and so far our servers have been reliable enough that I can't remember any failures.

(This reliability is important. Building a second server is in a sense a gamble; you're investing up-front effort in the hopes that it will pay off in the future. If there is no payoff because you never need the second server, your effort turns into pure overhead and you may wind up feeling stupid.)

Another part of this is that I think we simply haven't considered building second servers for most of these roles; we've never sat down to weigh the pros and cons, to work out how many extra servers it would take, to figure out how critical some of these pieces of infrastructure really are, and so on. Some of our passive decisions here were undoubtedly formed at a time when our networks were used quite differently than they are today.

(For example, many fewer people used to bring in their own devices than do today; the natural result is that a working 'laptop' network is now much more important than before. Similar things probably apply to our wireless network infrastructure, although somewhat less so, since users have alternatives in an emergency (such as the campus-wide wireless network).)
