Sometimes why we have singleton machines is that failover is hard

February 28, 2015

One of our single points of failure around here is that we have any number of singleton machines that provide important services, for example DHCP for some of our most important user networks. We build such machines with amenities like mirrored system disks and we can put together a new instance in an hour or so (most of which just goes to copying things to the local disk), but that still means some amount of downtime in the event of a total failure. So why don't we build redundant systems for these things?

One reason is that there are a lot of services where failover and what I'll call 'cohabitation' are not easy. On the really easy side is something like caching DNS servers; it's easy to have two on the network at once and most clients can be configured to talk to both of them. If the first one goes down there will be some amount of inconvenience, but most everyone will wind up talking to the second one without anyone having to do anything. On the difficult side is something like a DHCP server with continually updated DHCP registration. You can't really have two active DHCP servers on the network at once, plus the backup one needs to be continually updated from the master. Switching from one DHCP server to the other requires doing something active, either by hand or through automation (and automation has hazards, like accidental or incomplete failover).

(In the specific case of DHCP you can make this easier with more automation, but then you have custom automation. Other services, like IMAP, are much less tractable for various reasons, although in some ways they're very easy if you're willing to tell users 'in an emergency change the IMAP server name to imap2.cs'.)

Of course this is kind of an excuse. Having a prebuilt second server for many of these things would speed up bringing the service back if the worst came to the worst, even if it took manual intervention. But it's a tradeoff here; prebuilding second servers would require more servers and at least partially complicate how we administer things. It's simpler if we don't wrestle with this and so far our servers have been reliable enough that I can't remember any failures.

(This reliability is important. Building a second server is in a sense a gamble; you're investing up-front effort in the hopes that it will pay off in the future. If there is no payoff because you never need the second server, your effort turns into pure overhead and you may wind up feeling stupid.)

Another part of this is that I think we simply haven't considered building second servers for most of these roles; we've never sat down to consider the pros and cons, to evaluate how many extra servers it would take, to figure out how critical some of these pieces of infrastructure really are, and so on. Some of our passive decisions here were undoubtedly formed at a time when how our networks were used looked different than it does now.

(Eg, it used to be the case that many fewer people brought in their own devices than today; the natural result of this is that a working 'laptop' network is now much more important than before. Similar things probably apply to our wireless network infrastructure, although somewhat less so since users have alternatives in an emergency (such as the campus-wide wireless network).)

Comments on this page:

By Ewen McNeill at 2015-03-01 00:27:30:

In both of your examples ("continually updated" DHCP server and IMAP server) it seems to me the problem is not the DHCP server or IMAP server itself -- it's the state (DHCP registrations, mail store). If you can abstract that state out, then you can trivially run two DHCP servers or IMAP servers in parallel, and they can probably actually run active-active without any real risk.

In the case of the DHCP state, it's the MAC registrations and MAC/IP associations. And you could abstract it by putting it into a network-accessible database on another machine. If that database were, eg, any RDBMS that supported replication, the database itself is no longer a point of failure (even if, eg, it's on the same hosts as the DHCP servers, although that might require master-master database replication to work).

In the case of the IMAP server the state is the messages in the store, and their status (read/unread, delivered/deleted, etc). To which the traditional answer would be shared storage. You'd probably need a multi-access safe file system to be active-active, but they exist too (NFS, GFS, etc -- NFS may or may not be ideal for IMAP's atomic changes on file store). That then gives you a shared storage point of failure unless you replicate that too, but shared storage replication is a pretty well solved problem (at least if you're willing to accept fuzziness around the edges -- and on IMAP you probably can).

I suspect this is true of most "difficult to replicate" cases: there's part of the thing which is easy to replicate, and part of the thing which is not easy to replicate. And if you can separate those two parts, you end up with a bunch of easy to replicate parts. And possibly some harder to replicate parts, which mostly can be transformed into solved problems (RDBMS replication, shared storage/replication).

Whether this separation is worth the effort depends in part on how often they fail, and the effect of downtime. If, eg, your DHCP server fails once every 3 years, and the effect is that no brand new machines can get leases but all "seen before" machines continue using their existing leases/last-used IPs, and you can build a new one before the leases run out... then maybe you don't have a problem that actually needs a solution.


PS: As an observation, the effort that goes into automating the build of a server config pays off if you ever need a second one, or would benefit from a second one. The second one is essentially free to build. (I suspect I wouldn't dedicate a physical host to, eg, a backup DHCP server if it wasn't run active-active, but I might well be prepared to have a prebuilt VM which would "do in a pinch" in addition to an automated build which would make a bare machine into a DHCP server in a matter of an hour should it be decided the original DHCP master hardware is irreparable.)

By cks at 2015-03-01 01:50:37:

My view is that the obvious state is only part of it. If you run two DHCP servers at once, for example, they will both attempt to answer a DHCP request; unless all IPs are statically assigned (which they aren't for us), this will give you odd results. To have only one of the two DHCP servers running at once requires some form of failover (broadly construed).

This is also more or less the issue with IMAP. We already have all of the mailboxes on shared storage, but only one thing can have the IP address of the IMAP server at a time. Theoretically we could give 'imap.cs' multiple IPs in DNS and then count on clients trying more than one of them if the first doesn't respond, but I'm not convinced that this would actually work well in practice in the face of clients that open multiple connections at once.
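As a rough illustration of what "clients trying more than one of them" would have to mean, here is a minimal Python sketch of a client that walks every address a name resolves to until one accepts a connection. The helper name `connect_any`, its parameters, and the timeout are assumptions made up for this example; it says nothing about how any real IMAP client behaves, which is exactly the open question. (Python's own `socket.create_connection` already loops over addresses in much this way when given a hostname.)

```python
import socket


def connect_any(host, port, timeout=5.0,
                resolve=socket.getaddrinfo,
                connect=socket.create_connection):
    """Hypothetical helper: try each address for `host` in order and
    return the first connection that succeeds."""
    last_error = None
    for _family, _type, _proto, _canon, sockaddr in resolve(
            host, port, type=socket.SOCK_STREAM):
        try:
            # sockaddr[:2] is (address, port) for both IPv4 and IPv6.
            return connect(sockaddr[:2], timeout)
        except OSError as exc:
            last_error = exc      # remember the failure, try the next address
    raise OSError(f"no address for {host} accepted a connection") from last_error
```

A client written like this only pays a timeout on the dead address before moving on. A client that opens several connections at once and gives up on the first failure gets no such benefit.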

In terms of automated builds: we already have reproducible builds that run pretty much as fast as we could get them. The hour or so to spin up a new install of say the DHCP server is starting from grabbing a new uninstalled server.

By Ewen McNeill at 2015-03-01 04:48:31:

(Wow, this turned into a long reply... posting anyway in case it is useful to you/others.)

My recollection is that DHCP was explicitly designed to support multiple DHCP servers on the network at once, for a bunch of reasons. You get some very odd symptoms if you have multiple DHCP servers with wildly different views of what the network looks like (eg, different IP subnet and default gateway), that all answer for "generic" (non-tagged) queries. But providing all your DHCP servers hand out a coherent set of results (or only answer for different specific tags), it should be fine.

Part of the point of there being DHCP DISCOVER/OFFER/REQUEST/ACK stages is that multiple DHCP servers can answer with an OFFER, but the client will only REQUEST one of them (usually the first OFFER it hears). Providing you don't, eg, insist your IPs be assigned in monotonically increasing order, without any gaps, it's pretty workable. Also IIRC most DHCP clients once they've found a helpful server will then switch to unicast to that server, rather than broadcast, until their lease with that server times out and it is still uncontactable. (As a trivial example without even a shared database you can have both DHCP servers having the same subnet/gateway/etc, but different pool ranges; if you lose a DHCP server for long enough eventually clients from the missing DHCP server will end up changing IPs, but that's about the worst symptom providing both pools are big enough for all active clients.)
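The split-pool case can be sketched as a toy simulation in Python. Everything here is made up for illustration: the `DhcpServer` class, the server names, and the address ranges are assumptions, and "first OFFER heard" is modelled simply as list order rather than a real race on the wire.

```python
class DhcpServer:
    """Toy DHCP server that only hands out addresses from its own pool."""

    def __init__(self, name, pool):
        self.name = name
        self.free = list(pool)   # addresses not yet leased
        self.leases = {}         # MAC -> leased address
        self.up = True

    def offer(self, mac):
        """Answer a DISCOVER: re-offer a known lease, else a free address."""
        if not self.up:
            return None
        if mac in self.leases:
            return self.leases[mac]
        return self.free[0] if self.free else None

    def ack(self, mac, ip):
        """Answer a REQUEST for an address this server offered."""
        if ip in self.free:
            self.free.remove(ip)
        self.leases[mac] = ip
        return ip


def acquire(mac, servers):
    """Client side: DISCOVER, take the first OFFER, then REQUEST it."""
    for server in servers:
        ip = server.offer(mac)
        if ip is not None:
            return server.name, server.ack(mac, ip)
    return None, None


a = DhcpServer("dhcp-a", ["10.0.0.%d" % i for i in range(10, 13)])
b = DhcpServer("dhcp-b", ["10.0.0.%d" % i for i in range(20, 23)])

first = acquire("aa:bb:cc:00:00:01", [a, b])    # served from dhcp-a's pool
a.up = False                                    # dhcp-a dies
second = acquire("aa:bb:cc:00:00:01", [a, b])   # lease ran out, client starts over
```

In this run the client ends up with a different address from dhcp-b's pool after dhcp-a goes away, which is the "clients end up changing IPs" symptom and nothing worse.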

FWIW, active-passive failover on layer 2 is typically handled with something like VRRP/HSRP/etc, of a virtual IP. And most of the daemons doing that can start a service as a result of being promoted/demoted, eg start/stop the DHCP service, or run some command (eg, unblock incoming UDP/67 and UDP/68 for DHCP).

The IMAP active-active case can be solved in a similar manner: put two IPs in the DNS, round-robin, if you want some simple load spreading. Then have those two IPs handled by VRRP, each with their own default active host. If for some reason one is down, both virtual IPs end up on the single host. When the host comes back up, the virtual IP designated for that host migrates back. (The main catch is ensuring the VRRP process waits long enough for the host to actually be ready for load, but that's mostly a matter of tuning the "alive" monitoring.)
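The intended placement of the two virtual IPs can be sketched in a few lines of Python. This is only a model of the outcome, not of the VRRP protocol itself: the function `place_vips`, the host names, and the addresses are all hypothetical, and picking "the first surviving host" stands in for VRRP's priority-based election (with two hosts there is only one candidate anyway).

```python
def place_vips(preferred, alive):
    """Return {vip: host}. A VIP stays on its preferred host while that
    host is alive; otherwise it moves to a surviving host (or to None
    if nothing is left)."""
    survivors = sorted(alive)
    placement = {}
    for vip, host in preferred.items():
        if host in alive:
            placement[vip] = host
        elif survivors:
            placement[vip] = survivors[0]   # stand-in for the VRRP election
        else:
            placement[vip] = None
    return placement


# Both addresses would be published in DNS for the IMAP server name.
preferred = {"192.0.2.10": "imap1", "192.0.2.11": "imap2"}

normal = place_vips(preferred, {"imap1", "imap2"})   # one VIP per host
degraded = place_vips(preferred, {"imap1"})          # imap2 is down
```

Calling `place_vips` again with both hosts alive models the VIP migrating back to its designated host once that host has recovered.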

I do tend to agree that the reliability of client implementations in trying multiple IPs returned by DNS seems... likely to be unevenly implemented. Hence two always-working IPs. If the virtual IP fails over the TCP connections (TCP/143, TCP/993) will break. But most IMAP clients do handle reconnecting fairly gracefully these days, if only to deal with the endless stream of wireless networks/roaming/etc. (I also expect that for sane IMAP servers this would work even for a client that had some connections to one server, and some to another server. Maildir in particular was designed to support atomic changes in the file system by multiple processes. And many IMAP servers are running multiple processes per user anyway -- from the multiple clients -- which IIRC do not do any, eg, shared memory communication; the disk is the shared state.)

None of which is to say that you have a problem that needs solving. Just that solving it is perhaps a bit more tractable than you might have first thought. Particularly with a pool of N spare hosts that can be "targeted in an hour", that might actually be sufficient in the real world if you can survive without the service for an hour. (And I'd expect that, eg, for DHCP you can, providing your lease times are 2+ hours.)
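The "2+ hours" figure can be checked with a little arithmetic, sketched here in Python. The assumptions are that clients use the default renewal point from the DHCP specification (T1, half the lease time) and that every client renewed successfully right up to the moment the server died; the function names are made up for this example.

```python
HOUR = 3600


def worst_case_remaining(lease_seconds, t1_fraction=0.5):
    """Least time any already-leased client has left when the server dies,
    assuming each client renewed on schedule at T1 until then."""
    return lease_seconds * (1 - t1_fraction)


def survives_outage(lease_seconds, outage_seconds, t1_fraction=0.5):
    """True if no existing lease can expire before the server is back."""
    return worst_case_remaining(lease_seconds, t1_fraction) >= outage_seconds


two_hour_lease = survives_outage(2 * HOUR, 1 * HOUR)   # exactly enough margin
one_hour_lease = survives_outage(1 * HOUR, 1 * HOUR)   # some leases expire first
```

Under these assumptions a two hour lease leaves the unluckiest client with one hour, which just covers an hour-long rebuild. Brand new clients still get nothing until the replacement server is up.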


By Arnaud Gomes-do-Vale at 2015-03-01 05:14:14:

Just a quick note about DHCP: while failover has never been fully standardized AFAIR, it has been implemented for years in ISC DHCP, at least for IPv4. Basically you just set up two DHCP servers which split the available IP ranges between themselves. When server 1 fails, server 2 can only allocate IPs from its own ranges, but it can renew already-existing leases in server 1's pool.

I must confess I have never used this feature myself; I tend to just set up an active-passive pair with keepalived or pacemaker and rsync the leases database every few minutes and this has never failed me.

 -- Arnaud
Written on 28 February 2015.

Last modified: Sat Feb 28 23:40:14 2015