Wandering Thoughts archives


Failover versus sparing in theory and in practice

Suppose that you have a fileserver infrastructure with some number of physical servers, a backend storage network, and some number of logical fileservers embodied on top of all of this. Broadly speaking, there are two strategies you can follow if one of those physical servers has problems. You can fail the logical fileserver the physical server hosts over to another machine, perhaps a hot spare server, or you can replace the physical host in place with some amount of spare hardware, for example by simply removing the system disks and putting them in a new server unit. Let's call these two options 'failover' and 'sparing'.

In theory, failover has a bunch of advantages, such as being able to do it without physical access to the machines and surviving more sorts of host failure (eg the system disks dying or the installed system getting corrupted). Also in theory, our fileserver environment was deliberately engineered to support failover, for example by having the idea of 'logical fileservers' at all. In practice we've basically abandoned the use of failover; when serious hardware problems emerge, our answer is almost always to spare the hardware out. There are at least two reasons for this.

First, failover in our environment is very slow. An ordinary ZFS pool import in an iSCSI environment with multiple pools and many iSCSI disks is impressively slow to start with, plus each fileserver has several pools to bring up, plus the other work of adding IP aliases and so on. In practice a failover takes long enough to qualify as 'achingly slow' and also significantly disruptive for NFS clients.
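The steps involved can be sketched as a dry-run script; all of the pool, interface, and address names here are invented for illustration, and this is not our actual procedure, just the general shape of one:

```shell
#!/bin/sh
# Hypothetical sketch of a manual failover for one logical fileserver.
# By default it only prints the commands instead of running them;
# change run() to execute "$@" if you actually want to do the work.
run() { echo "would run: $*"; }

POOLS="fs1-pool1 fs1-pool2"   # each fileserver has several pools
FS_IP="192.0.2.51"            # the logical fileserver's service IP

for pool in $POOLS; do
    # Importing each pool forces ZFS to taste all of its iSCSI disks;
    # this is the achingly slow part.
    run zpool import -f "$pool"
done

# Bring up the logical fileserver's IP alias (OmniOS ipadm syntax).
run ipadm create-addr -T static -a "$FS_IP/24" e1000g0/fs1

# Re-share all of the imported filesystems over NFS.
run zfs share -a
```

Even in this simplified form, the import loop is serial work that scales with the number of pools and disks, which is where the time goes.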

(I believe that we've also had issues with things like NFS lock state not fully recovering after a failover attempt. Possibly this could be worked around if we did the right things.)

Second, our backups are tied to the real hosts instead of the logical fileservers. Failing over a fileserver to a different real host for any length of time means that the backup system needs extensive and daunting mangling (or alternatively we live with it abruptly doing full backups of terabytes of 'new' filesystems, throwing off the backup schedule for the existing ones). This makes failover massively disruptive in practice for anything beyond short-term things (where by 'short term' I mean 'before the next nightly backups run').

By contrast, swapping fileserver hardware is easy, relatively fast, and pretty much completely reliable unless the installed system has become corrupted somehow. To both the server and the clients it just looks like an extended crash or other downtime, and things recover as well as they ever do from that. So far the only tricky bit about such hardware shifts has been getting the system to accept the 'new' Ethernet devices as its proper Ethernet devices.

We'll probably keep our current design on our new fileserver hardware, complete with the possibility for failover of a logical fileserver. But I don't expect it to work any better than before so we'll probably keep doing physical sparing of problem hardware even in the future.

(One thing writing this entry has pointed out to me is that we ought to work out a tested and documented procedure for transplanting system disks from machine to machine under OmniOS and our new hardware. Sooner or later we'll probably need it.)

sysadmin/FailoverVersusSparing written at 23:20:38

Backup systems, actual hosts, and logical hosts

One of the little but potentially important differences between backup systems is whether they can back up logical hosts or if, for one reason or another, they can only back up actual hosts. Since this sounds like a completely abstract situation, let's set up a concrete one.

Let's suppose that you have three fileserver hosts, call them A, B, and C, and two logical fileservers, fs1 and fs2 (and some sort of movable or shared storage system behind A, B, and C). Actual filesystems are associated with a logical fileserver, while each logical fileserver is hosted on a particular machine (with one machine left over as a spare).

If your backup system will back up logical hosts, you can tell it 'back up fs1:/a/fred and fs2:/b/barney', have this work, and have the backup system associate things like index metadata about what file is in what backup run with these logical names. This is what you want because it means your backup system doesn't care which physical host fs1 and fs2 are on, which in turn makes it much easier to move fs1 from A to C in an emergency. However if your backup system insists on dealing with real hosts then you must tell it 'back up A:/a/fred and B:/b/barney', all of the index metadata and so on is associated with A and B, and the backup system will either explode or require manual attention if /a/fred ever winds up on C. This is obviously not really very desirable.
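The difference can be illustrated with a toy backup index keyed each way; the names come from the example above, and the code is purely illustrative, not any real backup system:

```python
# Toy illustration: a backup index keyed by real host loses track of a
# logical fileserver when it moves, while one keyed by logical host
# follows it.

# Which real host currently runs which logical fileserver.
placement = {"fs1": "A", "fs2": "B"}

# Index keyed by real host: metadata is tied to wherever fs1 happened
# to be when the backup ran.
index_by_real = {("A", "/a/fred"): "backup-run-1"}

# Index keyed by logical host: metadata follows fs1 wherever it goes.
index_by_logical = {("fs1", "/a/fred"): "backup-run-1"}

def lookup_real(fs, path):
    # A real-host backup system only knows machines, so we must look
    # under whatever host fs currently lives on.
    return index_by_real.get((placement[fs], path))

def lookup_logical(fs, path):
    return index_by_logical.get((fs, path))

# Emergency: fs1 fails over from A to C.
placement["fs1"] = "C"

print(lookup_real("fs1", "/a/fred"))     # None: history lost, surprise full backup
print(lookup_logical("fs1", "/a/fred"))  # backup-run-1: history intact
```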

You might think that of course a backup system will back up logical hosts instead of insisting on real hosts. In practice there are all sorts of ways for a backup system to quietly need real hosts. Does the client software send the local hostname to the server as part of the protocol? Does the client software make network connections to the server and the server use the IP address those connections come from to do stuff like verify access rights, connect incoming backup streams to requested backups, or the like? Then your backup system might be implicitly requiring you to use real hosts.
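For instance, a server that verifies access rights by the source IP of incoming connections is implicitly keyed to real hosts. Here's a toy sketch of the problem; all of the addresses and names are invented:

```python
# Toy sketch of a backup server that uses the source IP of incoming
# connections to verify access and to decide whose backup this is.
# Only the real hosts' primary addresses are registered.
registered_clients = {"192.0.2.10": "A", "192.0.2.11": "B"}

def accept_backup(peer_ip, stream_name):
    host = registered_clients.get(peer_ip)
    if host is None:
        # If fs1 fails over to C (say 192.0.2.12), its backups now
        # arrive from an address the server has never heard of.
        raise PermissionError(f"unregistered backup client {peer_ip}")
    # Otherwise the stream is filed under the real host, not under the
    # logical fileserver that actually holds the filesystem.
    return (host, stream_name)

print(accept_backup("192.0.2.10", "/a/fred"))   # ('A', '/a/fred')
```

Either failure mode (rejection, or silently filing everything under the real host) is the backup system quietly requiring real hosts.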

(Even if the backup system theoretically copes with backing up logical hosts it may have limitations that will cause problems if two logical hosts ever wind up on the same real host or if you try to back up both the logical host and some stuff on the real host. This split between logical hosts and real hosts is a corner case and it exposes any number of potential issues.)

sysadmin/BackupHostsRealOrLogical written at 01:19:11
