Failover versus sparing in theory and in practice
Suppose that you have a fileserver infrastructure with some number of physical servers, a backend storage network, and some number of logical fileservers embodied on top of all of this. Broadly speaking, there are two strategies you can follow if one of those physical servers has problems. You can move the logical fileserver that the physical server hosts over to another machine, perhaps a hot spare server, or you can replace the physical host in place with some amount of spare hardware, for example by simply removing the system disks and putting them in a new server unit. Let's call these two options 'failover' and 'sparing'.
In theory, failover has a bunch of advantages; for example, you can do it without physical access to the machines, and it survives more kinds of host failure (eg the system disks dying or the installed system getting corrupted). Also in theory our fileserver environment was deliberately engineered to support failover, for example by having the idea of 'logical fileservers' at all. In practice we've basically abandoned the use of failover; when serious hardware problems emerge our answer is almost always to spare the hardware out. There are at least two reasons for this.
First, failover in our environment is very slow. An ordinary ZFS pool import in an iSCSI environment with multiple pools and many iSCSI disks is impressively slow to start with, each fileserver has several pools to bring up, and then there is the other work of adding IP aliases and so on. In practice a failover takes long enough to qualify as 'achingly slow' and is also significantly disruptive for NFS clients.
(I believe that we've also had issues with things like NFS lock state not fully recovering after a failover attempt. Possibly this could be worked around if we did the right things.)
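To make the shape of the slowness concrete, here is a rough sketch of how the pieces add up. Everything in it is an assumption for illustration: the pool names, the disk counts, the timings, the idea that import time grows linearly with the number of iSCSI disks, and the idea that pools are imported one after another. None of these are measurements from our environment.

```python
# Rough model of failover downtime. All names and numbers are hypothetical.
from dataclasses import dataclass


@dataclass
class Pool:
    name: str          # made-up pool name
    iscsi_disks: int   # number of iSCSI disks backing the pool


def import_seconds(pool, base_seconds, per_disk_seconds):
    # Assumption: a fixed cost per import plus a cost per iSCSI disk.
    return base_seconds + per_disk_seconds * pool.iscsi_disks


def failover_seconds(pools, base_seconds, per_disk_seconds, other_work_seconds):
    # Assumption: pools are imported serially, then the remaining work
    # (adding IP aliases and so on) happens as one lump at the end.
    imports = sum(import_seconds(p, base_seconds, per_disk_seconds) for p in pools)
    return imports + other_work_seconds


pools = [Pool("pool-a", 12), Pool("pool-b", 8), Pool("pool-c", 16)]
total = failover_seconds(pools, base_seconds=20, per_disk_seconds=5,
                         other_work_seconds=30)
print(f"estimated failover time: {total} seconds")
```

The point of the sketch is only that the per-pool cost is paid once for every pool on the fileserver, so even a tolerable single import turns into minutes of NFS outage.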
Second, our backups are tied to the real hosts instead of the logical fileservers. Failing over a fileserver to a different real host for any length of time means that the backup system needs extensive and daunting mangling (or alternatively we live with it abruptly doing full backups of terabytes of 'new' filesystems, throwing off the backup schedule for existing ones). This makes failover massively disruptive in practice for anything beyond short-term things (where by 'short term' I mean 'before the next nightly backups run').
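The backup problem comes down to what the backup history is keyed on. The following is a minimal sketch of that, not a description of our actual backup system; the host names, filesystem paths, and the `needs_full_backup` helper are all hypothetical.

```python
# Sketch of why host-keyed backups treat failed-over filesystems as 'new'.
# All names here are hypothetical.

def needs_full_backup(history, key_host, filesystem):
    # With no prior backup recorded under this key, there is nothing to do
    # an incremental against, so a full backup is required.
    return (key_host, filesystem) not in history


filesystems = ["/h/100", "/h/101"]

# History keyed by the real host, as in a setup tied to physical machines.
host_keyed = {("physhost1", fs) for fs in filesystems}

# The same history keyed by the logical fileserver name instead.
logical_keyed = {("fs1", fs) for fs in filesystems}

# Logical fileserver fs1 fails over from physhost1 to physhost2.
full_by_host = [fs for fs in filesystems
                if needs_full_backup(host_keyed, "physhost2", fs)]
full_by_logical = [fs for fs in filesystems
                   if needs_full_backup(logical_keyed, "fs1", fs)]

print("full backups needed, keyed by real host:", full_by_host)
print("full backups needed, keyed by logical name:", full_by_logical)
```

Keyed by real host, every filesystem on the failed-over fileserver looks new after the move. Keyed by logical fileserver, nothing changes from the backup system's point of view.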
By contrast, swapping fileserver hardware is easy, relatively fast, and pretty much completely reliable unless the installed system has become corrupted somehow. To both the server and the clients it just looks like an extended crash or other downtime, and things recover as well as they ever do from that. So far the only tricky bit about such hardware shifts has been getting the system to accept the 'new' Ethernet devices as its proper Ethernet devices.
We'll probably keep our current design on our new fileserver hardware, complete with the possibility of failing over a logical fileserver. But I don't expect failover to work any better than before, so we'll probably keep doing physical sparing of problem hardware even in the future.
(One thing writing this entry has pointed out to me is that we ought to work out a tested and documented procedure for transplanting system disks from machine to machine under OmniOS and our new hardware. Sooner or later we'll probably need it.)