Our answer to the ZFS SAN failover problem
August 4, 2008
A while back I wrote about the ZFS SAN failover problem, and recently a commenter asked what we've decided to do about it. Our current answer to the problem is simple but somewhat brutal: we're not going to do failover as such.
We're still including basic support for failover in our NFS server environment, things like virtual fileserver IPs and a naming convention for ZFS pools that includes what fileserver they are part of, but we're not trying to build any explicit failover support, especially automatic failover. If we have to fail over a fileserver, it will be a by-hand process.
Note that ZFS makes by-hand failover for NFS servers not very much work, because almost everything you need is already remembered by the pools. All we'd need to do is get a list of the pools on the down fileserver (made easy by the naming convention and 'zpool import', which lists the pools available for import), import them on another fileserver, and bring up the down fileserver's virtual IPs there.
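For concreteness, the pool-gathering step could be sketched in shell like this. This is illustrative, not our actual scripts: the function name, the 'fs1' server name, and the assumption that the naming convention is '<fileserver>-<poolname>' are all mine here.

```shell
# A sketch of picking out a dead fileserver's pools, assuming pools
# are named '<fileserver>-<poolname>' and that we are fed the output
# of 'zpool import' (which lists pools available for import).
failover_commands() {
    dead="$1"
    # keep the 'pool: NAME' lines, select the dead server's pools,
    # and emit the forced-import commands an operator would then run
    awk '/^[ \t]*pool:/ {print $2}' |
        grep "^${dead}-" |
        while read -r pool; do
            printf 'zpool import -f %s\n' "$pool"
        done
}

# usage: zpool import | failover_commands fs1
```

After the imports, the pools bring most of the rest with them (this is the 'almost everything is remembered by the pools' part), so what's left is moving the virtual fileserver IPs over.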
Apart from the relative ease of doing manual failover if we have to, there are several mitigating factors that make this more sensible than it looks. First, it seems clear that we can't do automatic failover, because it is just too dangerous if anything goes wrong (and we don't trust ourselves to build a system that guarantees nothing will ever go wrong). This means that we are not losing much by not automating some of the by-hand work, and an after-hours problem won't get fixed any slower this way; in either case it has to wait for sysadmins to come in.
Second, given the slow speed of
Third, we're using generic hardware with mirrored, easily swapped system disks. This means that if even a single system disk has survived, we can transplant it into a spare chassis and (with a bit of work) bring the actual server back online; it might even be faster than failing over the fileserver. So to entirely lose a server, we have to lose both system disks at once, which we hope is a very rare event.
(This is where operating system bugs and sysadmin mistakes come into the picture, of course.)