Our answer to the ZFS SAN failover problem

August 4, 2008

A while back I wrote about the ZFS SAN failover problem, and recently a commenter asked what we've decided to do about it. Our current answer is simple but somewhat brutal: we're not going to do failover as such.

We're still including basic support for failover in our NFS server environment (things like virtual fileserver IPs and a naming convention for ZFS pools that records which fileserver they belong to), but we're not building any explicit failover machinery, especially not automatic failover. If we ever have to fail over a fileserver, it will be a by-hand process.

Note that ZFS makes by-hand failover for NFS servers relatively painless, because the pools themselves remember almost everything you need. All we'd have to do is get a list of the pools on the down fileserver (made easy by the naming convention and 'zpool import'), import them all on another server, and add the virtual fileserver IP as an alias on the new server.
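To sketch what this looks like in practice (the fileserver name 'fs2', the pool names, the network interface, and the IP address below are all made up for illustration; this is the rough shape of the Solaris commands, not a tested procedure):

    # On a surviving fileserver, list the pools that are available to
    # import; the naming convention makes the dead fileserver's pools
    # easy to pick out.
    zpool import | grep fs2

    # Import each of the dead fileserver's pools on this machine.
    zpool import fs2-pool1
    zpool import fs2-pool2

    # Re-share the filesystems over NFS if they weren't shared
    # automatically on import (the sharenfs property travels with the
    # pool).
    zfs share -a

    # Finally, take over the dead fileserver's virtual IP by adding it
    # as an alias (a logical interface) on this machine.
    ifconfig e1000g0 addif 192.168.1.20 netmask 255.255.255.0 up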

Apart from the relative ease of doing manual failover if we have to, there are several mitigating factors that make this more sensible than it looks. First, it seems clear that we can't do automatic failover, because it is just too dangerous if anything goes wrong (and we don't trust ourselves to build a system that guarantees nothing will ever go wrong). This means that we are not losing much by not automating some of the by-hand work, and an after-hours problem won't get fixed any slower this way; in either case it has to wait for sysadmins to come in.

Second, given how slow 'zpool import' is in a SAN environment, any failover is going to be a very slow process (we're looking at tens of minutes). Since even automatic failover would be very visible to users, having manual failover be somewhat more visible is not necessarily a huge step worse. This also means that the only time any sort of failover makes sense is when a server has failed entirely.

Third, we're using generic hardware with mirrored, easily swapped system disks. This means that if even a single system disk has survived, we can transplant it into a spare chassis and (with a bit of work) bring the actual server back online; it might even be faster than failing over the fileserver. So to entirely lose a server, we have to lose both system disks at once, which we hope is a very rare event.

(This is where operating system bugs and sysadmin mistakes come into the picture, of course.)
