The problem with ZFS, SANs, and failover

June 12, 2008

The fundamental problem with Solaris 10's current version of ZFS in a SAN failover environment is that it has no concept of locking or host ownership of ZFS pools; instead, ZFS pools are either active (imported) or inactive (exported). So if a host crashes and you want to fail over its ZFS pools, the pools are still marked as active, which means that you must force the pool import. Forcing the import has catastrophic consequences if something ever goes wrong, such as the original host turning out to still have the pool in use.

But it gets worse. Because hosts don't normally deactivate their pools when they shut down, a booting host will grab all of the pools that it thinks it should have regardless of their active versus inactive status and thus regardless of whether they are being used by another machine, because it cannot tell.
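(To make the last two paragraphs concrete, here is a toy Python model of the situation. It is only a sketch of the behaviour as described above, not of how ZFS is actually implemented; the names Pool, import_pool and boot_import are made up for illustration.)

```python
from dataclasses import dataclass


class PoolActiveError(Exception):
    """Raised when a pool marked active is imported without forcing."""


@dataclass
class Pool:
    name: str
    # The only state in this model: active (imported) or inactive (exported).
    # Note what is missing: no owning host and no lock.
    active: bool = False


def import_pool(pool, force=False):
    """Model a manual import: an active pool can only be taken by forcing."""
    if pool.active and not force:
        raise PoolActiveError(f"{pool.name} is marked active; force is required")
    pool.active = True


def boot_import(pools_host_expects):
    """Model a booting host: it takes every pool it expects to have,
    without looking at the active flag at all."""
    for pool in pools_host_expects:
        pool.active = True
```

Nothing in this model can distinguish 'active because the crashed host never exported it' from 'active because another host is using it right now', which is why both the forced import and the boot-time import are dangerous.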

You can set ZFS pools so that they aren't imported automatically on boot (by using 'zpool import -R / ...'), but then something has to import them explicitly after every boot. In our iSCSI SAN environment each zpool import takes approximately a third of a second per LUN; a third of a second per LUN times a bunch of LUNs times a bunch of pools is an infeasibly long amount of time.
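(A back-of-the-envelope version of that arithmetic, as a small Python sketch. The third of a second per LUN is the figure from above; the LUN and pool counts in the example are hypothetical numbers picked for illustration, not our real ones, and the model assumes the imports are done one after another.)

```python
def estimate_import_seconds(visible_luns, pools, seconds_per_lun=1 / 3):
    """Estimate the total time to import every pool serially.

    Each 'zpool import' is assumed to cost seconds_per_lun for every
    LUN the host can see, and that cost is paid once per pool.
    """
    return seconds_per_lun * visible_luns * pools


if __name__ == "__main__":
    # Hypothetical sizes: 100 visible LUNs and 50 pools.
    seconds = estimate_import_seconds(visible_luns=100, pools=50)
    print(f"about {seconds / 60:.0f} minutes of imports")
```

With those made-up numbers the imports alone come to somewhere around 28 minutes, which is a long time to have a fileserver sitting there not serving files.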

(And before you ask, as far as I can tell there is no opportunity to do something clever before Solaris auto-imports all of the pools during boot because the auto-importation happens by special magic.)

The conclusion is that if a host crashes and you want to fail over its pools, you must make utterly sure that it will never spontaneously reappear. (I recommend going down to the machine room and pulling its power and network cables and removing its disks. Make sure you zero them before you return them to any spares pool you have.)

If you try hard enough there are ways around some of this, such as storage fencing, where you arrange with your backend so that each host in the SAN can only see the storage with the pools that it should be importing. But this is going to complicate your SAN and your failover, and again if anything ever goes wrong you will have catastrophic explosions.
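(If you do go down the storage fencing road, the property you want is that no LUN is ever visible to more than one host at a time. Here is a minimal Python sketch of checking a visibility map for that property; the map format and the host and LUN names are invented for illustration, and in real life you would have to build the map from whatever your SAN backend reports.)

```python
from collections import defaultdict


def find_fencing_violations(visibility):
    """Return {lun: [hosts]} for every LUN that more than one host can see.

    visibility maps each host name to the collection of LUN names that
    the backend currently exposes to it. An empty result means that the
    fencing is intact.
    """
    hosts_by_lun = defaultdict(set)
    for host, luns in visibility.items():
        for lun in luns:
            hosts_by_lun[lun].add(host)
    return {
        lun: sorted(hosts)
        for lun, hosts in hosts_by_lun.items()
        if len(hosts) > 1
    }
```

A check like this only tells you that the configuration looks right at the moment you run it. It does nothing about the extra moving parts that fencing adds to the failover itself.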

(Much of this is fixed in things currently scheduled for Solaris 10 update 6. Unfortunately we need to start deploying our new fileserver environment before then.)

Comments on this page:

From at 2008-08-04 12:25:59:

Chris, we're in about the same boat (CX3-40 SAN and Solaris 10u5). We've got an X4500 that works beautifully with ZFS and SAMFS/QFS (using a SL500 library). But since we have a nice CX infrastructure complete with replication to another building across campus, I'd like to take advantage of it. These discussions I've heard on zfs-discuss about SAN failover have sort of put a damper on that idea, though our testing so far has only been able to replicate the problem when removing the LUN(s) from the storage group (so far no "random" panics). We'd also like to go into production this month, so waiting for U6 really isn't an option. What have you decided to do?

One saving grace for our implementation is that we're using SAMQ to provide another layer of resilience. No, it won't improve HA (a panic is still a panic). But at least I feel less nervous about losing pools because we'll get (almost) everything back from the archive layers. Have you considered this? No, you shouldn't be required to implement such a layer just because ZFS might manage to crap out in this manner. But IMO if you've got more than a few TB to store, you really need a data management solution.

Any thoughts?

Charles Soto
The University of Texas at Austin

By cks at 2008-08-04 15:16:09:

I put the long answer in OurZFSSanFailoverAnswer; the short answer is that we're not doing failover as such, although we're building the basic infrastructure to enable it in the future. If we have to fail over an NFS fileserver, we'll do it by hand. (The NFS fileserver case is the easiest one, since those are mostly stateless machines that are all identical except for what pools they have.)

We're unlikely to get any sort of data management solution more advanced than Amanda, partly because of money issues, so that's not a really good protection against pools becoming entirely corrupt. (Even with a full backup, restoring is a slow process.)

Written on 12 June 2008.

Last modified: Thu Jun 12 23:41:54 2008