Why we may not be able to use ZFS
We would like to use ZFS in a new fileserver design, but we may not be able to. This is because ZFS in current versions of Solaris has a problem in production environments: if ZFS ever can't write to a pool, even just one, it panics your entire machine. It doesn't matter how many other pools are fine, and it doesn't matter what else is running on the machine; down it goes.
(It also doesn't matter if the pool is redundant or not; ZFS behaves the same if enough of a redundant pool is out of action.)
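To make the rule concrete, here is a minimal sketch of the behaviour as described above. It is only a toy model, not how ZFS is implemented; the Pool class, its fields, and the failures_tolerated number are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Pool:
    name: str
    devices_failed: int
    # Hypothetical knob: how many lost devices the pool layout can absorb
    # (0 for a non-redundant pool, 1 for a simple two-way mirror).
    failures_tolerated: int

    def writable(self) -> bool:
        # A redundant pool stays writable until "enough" of it is gone.
        return self.devices_failed <= self.failures_tolerated


def machine_panics(pools: Iterable[Pool]) -> bool:
    # The problem in one line: a single unwritable pool is enough,
    # no matter how many healthy pools sit next to it.
    return any(not pool.writable() for pool in pools)


healthy = Pool("home", devices_failed=0, failures_tolerated=1)
degraded = Pool("mail", devices_failed=1, failures_tolerated=1)
broken = Pool("scratch", devices_failed=1, failures_tolerated=0)

print(machine_panics([healthy, degraded]))          # degraded but writable
print(machine_panics([healthy, degraded, broken]))  # one bad pool sinks it
```

The point of the model is that the health of the other pools never enters into the decision.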
ZFS is unable to write to a pool if:
- there is a write timeout, for example if an AoE (ATA over Ethernet) target is unreachable for too long.
- it loses a device, for example if the Solaris initiator temporarily can't talk to an iSCSI target or the target temporarily rejects a login (which causes all of the disks to be offlined).
- it thinks that a device has been replaced by a different device, for example if a SAN logical disk's serial number changes for some reason.
- the device simply rejects writes.
(An unwriteable device is real fun, because ZFS attempts to write to the pool when it brings the pool up; if the pool is one the machine tries to start when it boots, your machine will sit there in an endless loop of boot-time panics.)
In a SAN environment, these panics can happen whether you export raw disks and have ZFS do the RAID on top of them or export logical disks and have the SAN controller handle the RAID. You can design a raw disk SAN environment where any fileserver can survive losing any one RAID controller, but it is awfully interconnected.
(And if you are willing to lose half your disk space, you can mirror everything across controllers. We have been told by our users that they are not willing to pay for that level of redundancy.)
Sun's fix for this bug is in OpenSolaris Nevada build 77, but is apparently not going to be available in Solaris before Solaris 10 update 5, currently scheduled for April 2008. (I cannot say that I am pleased about this timeline, given that our own bug report is more than six months old and Sun has known about this issue for at least a year or so.)
If we only build our fileserver environment after S10U5 is released, the decision of whether or not to use ZFS is easy.
If we build our new environment before then (and we probably should), the decision is less easy. ZFS gets us significant user features, but using ZFS means we run some risk of catastrophic panics that take down one or several fileservers due to a single SAN backend controller having a temporary glitch.
(We're willing to say that losing the network or multiple controllers is a catastrophic failure, partly because the alternative is a DiskSuite based environment and DiskSuite has its own issues with failures.)
Right now we're mulling over how bad the risks are in practice, and how compelling the benefits of ZFS are in our contemplated design. I probably need to sketch out some sort of failure matrix to work out everything that could go wrong and what the impact of each failure would be in a carefully set up ZFS environment, as compared to an equivalent DiskSuite environment.
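As a rough starting point, the ZFS half of such a failure matrix could be sketched like this. Everything here is hypothetical: the fileserver, pool, and controller names and the layout itself are invented for illustration, and the only rule modelled is the one described in this entry (one unwritable pool panics the whole fileserver). The DiskSuite column is deliberately left out, since how it would behave in each case is exactly what still has to be worked out.

```python
# Hypothetical layout: fileserver -> pool -> list of vdevs, where each vdev
# is the set of SAN controllers that hold a copy of its data.
LAYOUT = {
    "fs1": {
        "pool-a": [{"ctrl1", "ctrl2"}],  # mirrored across two controllers
        "pool-b": [{"ctrl3"}],           # lives on a single controller
    },
    "fs2": {
        "pool-c": [{"ctrl2", "ctrl3"}],
    },
}


def pool_writable(vdevs, failed):
    # A vdev survives if at least one of its controllers is still up;
    # the pool needs every one of its vdevs to survive.
    return all(vdev - failed for vdev in vdevs)


def panicking_servers(layout, failed):
    # One unwritable pool takes down the whole fileserver.
    return sorted(
        server
        for server, pools in layout.items()
        if any(not pool_writable(vdevs, failed) for vdevs in pools.values())
    )


def failure_matrix(layout):
    # Fail each controller on its own and record which servers go down.
    controllers = sorted(
        ctrl
        for pools in layout.values()
        for vdevs in pools.values()
        for vdev in vdevs
        for ctrl in vdev
    )
    return {ctrl: panicking_servers(layout, {ctrl}) for ctrl in dict.fromkeys(controllers)}


for ctrl, servers in failure_matrix(LAYOUT).items():
    print(ctrl, "->", servers or "no panics")
```

In this made-up layout, only the controller behind the unmirrored pool can take a fileserver down by itself. Passing a larger set to panicking_servers covers the multiple-controller cases that we are willing to call catastrophic anyway.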
(Until very recently, this entry would have had a much stronger title, but Sun actually does have a fix now.)