== The problem with Solaris 10 update 6's ZFS _failmode_ setting

After [[I was so negative on ZFS's new _failmode_ setting ZFSAndSolaris10U6]], one might sensibly ask what the problem with it is.

(Background: the ZFS _failmode_ setting controls what happens when ZFS can't perform IO to a pool because the pool has totally lost redundancy. It has three settings, one to panic your system (just like the [[old behavior ZFSWritePanic]]), one to block all IO until the devices recover, and one to continue as much as possible.)

The problem that I observed in our iSCSI-based environment is that if you use any non-panic _failmode_ setting, a ZFS pool failure of this sort eventually winds up hanging the kernel's entire ZFS infrastructure (piece by piece; it does not happen all at once). This partially affects even unrelated pools, pools that are still fully intact. The hang persists even if connectivity to the disks returns, and is so thorough that the system will not reboot; I consistently had to power-cycle our test server in order to recover it.

The direct cause of the hang seems to be asking the kernel for detailed ZFS pool information about a problem pool (after enough time has elapsed). Running '_zpool status_' is one way to cause this to happen (even on unrelated pools), but it gets worse; _fmd_ (the [[useless fault manager daemon FaultManagerIrritation]]) also asks the kernel for this information every so often, thereby guaranteeing that this happens no matter what you do. As far as I can tell, you cannot really disable _fmd_ without causing huge problems.

The net effect is that in a failure, ~~your ZFS pool hangs irretrievably after a while~~, eventually taking much of the rest of the system with it. For us, this is actually far worse than the system panicking and rebooting without some ZFS pools.

(I managed to capture some kernel crash dumps; the affected processes, including _sync_, seemed to be stuck in ((zfs_ioc_pool_stats)) in the kernel.)

(This is probably a [[known bug http://permalink.gmane.org/gmane.os.solaris.opensolaris.zfs/20257]]. Insert rant here about an 'enterprise ready' operating system where you cannot run fault diagnosis programs during a fault without making the situation much, much worse.)
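
(For reference, here is a minimal sketch of how the _failmode_ pool property is looked at and changed. The three values are ((wait)) (block IO until the devices recover), ((continue)) (carry on as much as possible), and ((panic)). The pool name ((tank)) and the ((set_failmode)) helper are made up for illustration; only '_zpool get_' and '_zpool set_' are the real commands.)

```shell
#!/bin/sh
# Sketch only: "tank" is a placeholder pool name and set_failmode is a
# hypothetical helper, not something Solaris ships.

# Set failmode on a pool, refusing anything but the three documented values
# so that a typo never reaches zpool.
set_failmode() {
    pool=$1
    mode=$2
    case "$mode" in
        wait|continue|panic) ;;
        *) echo "unknown failmode: $mode" >&2; return 2 ;;
    esac
    zpool set failmode="$mode" "$pool"
}

# Show the current setting for a pool:
#   zpool get failmode tank
#
# Switch a pool back to the old panic-on-failure behavior:
#   set_failmode tank panic
```

Since the property is per pool, this would have to be repeated for every pool on the machine. The example switches to ((panic)) only because that is the behavior the entry above ends up preferring over a hang.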