2008-04-21
Dear ZFS: please stop having your commands stall
One of my serious irritations with ZFS is how various ZFS commands (or
at least sub-commands of zpool) will stall badly if they can't talk to
some of the underlying disk devices. This is especially apparent when
you're using iSCSI; for example, I accidentally booted a system with the
iSCSI cable plugged into the wrong port, and once the system had booted,
zpool list simply hung.
(Worse, it hung uninterruptibly; I could not stop it with ^C, use job
control to background it, make it abort with ^\, or even kill -9 it
from another window.)
One of the unfortunate effects of this is that it seriously hampers
my ability to do diagnostic work, because both zpool status and
zpool iostat -v stall or run very, very slowly. (iostat itself works
fine, which makes me really irritated with ZFS.)
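If you want a diagnostic script to survive this, about the best it can do is refuse to wait. Here is a minimal sketch in Python of running a command with a deadline and walking away from it if even SIGKILL doesn't take. The function name, the state strings, and the timeout values are all made up for illustration, and this does nothing to un-stick the hung process; it only keeps the script that started it from hanging too.

```python
import subprocess


def run_with_deadline(argv, deadline_secs, grace_secs=1.0):
    """Run argv and return (state, output) without waiting forever.

    state is "exited", "killed", or "stuck"; these names are
    hypothetical and exist only for this sketch.
    """
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    try:
        out, _ = proc.communicate(timeout=deadline_secs)
        return "exited", out.decode(errors="replace")
    except subprocess.TimeoutExpired:
        pass

    # Past the deadline: send SIGKILL, then give it a short grace period.
    proc.kill()
    try:
        out, _ = proc.communicate(timeout=grace_secs)
        return "killed", out.decode(errors="replace")
    except subprocess.TimeoutExpired:
        # The process ignored SIGKILL (it is presumably blocked in the
        # kernel), so abandon it instead of blocking in wait().
        return "stuck", ""
```

Something like run_with_deadline(["zpool", "list"], 10) would then come back with "stuck" instead of hanging, and the script can go on to run plain iostat. Note that the obvious shortcut, subprocess.run() with a timeout, is not enough here; it kills the child on timeout and then waits for it to exit, which is exactly the wait that never finishes.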
(It is possible that Solaris MPxIO is contributing to this, since our 'iSCSI' devices are actually the MPxIO versions, but as a sysadmin I don't care exactly why the ZFS commands stall, just that they do. The downside of Sun owning the entire stack is that they don't get to point fingers at anyone else.)
I believe that ZFS commands behave okay if the iSCSI machine is explicitly rejecting Solaris's connection attempts (or reporting that the target or the LUN doesn't exist or the like). What seems to be near-fatal is when the iSCSI target simply isn't responding. Unfortunately this is the most likely failure mode: a switch failure, a controller failure, a controller rebooting, and so on.
(You also get the same issue if the iSCSI target is responding very, very slowly, as I found out when our theoretically jumbo frame capable gigabit switch decided to switch jumbo frames so slowly that it had a bandwidth measured in kilobytes per second.)