Wandering Thoughts archives

2008-04-21

Dear ZFS: please stop having your commands stall

One of my serious irritations with ZFS is how various ZFS commands (or at least sub-commands of zpool) will stall badly if they can't talk to some of the underlying disk devices. This is especially apparent when you're using iSCSI; for example, I accidentally booted a system with the iSCSI cable plugged into the wrong port, and once the system had booted, zpool list simply hung.

(Worse, it hung uninterruptibly; I could not stop it with ^C, use job control to background it, make it abort with ^\, or even kill -9 it from another window.)

One of the unfortunate effects of this is that it badly hampers my ability to do diagnostic work, because both zpool status and zpool iostat -v stall or run very, very slowly. (iostat itself works fine, which makes me really irritated with ZFS.)
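As a rough illustration of how a diagnostic script might protect itself from this, here is a minimal Python sketch that runs a command under a deadline and gives up on it rather than waiting. The helper name run_with_deadline and the ten-second default are made up for the example. It assumes Python 3 is available, and it deliberately does not wait for the child after trying to kill it, since a command stuck uninterruptibly in the kernel may never actually exit.

```python
import subprocess


def run_with_deadline(argv, deadline=10.0):
    """Run argv and return (exit_status, output).

    If the command has not finished within `deadline` seconds, return
    (None, "") instead of blocking. This is a hypothetical helper for
    illustration, not part of any ZFS or Solaris tooling.
    """
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    try:
        output, _ = proc.communicate(timeout=deadline)
        return proc.returncode, output
    except subprocess.TimeoutExpired:
        # Best effort only: a process hung uninterruptibly ignores
        # SIGKILL, so we must not call proc.wait() here or we would
        # hang right along with it.
        proc.kill()
        return None, ""
```

A script could then call something like run_with_deadline(["zpool", "status"]) and treat a None status as "ZFS is stalled" instead of hanging itself. This does nothing to unstick the stalled command; it only keeps the stall from spreading to whatever is calling it.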

(It is possible that Solaris MPxIO is contributing to this, since our 'iSCSI' devices are actually the MPxIO versions, but as a sysadmin I don't care exactly why the ZFS commands stall, just that they do. The downside of Sun owning the entire stack is that they don't get to point fingers at anyone else.)

I believe that ZFS commands behave okay if the iSCSI machine is explicitly rejecting Solaris's connection attempts (or reporting that the target or the LUN doesn't exist or the like). What seems to be near-fatal is when the iSCSI target simply isn't responding. Unfortunately this is the most likely failure mode, since it is what you get from switch failure, controller failure, a controller rebooting, and so on.
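The difference between "rejecting" and "not responding" can be checked from outside ZFS, at least at the TCP level. The following Python sketch is only an illustration: probe_target is a made-up name, the three-second timeout is arbitrary, and port 3260 is the standard registered iSCSI port (your targets may use a different one). It only looks at whether a TCP connection is accepted, refused, or met with silence; it does not speak iSCSI, so it cannot see rejections that happen at iSCSI login time or problems with a specific LUN.

```python
import socket

ISCSI_PORT = 3260  # standard registered iSCSI target port; adjust if yours differs


def probe_target(host, port=ISCSI_PORT, timeout=3.0):
    """Classify a TCP connection attempt to an iSCSI target.

    Returns one of:
      "accepting" - something accepted the TCP connection
      "refused"   - the host actively refused the connection
      "silent"    - no answer within `timeout` seconds
      "error"     - some other failure (no route, bad hostname, ...)
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "accepting"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "silent"
    except OSError:
        return "error"
```

By the reasoning above, "silent" is the answer to worry about, since that is the case where the ZFS commands seem to stall. An "accepting" answer does not prove the target is healthy, because a target that is responding very slowly may still accept connections.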

(You also get the same issue if the iSCSI target is responding very, very slowly, as I found out when our theoretically jumbo frame capable gigabit switch decided to switch jumbo frames so slowly that it had a bandwidth measured in kilobytes per second.)

solaris/ZFSZpoolStalls written at 23:02:04

