ZFS may panic your system if you have an exceptionally slow IO

January 2, 2017

Today, one of our ZFS fileservers panicked. The panic itself is quite straightforward:

genunix: I/O to pool 'fs0-core-01' appears to be hung.
genunix: ffffff007a0c5a20 zfs:vdev_deadman+10b ()
genunix: ffffff007a0c5af0 zfs:spa_deadman+ad ()

The spa_deadman function is in spa_misc.c and vdev_deadman is in vdev.c. The latter has this important comment:

 * Look at the head of all the pending queues,
 * if any I/O has been outstanding for longer than
 * the spa_deadman_synctime we panic the system.

The spa_deadman_synctime value comes from zfs_deadman_synctime_ms, in spa_misc.c:

 * [...]
 * Secondly, the value determines if an I/O is considered "hung".
 * Any I/O that has not completed in zfs_deadman_synctime_ms is
 * considered "hung" resulting in a system panic.
uint64_t zfs_deadman_synctime_ms = 1000000ULL;

That's 1000 seconds, or 16 minutes and 40 seconds.
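To make the arithmetic and the check concrete, here is a minimal Python sketch of the logic as the quoted comments describe it. This is an illustrative model only, not the kernel code (which is C); the function names and the dictionary of per-device start times are hypothetical, and the strict 'longer than' comparison is my reading of the comment.

```python
# Illustrative model only; the real check lives in C in vdev.c and spa_misc.c.
ZFS_DEADMAN_SYNCTIME_MS = 1_000_000  # the default value quoted above


def split_ms(ms):
    """Convert milliseconds to (minutes, seconds), dropping any fraction."""
    minutes, seconds = divmod(ms // 1000, 60)
    return minutes, seconds


def deadman_would_fire(oldest_start_ms, now_ms,
                       synctime_ms=ZFS_DEADMAN_SYNCTIME_MS):
    """Return the devices whose oldest pending I/O is over the limit.

    oldest_start_ms is a hypothetical mapping of device name to the start
    time (in ms) of the I/O at the head of its pending queue, or None if
    the queue is empty. Any non-empty result stands for a panic.
    """
    return [name for name, start in oldest_start_ms.items()
            if start is not None and now_ms - start > synctime_ms]


if __name__ == "__main__":
    print(split_ms(ZFS_DEADMAN_SYNCTIME_MS))
    # One stuck I/O on one device is enough, whatever the other devices do.
    print(deadman_would_fire({"disk0": 0, "disk1": 999_500, "disk2": None},
                             now_ms=1_000_001))
```

In this model, a single device with one sufficiently old I/O at the head of its queue is enough to trigger the panic, no matter how healthy the other devices are.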

By 'completed' I believe that ZFS includes 'has resulted in an error', including a timeout error from, e.g., the SCSI system. Normally you would expect IO systems to time out IO requests well before 16 minutes, but apparently something in our multipathed iSCSI setup did not do this and so ZFS pushed the big red button of a panic.

(Assuming that the layers below ZFS will always time out a stuck IO is somewhat dangerous under some circumstances. If you have a ZFS pool built from files on an NFS-mounted filesystem, for example, NFS will wait endlessly for server IO to complete. And while this is extreme, there are vaguely plausible situations where file-backed ZFS pools make some sense.)

Note that this behavior is completely unrelated to the ZFS pool failmode setting. It can happen before ZFS reports any pool errors, and it can happen when the only problem is a single IO to a single underlying disk (and the pool has retained full redundancy throughout and so on). All it needs is one hung IO to one device used by one pool and your entire system can panic (and then sit there while it slowly writes out a crash dump, if you have those configured).

However, I've decided that I'm not particularly upset by this. The fileserver was in some trouble before the panic (I assume due to IO problems to iSCSI backend disk(s)), and rebooting seems to have fixed things for now. At least some of the time, panicking and retrying from scratch is a better strategy than banging your head against the wall over and over; this time seems to be one of them.

(I might feel differently if we had important user level processes running on these machines, like database servers or the like.)

In the short term we're unlikely to change this deadman timeout or disable it. I'm more interested in trying to find out what our iSCSI IO timeouts actually are and see if we can lower them so that the kernel will spit out timeout errors well before that much time goes by (say a couple of minutes at the outside). Unfortunately there are a lot of levels and moving parts involved here, so things are likely to be complex (and compounding on each other).

Sidebar: The various levels I think we have in action here

From the top downwards: OmniOS scsi_vhci multipathing, OmniOS generic SCSI, OmniOS iSCSI initiator, our Linux iSCSI target, the generic Linux block and SCSI layers, the Linux mpt2sas driver, and then the physical SSDs involved. Probably some of these levels do some amount of retrying of timed out requests before they pass problems back to higher levels, which of course compounds this sort of issue (and complicates tuning it).
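As a rough illustration of how retries at several levels can compound, here is a small Python sketch. Everything in it is an assumption: the layer names, timeouts, and retry counts are made-up numbers rather than the actual defaults of any OmniOS or Linux component, and the model (each level retries the level below a fixed number of times and gives up on an attempt at its own timeout) is a simplification.

```python
import math

DEADMAN_SECONDS = 1000  # zfs_deadman_synctime_ms from above, in seconds


def worst_case_seconds(layers):
    """Worst-case time before the top layer reports an error.

    layers is a list of (name, per_attempt_timeout_s, attempts) tuples,
    top layer first. All of these values are hypothetical inputs.
    """
    worst_below = math.inf  # a truly hung device never answers by itself
    for _name, timeout_s, attempts in reversed(layers):
        # Each attempt ends when this layer's own timeout expires or the
        # layer below gives up, whichever comes first.
        worst_below = attempts * min(timeout_s, worst_below)
    return worst_below


if __name__ == "__main__":
    # Made-up numbers, purely to show the compounding effect.
    stack = [("upper", 1000, 3), ("middle", 120, 5), ("lower", 60, 4)]
    total = worst_case_seconds(stack)
    print(total, "seconds; beats the deadman timer:", total < DEADMAN_SECONDS)
```

With these invented numbers, no single timeout is longer than the deadman limit, yet the stack as a whole can take longer than that to report an error. That is the sort of interaction that makes tuning this awkward.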

Written on 02 January 2017.

Last modified: Mon Jan 2 01:42:47 2017