Software should support configuring overall time limits
It's pretty common for software to support setting various sorts
of time limits on operations, often in extensive detail. You can
often set retry counts and so on as well. All of this is natural
because it generally maps quite well to the low level operations
that the software itself can set internal limits on, so you get
things like the OpenSSH client ConnectTimeout setting, which
basically controls how long ssh will wait for its connect()
system call to succeed.
More and more, I have come to feel that this way of configuring time limits is not as helpful in real life as you might think, and yesterday's events provide a convenient example of why. There are several problems.
First, low level detailed time limits, retry counts, and so on don't particularly correspond to what you often really want, namely a limit on how long the entire high level operation can take. We now want to put a limit on the total maximum IO delay that ZFS can ever see, but there's no direct control for that, only low-level ones that might do this indirectly if we can find them and sort through all of the layers involved.
Second, the low level limits can interact with each other in ways that are hard to see in advance and your actual timeouts can wind up (much) higher than you think. This is especially easy to have happen if you have multiple layers and there are retries involved. People who deal with disk subsystems have probably seen many cases where the physical disk retries a few times and then gives up, then the OS software tries a few times (each of which provokes another round of physical disk retries), and so on. Each of these layers might be perfectly sensible if it was the only layer in action, but put them all together and things go out to lunch and don't come back.
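
To make the compounding concrete, here is a minimal sketch of the arithmetic. The layer names, timeouts, and retry counts are all made-up numbers picked for illustration, not measurements from any real system, and the model assumes the simplest case where every layer retries the whole layer below it.

```python
from math import prod

# Hypothetical numbers, chosen only to show the arithmetic.
BASE_TIMEOUT_S = 30          # one attempt at the lowest layer
ATTEMPTS = {                 # total tries per layer, bottom to top
    "disk firmware": 3,
    "HBA driver": 3,
    "OS SCSI layer": 5,
}


def worst_case_seconds(base_timeout_s, attempts_per_layer):
    """Worst-case wait when every layer retries the whole layer below it."""
    return base_timeout_s * prod(attempts_per_layer)


total = worst_case_seconds(BASE_TIMEOUT_S, ATTEMPTS.values())
print(f"{total} seconds, or {total / 60:.1f} minutes")
```

With these invented values, three individually modest retry counts turn a 30 second timeout into more than twenty minutes of waiting.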
Third, it can actually be impossible to put together systems that are reliable and that also have reliable high level time limits given only low level time and retry controls. Hard disks are a good example of this. Disk IO operations, especially writes, can take substantial amounts of time (over 30 seconds) to complete under load in normal operation. And some of the time retrying a failed operation at least once will cause it to succeed, because the failure was purely a temporary glitch. But the combination of these settings, each individually necessary, can give you a too-high total timeout, leaving you with no good choice.
(Generally you wind up allowing too high total timeouts, because the other option is to risk a system that falls apart explosively under load as everything slows down.)
Real support for overall time limits requires more code, since you actually have to track and limit the total time operations take (and you may need to abort operations in mid flight when their timer expires). But it is often quite useful for system administrators, since it lets us control what we often really care about and need to limit. Life would probably be easier right now if I could just tell the OmniOS scsi_vhci multipathing driver to time out any IO that takes more than, say, five minutes and return an error for it.
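
As a rough sketch of what tracking the total time looks like, assuming a hypothetical `operation` callable that accepts a timeout in seconds and raises `TimeoutError` when it gives up (the function and exception names here are invented for illustration):

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the overall time limit runs out."""


def run_with_deadline(operation, total_timeout_s, attempt_timeout_s,
                      clock=time.monotonic):
    """Retry operation(timeout) until it succeeds or the overall deadline passes.

    `operation` is a hypothetical callable that takes a timeout in seconds
    and raises TimeoutError if it gives up on that attempt.
    """
    deadline = clock() + total_timeout_s
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            raise DeadlineExceeded(f"gave up after {total_timeout_s} seconds")
        try:
            # Never let a single attempt run past the overall deadline.
            return operation(min(attempt_timeout_s, remaining))
        except TimeoutError:
            continue  # retry, but only while overall time remains
```

This only works if the operation actually honors the timeout it is handed. Aborting an operation that is already in flight is the hard part, and this sketch does not attempt it.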
(Of course this points out that you may also want the low level limits too, either exposed externally or implemented internally with sensible values. If I'm going to tell a multipathing driver that it should time out IO after five minutes, I probably want to time out IOs to individual paths faster than that so the driver has time to retry an IO over an alternate path.)
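
One way to derive the lower level limit from the overall one is to divide the overall budget among the attempts you want to leave room for. This is only a sketch; the helper name, the idea of a reserve, and the numbers in the usage note are assumptions for illustration, not anything a real multipathing driver exposes.

```python
def per_path_timeout(total_timeout_s, n_paths, attempts_per_path=1,
                     reserve_s=0.0):
    """Split an overall IO time limit into a per-attempt limit for each path.

    reserve_s is time held back for failover and error handling overhead.
    """
    tries = n_paths * attempts_per_path
    if tries <= 0:
        raise ValueError("need at least one path and one attempt")
    usable = total_timeout_s - reserve_s
    if usable <= 0:
        raise ValueError("reserve uses up the whole time limit")
    return usable / tries
```

For example, `per_path_timeout(300, n_paths=2, attempts_per_path=2, reserve_s=20)` gives 70 seconds per attempt, which leaves room to try each of two paths twice inside a five minute overall limit.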
PS: Extension of this idea to other sorts of low level limits is left as an exercise for the reader.
ZFS may panic your system if you have an exceptionally slow IO
Today, one of our ZFS fileservers panicked. The panic itself is quite straightforward:
genunix: I/O to pool 'fs0-core-01' appears to be hung.
genunix: ffffff007a0c5a20 zfs:vdev_deadman+10b ()
[...]
genunix: ffffff007a0c5af0 zfs:spa_deadman+ad ()
[...]
/*
 * Look at the head of all the pending queues,
 * if any I/O has been outstanding for longer than
 * the spa_deadman_synctime we panic the system.
 */
The spa_deadman_synctime value comes from zfs_deadman_synctime_ms:
/*
 * [...]
 * Secondly, the value determines if an I/O is considered "hung".
 * Any I/O that has not completed in zfs_deadman_synctime_ms is
 * considered "hung" resulting in a system panic.
 */
uint64_t zfs_deadman_synctime_ms = 1000000ULL;
That's 1000 seconds, or 16 minutes and 40 seconds.
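
A simplified illustration of the check that the comment describes, plus the unit conversion. This is not the ZFS code; the function name and the representation of the pending queue as a list of issue times are assumptions made for the sketch.

```python
ZFS_DEADMAN_SYNCTIME_MS = 1_000_000  # the default quoted above


def oldest_io_is_hung(pending_issue_times_ms, now_ms,
                      synctime_ms=ZFS_DEADMAN_SYNCTIME_MS):
    """Return True if the IO at the head of a pending queue looks hung.

    pending_issue_times_ms is a hypothetical list of issue timestamps in
    queue order, so the head (index 0) is the oldest outstanding IO.
    """
    if not pending_issue_times_ms:
        return False
    return now_ms - pending_issue_times_ms[0] > synctime_ms


# Convert the default to minutes and seconds.
minutes, seconds = divmod(ZFS_DEADMAN_SYNCTIME_MS // 1000, 60)
print(f"{minutes} minutes and {seconds} seconds")
```

Note that only the oldest IO matters here. One request that has been stuck for longer than the limit is enough, no matter how many others are completing normally.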
By 'completed' I believe that ZFS includes 'has resulted in an error', including a timeout error from, for example, the SCSI system. Normally you would expect IO systems to time out IO requests well before 16 minutes, but apparently something in our multipathed iSCSI setup did not do this and so ZFS pushed the big red button of a panic.
(This is a somewhat dangerous assumption under some circumstances. If you have a ZFS pool built from files from an NFS mounted filesystem, for example, NFS will wait endlessly for server IO to complete. And while this is extreme, there are vaguely plausible situations where file-backed ZFS pools make some sense.)
Note that this behavior is completely unrelated to the ZFS pool
failmode setting. It can happen before ZFS reports any pool errors,
and it can happen when the only problem is a single IO to a single
underlying disk (and the pool has retained full redundancy throughout
and so on). All it needs is one hung IO to one device used by one
pool and your entire system can panic (and then sit there while it
slowly writes out a crash dump, if you have those configured).
However, I've decided that I'm not particularly upset by this. The fileserver was in some trouble before the panic (I assume due to IO problems to iSCSI backend disk(s)), and rebooting seems to have fixed things for now. At least some of the time, panicking and retrying from scratch is a better strategy than banging your head against the wall over and over; this time seems to be one of them.
(I might feel differently if we had important user level processes running on these machines, like database servers or the like.)
In the short term we're unlikely to change this deadman timeout or disable it. I'm more interested in trying to find out what our iSCSI IO timeouts actually are and see if we can lower them so that the kernel will spit out timeout errors well before that much time goes by (say a couple of minutes at the outside). Unfortunately there are a lot of levels and moving parts involved here, so things are likely to be complex (and compounding on each other).
Sidebar: The various levels I think we have in action here
From the top downwards: OmniOS scsi_vhci multipathing, OmniOS generic SCSI, OmniOS iSCSI initiator, our Linux iSCSI target, the generic Linux block and SCSI layers, the Linux mpt2sas driver, and then the physical SSDs involved. Probably some of these levels do some amount of retrying of timed out requests before they pass problems back to higher levels, which of course compounds this sort of issue (and complicates tuning it).