Software should support configuring overall time limits

January 2, 2017

It's pretty common for software to support setting various sorts of time limits on operations, often in extensive detail. You can often set retry counts and so on as well. All of this is natural because it generally maps quite well to the low level operations that the software itself can set internal limits on, so you get things like the OpenSSH client ConnectTimeout setting, which basically controls how long ssh will wait for its connect() system call to succeed.

More and more, I have come to feel that this way of configuring time limits is not as helpful in real life as you might think, and yesterday's events provide a convenient example for why. There are several problems. First, low level detailed time limits, retry counts, and so on don't particularly correspond to what you often really want, namely a limit on how long the entire high level operation can take. We now want to put a limit on the total maximum IO delay that ZFS can ever see, but there's no direct control for that, only low-level ones that might do this indirectly if we can find them and sort through all of the layers involved.

Second, the low level limits can interact with each other in ways that are hard to see in advance and your actual timeouts can wind up (much) higher than you think. This is especially easy to have happen if you have multiple layers and there are retries involved. People who deal with disk subsystems have probably seen many cases where the physical disk retries a few times and then gives up, then the OS software tries a few times (each of which provokes another round of physical disk retries), and so on. Each of these layers might be perfectly sensible if it was the only layer in action, but put them all together and things go out to lunch and don't come back.

Third, it can actually be impossible to put together systems that are reliable and that also have reliable high level time limits given only low level time and retry controls. HDs are a good example of this. Disk IO operations, especially writes, can take substantial amounts of time to complete under load in normal operation (over 30 seconds). And some of the time retrying a failed operation at least once will cause it to succeed, because the failure was purely a temporary fluctuation and glitch. But the combination of these settings, each individually necessary, can give you a too-high total timeout, leaving you with no good choice.

(Generally you wind up allowing too high total timeouts, because the other option is to risk a system that falls apart explosively under load as everything slows down.)

Real support for overall time limits requires more code, since you actually have to track and limit the total time operations take (and you may need to abort operations in mid flight when their timer expires). But it is often quite useful for system administrators, since it lets us control what we often really care about and need to limit. Life would probably be easier right now if I could just tell the OmniOS scsi_vhci multipathing driver to time out any IO that takes more than, say, five minutes and return an error for it.

(Of course this points out that you may also want the low level limits too, either exposed externally or implemented internally with sensible values. If I'm going to tell a multipathing driver that it should time out IO after five minutes, I probably want to time out IOs to individual paths faster than that so the driver has time to retry an IO over an alternate path.)

PS: Extension of this idea to other sorts of low level limits is left as an exercise for the reader.

Written on 02 January 2017.
« ZFS may panic your system if you have an exceptionally slow IO
Make sure that (system) email works on every machine »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Mon Jan 2 22:46:12 2017
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.