Software should support configuring overall time limits

January 2, 2017

It's pretty common for software to support setting various sorts of time limits on operations, often in extensive detail. You can often set retry counts and so on as well. All of this is natural because it generally maps quite well to the low level operations that the software itself can set internal limits on, so you get things like the OpenSSH client ConnectTimeout setting, which basically controls how long ssh will wait for its connect() system call to succeed.

More and more, I have come to feel that this way of configuring time limits is not as helpful in real life as you might think, and yesterday's events provide a convenient example of why. There are several problems. First, low level detailed time limits, retry counts, and so on don't particularly correspond to what you often really want, namely a limit on how long the entire high level operation can take. We now want to put a limit on the total maximum IO delay that ZFS can ever see, but there's no direct control for that, only low level ones that might do this indirectly if we can find them and sort through all of the layers involved.

Second, the low level limits can interact with each other in ways that are hard to see in advance, and your actual timeouts can wind up (much) higher than you think. This is especially easy to have happen if you have multiple layers and there are retries involved. People who deal with disk subsystems have probably seen many cases where the physical disk retries a few times and then gives up, then the OS software tries a few times (each of which provokes another round of physical disk retries), and so on. Each of these layers might be perfectly sensible if it were the only layer in action, but put them all together and things go out to lunch and don't come back.
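To put rough numbers on how stacked retries multiply, here is a small sketch of the worst case arithmetic. The `Layer` class, the layer names, and every number in the example are made up for illustration; they are not taken from any real disk, driver, or OS.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    attempts: int       # total tries this layer makes (1 means no retries)
    own_timeout: float  # seconds this layer itself waits per attempt


def worst_case_seconds(layers):
    """Worst case total time for a stack of layers, listed bottom-up.

    Assumes every attempt at one layer provokes a full round of
    attempts at the layer below it, as described above.
    """
    total = 0.0
    for layer in layers:
        # One attempt costs the whole worst case of everything below,
        # plus whatever this layer waits on its own.
        total = layer.attempts * (total + layer.own_timeout)
    return total


# Hypothetical stack: the disk tries 3 times at 10 seconds each, the OS
# driver retries the whole thing 3 times, and a multipath layer tries
# 2 paths.
stack = [
    Layer("disk", attempts=3, own_timeout=10.0),
    Layer("os driver", attempts=3, own_timeout=0.0),
    Layer("multipath", attempts=2, own_timeout=0.0),
]
print(worst_case_seconds(stack))
```

With these invented settings, a 10 second timeout at the bottom turns into three minutes at the top, even though no single layer retries more than three times.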

Third, it can actually be impossible to put together systems that are reliable and that also have reliable high level time limits given only low level time and retry controls. Hard disks are a good example of this. Under load in normal operation, disk IO operations (especially writes) can take a substantial amount of time to complete (over 30 seconds). And some of the time, retrying a failed operation at least once will cause it to succeed, because the failure was purely a temporary glitch. But the combination of a long per-operation timeout and at least one retry, each individually necessary, can give you a total timeout that is too high, leaving you with no good choice.

(Generally you wind up allowing too high total timeouts, because the other option is to risk a system that falls apart explosively under load as everything slows down.)

Real support for overall time limits requires more code, since you actually have to track and limit the total time operations take (and you may need to abort operations in mid flight when their timer expires). But it is often quite useful for system administrators, since it lets us control what we often really care about and need to limit. Life would probably be easier right now if I could just tell the OmniOS scsi_vhci multipathing driver to time out any IO that takes more than, say, five minutes and return an error for it.
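As a minimal sketch of what tracking the total time can look like, here is a retry loop that enforces one overall deadline on top of a per-try limit. Everything in it (`run_with_deadline`, `OverallTimeout`, the `operation(timeout)` calling convention) is hypothetical and not the API of any real driver or library. It also assumes the operation honours the timeout it is handed; actually aborting an operation in mid flight is the hard part and is not shown.

```python
import time


class OverallTimeout(Exception):
    """Raised when the whole operation runs out of time."""


def run_with_deadline(operation, total_limit, per_try_limit,
                      clock=time.monotonic):
    """Retry operation(timeout) until it succeeds or total_limit expires.

    operation must give up and raise TimeoutError once its timeout
    (in seconds) has passed.
    """
    deadline = clock() + total_limit
    last_error = None
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            raise OverallTimeout(
                f"gave up after {total_limit} seconds") from last_error
        try:
            # Never let a single try run past the overall deadline.
            return operation(min(per_try_limit, remaining))
        except TimeoutError as e:
            last_error = e
```

The caller gets one number that bounds the entire operation, and the per-try limit only decides how that budget is spent. There is no retry count at all here; how many retries happen falls out of the two time limits.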

(Of course this points out that you may also want the low level limits too, either exposed externally or implemented internally with sensible values. If I'm going to tell a multipathing driver that it should time out IO after five minutes, I probably want to time out IOs to individual paths faster than that so the driver has time to retry an IO over an alternate path.)
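One way to derive the low level limits from the overall one is to split whatever time remains among the paths that have not been tried yet. This is only a sketch under made-up assumptions: `multipath_io` and the `path(budget)` calling convention are invented for illustration and have nothing to do with how scsi_vhci actually works.

```python
import time


def multipath_io(paths, total_limit, clock=time.monotonic):
    """Try each path in turn within one overall time limit.

    Each path is a callable taking a time budget in seconds. It returns
    a result, or raises OSError (TimeoutError is a subclass) on failure.
    """
    deadline = clock() + total_limit
    for i, path in enumerate(paths):
        remaining = deadline - clock()
        if remaining <= 0:
            break
        # Share the remaining time evenly among the untried paths, so a
        # slow first path can't eat the time needed to try alternates.
        budget = remaining / (len(paths) - i)
        try:
            return path(budget)
        except OSError:
            continue
    raise TimeoutError("all paths failed within the overall limit")
```

With a five minute overall limit and three paths, the first path gets at most 100 seconds. A path that fails quickly hands its unused time on to the paths after it.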

PS: Extension of this idea to other sorts of low level limits is left as an exercise for the reader.

Comments on this page:

I’ve run into the opposite case more often.

I particularly remember dealing with some HTTP client which only allowed me to set a timeout for the full request/response cycle… which would mean aborting large or slow downloads. I didn’t want to lose any completable downloads ever (well, almost), no matter how long they took – I just wanted to limit how long a hopeless connection attempt could hang the process. But there was no way of specifying that. (I don’t remember what I ended up doing, I think I switched to another client.)

I’m not surprised to hear you having the opposite experience, though.

The moral of the story seems to be that software needs to systematically support configurable timeouts at all levels of an operation, not just (presumably) haphazardly expose a handful of timeouts where some underlying API makes that convenient. (I assume that’s mostly what drives the choice of configurable settings.)

By Tom at 2017-01-03 19:12:19:

The trick is to determine what the top-layer is? You talk about the layers in the disk, but on top of that, you can have a db writing with its own timeouts and retries, and then a remote server talking to the db, etc. Each of which views itself as the top layer.

By Anon at 2017-01-03 19:16:34:

Linux has the ability to set timeouts for disks because people became fed up with not knowing how long it would take for a disk to be marked as failed (due to stacked error recovery procedures). See the eh_deadline parameter.

By cks at 2017-01-04 10:49:33:

Tom: my view is that every layer should support optional total time limits. Certainly if a layer has any sort of timeouts and retries at all, it should also support a total time limit.

As far as Aristotle's issue goes, I agree, but I think we're also running into a broad question of what timeouts are for. There are at least two purposes for timeouts: to make sure an operation is 'making progress', and to limit how long an operation takes regardless of how much progress it's made (total time limits are clearly one version of the latter). People designing timeouts will ideally think about both cases and not lock you in to only having one or the other sort of timeout.
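The two purposes can be kept as separate, optional knobs. Here is a sketch of a download loop with both; `download` and the `read_chunk(timeout)` convention are hypothetical, not the interface of any particular HTTP client.

```python
import time


def download(read_chunk, idle_limit=None, total_limit=None,
             clock=time.monotonic):
    """Collect chunks until end of data, with two independent limits.

    read_chunk(timeout) returns bytes, b"" at end of data, or raises
    TimeoutError if nothing arrived within timeout seconds (None means
    wait forever).
    idle_limit  bounds the wait for each chunk ('making progress').
    total_limit bounds the whole transfer regardless of progress.
    """
    start = clock()
    chunks = []
    while True:
        timeout = idle_limit
        if total_limit is not None:
            remaining = total_limit - (clock() - start)
            if remaining <= 0:
                raise TimeoutError("total time limit exceeded")
            timeout = remaining if timeout is None else min(timeout, remaining)
        data = read_chunk(timeout)
        if not data:
            return b"".join(chunks)
        chunks.append(data)
```

Setting only `idle_limit` gives the behaviour Aristotle wanted: a slow but steady download is never aborted, while a connection that goes silent is. Setting only `total_limit` gives the overall limit the entry argues for.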

The information about eh_deadline is interesting; thanks. Sadly I don't know if the Linux side is even the problem or if it's returning errors reasonably promptly but the OmniOS side is fiddling around with retries and shifting paths and so on. I'm not sure if we have enough information logged to work out where the timeout really got stuck.

Written on 02 January 2017.


Last modified: Mon Jan 2 22:46:12 2017