How not to do IO multiplexing, as illustrated by Amanda

September 12, 2014

Every so often I have a belated slow-motion realization about what's probably wrong with an otherwise mysterious problem. Sometimes this is even sparked by remembering an apparently unrelated thing I read in passing. As it happens, that happened the other day.

Let's rewind to this entry, where I wrote about what I'd discovered while looking into our slow Amanda backups. Specifically this was a dump of amandad handling multiple streams of backups at once, which we determined is the source of our slowness. In that entry I wrote in passing:

[Amandad is] also spending much more of its IO wait time writing the data rather than waiting for there to be more input, although the picture here is misleading because it's also making pollsys() calls and I wasn't tracking the time spent waiting in those [...]

This should have set off big alarm bells. If amandad is using poll(), why is it spending any appreciable amount of time waiting for writes to complete? After all the whole purpose of poll() et al is to only be woken when you can do work, so you should spend minimal time blocked in the actual IO functions. The unfortunate answer is that Amanda is doing IO multiplexing wrong, in that I believe it's only using poll() to check for readability on its input FDs, not writability on its output FDs. Instead it tacitly assumes that whenever it has data to read it can immediately write all of this data out with no fuss, muss, or delays.

(Just checking for writability on the network connections wouldn't be quite enough, of course, but that's another entry.)

The problem is that this doesn't necessarily work. You can easily have situations where one TCP stream will accept much more data than another one, or where all (or just most) TCP streams will accept only a modest amount of data at a time; this is especially likely when the TCP streams are connected to different processes on the remote end. If one remote process stalls its TCP stream can stop accepting much more data, at which point a large write to the stream may stall in turn, which stalls all amandad activity even if it could succeed, which stalls upstream activity that is trying to send data to amandad. What amandad's handling of multiplexing does is to put all of the data streams flowing through it at the mercy of whatever is the slowest write stream at any given point in time. If anything blocks everything can block, and sooner or later we seem to wind up in a situation where anything can and does block. The result is a stuttering stop-start process full of stalls that reduces the overall data flow significantly.

In short, what you don't want in good IO multiplexing is a situation where IO on stream A has to wait because you're stalled doing IO on stream B, and that is just what amandad has arranged here.

Correct multiplexing is complex (even in the case of a single flow) but the core of it is never overcommitting yourself. What amandad should be doing is only writing as much output at a time as any particular connection can take, buffering some data internally, and passing any back pressure up to the things feeding it data (by stopping reading its inputs when it cannot write output and it has too much buffered). This would ensure that the multiplexing does no harm in that it can always proceed if any of the streams can proceed.

(Splitting apart the multiplexing into separate processes (or threads) does the same thing; because each one is only handling a single stream, that stream blocking doesn't block any other stream.)

PS: good IO multiplexing also needs to be fair, ie if if multiple streams or file descriptors can all do IO at the current time then all of them do some IO, instead of a constant stream of ready IO on one stream starving IO for other streams. Generally the easiest way to do this is to have your code always process all ready file descriptors before returning to poll(). This is also the more CPU-efficient way to do it.

Written on 12 September 2014.
« The cause of our slow Amanda backups and our workaround
What can go wrong with polling for writability on blocking sockets »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Fri Sep 12 01:10:43 2014
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.