How not to do IO multiplexing, as illustrated by Amanda
Every so often I have a belated slow-motion realization about what's probably wrong with an otherwise mysterious problem. Sometimes this is even sparked by remembering an apparently unrelated thing I read in passing. As it happens, that happened the other day.
Let's rewind to this entry, where
I wrote about what I'd discovered while looking into our slow
Amanda backups. Specifically
this was a dump of
amandad handling multiple streams of backups
at once, which we determined is the source of our slowness. In that entry I wrote in passing:
[Amandad is] also spending much more of its IO wait time writing the data rather than waiting for there to be more input, although the picture here is misleading because it's also making
pollsys()calls and I wasn't tracking the time spent waiting in those [...]
This should have set off big alarm bells. If
amandad is using
poll(), why is it spending any appreciable amount of time waiting
for writes to complete? After all the whole purpose of
et al is to only be woken when you can do work, so you should spend
minimal time blocked in the actual IO functions. The unfortunate
answer is that Amanda is doing IO multiplexing wrong, in that I
believe it's only using
poll() to check for readability on its
input FDs, not writability on its output FDs. Instead it tacitly
assumes that whenever it has data to read it can immediately write
all of this data out with no fuss, muss, or delays.
(Just checking for writability on the network connections wouldn't be quite enough, of course, but that's another entry.)
The problem is that this doesn't necessarily work. You can easily have situations where one TCP stream will accept much more data than another one, or where all (or just most) TCP streams will accept only a modest amount of data at a time; this is especially likely when the TCP streams are connected to different processes on the remote end. If one remote process stalls its TCP stream can stop accepting much more data, at which point a large write to the stream may stall in turn, which stalls all amandad activity even if it could succeed, which stalls upstream activity that is trying to send data to amandad. What amandad's handling of multiplexing does is to put all of the data streams flowing through it at the mercy of whatever is the slowest write stream at any given point in time. If anything blocks everything can block, and sooner or later we seem to wind up in a situation where anything can and does block. The result is a stuttering stop-start process full of stalls that reduces the overall data flow significantly.
In short, what you don't want in good IO multiplexing is a situation
where IO on stream A has to wait because you're stalled doing IO on
stream B, and that is just what
amandad has arranged here.
Correct multiplexing is complex (even in the case of a single flow) but the core of it is never overcommitting yourself. What amandad should be doing is only writing as much output at a time as any particular connection can take, buffering some data internally, and passing any back pressure up to the things feeding it data (by stopping reading its inputs when it cannot write output and it has too much buffered). This would ensure that the multiplexing does no harm in that it can always proceed if any of the streams can proceed.
(Splitting apart the multiplexing into separate processes (or threads) does the same thing; because each one is only handling a single stream, that stream blocking doesn't block any other stream.)
PS: good IO multiplexing also needs to be fair, ie if if multiple
streams or file descriptors can all do IO at the current time then all
of them do some IO, instead of a constant stream of ready IO on one
stream starving IO for other streams. Generally the easiest way to do
this is to have your code always process all ready file descriptors
before returning to
poll(). This is also the more CPU-efficient way to