How not to do IO multiplexing, as illustrated by Amanda

September 12, 2014

Every so often I have a belated slow-motion realization about what's probably wrong with an otherwise mysterious problem. Sometimes this is even sparked by remembering an apparently unrelated thing I read in passing. As it happens, that happened the other day.

Let's rewind to this entry, where I wrote about what I'd discovered while looking into our slow Amanda backups. Specifically this was a dump of amandad handling multiple streams of backups at once, which we determined is the source of our slowness. In that entry I wrote in passing:

[Amandad is] also spending much more of its IO wait time writing the data rather than waiting for there to be more input, although the picture here is misleading because it's also making pollsys() calls and I wasn't tracking the time spent waiting in those [...]

This should have set off big alarm bells. If amandad is using poll(), why is it spending any appreciable amount of time waiting for writes to complete? After all the whole purpose of poll() et al is to only be woken when you can do work, so you should spend minimal time blocked in the actual IO functions. The unfortunate answer is that Amanda is doing IO multiplexing wrong, in that I believe it's only using poll() to check for readability on its input FDs, not writability on its output FDs. Instead it tacitly assumes that whenever it has data to read it can immediately write all of this data out with no fuss, muss, or delays.

(Just checking for writability on the network connections wouldn't be quite enough, of course, but that's another entry.)

The problem is that this doesn't necessarily work. You can easily have situations where one TCP stream will accept much more data than another one, or where all (or just most) TCP streams will accept only a modest amount of data at a time; this is especially likely when the TCP streams are connected to different processes on the remote end. If one remote process stalls its TCP stream can stop accepting much more data, at which point a large write to the stream may stall in turn, which stalls all amandad activity even if it could succeed, which stalls upstream activity that is trying to send data to amandad. What amandad's handling of multiplexing does is to put all of the data streams flowing through it at the mercy of whatever is the slowest write stream at any given point in time. If anything blocks everything can block, and sooner or later we seem to wind up in a situation where anything can and does block. The result is a stuttering stop-start process full of stalls that reduces the overall data flow significantly.

In short, what you don't want in good IO multiplexing is a situation where IO on stream A has to wait because you're stalled doing IO on stream B, and that is just what amandad has arranged here.

Correct multiplexing is complex (even in the case of a single flow) but the core of it is never overcommitting yourself. What amandad should be doing is only writing as much output at a time as any particular connection can take, buffering some data internally, and passing any back pressure up to the things feeding it data (by stopping reading its inputs when it cannot write output and it has too much buffered). This would ensure that the multiplexing does no harm in that it can always proceed if any of the streams can proceed.

(Splitting apart the multiplexing into separate processes (or threads) does the same thing; because each one is only handling a single stream, that stream blocking doesn't block any other stream.)

PS: good IO multiplexing also needs to be fair, ie if if multiple streams or file descriptors can all do IO at the current time then all of them do some IO, instead of a constant stream of ready IO on one stream starving IO for other streams. Generally the easiest way to do this is to have your code always process all ready file descriptors before returning to poll(). This is also the more CPU-efficient way to do it.


Comments on this page:

By Ewen McNeill at 2014-09-12 03:08:35:

It's possible (ie, I haven't gone looking at the source) that Amanda is actually poll()ing for "write possible" as well, but mistakenly assuming that write() behaves like read(), ie that it will do what it can, and then return. However without O_NONBLOCK, even when poll() indicates "write possible" a write of more than one byte may still block (and may also return without writing everything -- eg due to signals).

Even Linux kernel developers, who work on networking, can be surprised by this.

Ewen

PS: The solution is to set O_NONBLOCK if you want non-blocking behaviour. Even (especially!) if you're using select() or poll() or whatever to try to avoid blocking.

By mtk@acm.org at 2014-09-12 12:04:02:

i may be exposing my age :-) but i thought linux file systems always returned 'ready for writing' from select/poll? is that not the case any more?

By cks at 2014-09-12 14:12:22:

I initially thought that Amanda had the 'poll for writing on a blocking socket' problem and I was all set to write a nice entry about it, but then I made the mistake of looking at the source to be sure. It turns out that as far as I can tell amandad is really only polling for readability on its input pipes and then just blindly writing everything it gets to the output network sockets. The code is pretty explicit about its 'read input, write all of it out' behavior, although the actual polling behavior happens down in the depths of GLib and is a bit hard to sort out.

(That modern versions of Amanda use GLib was an annoying surprise to us. Among other things it makes them hard to compile on Solaris derived machines and ties Amanda versions to specific GLib versions.)

mtk@acm.org: You're correct about how polling on files behaves, but the situation here with Amanda doesn't have any of that; the daemon in question is just reading from pipes and writing to the network (both of which have useful poll() results).

Written on 12 September 2014.
« The cause of our slow Amanda backups and our workaround
What can go wrong with polling for writability on blocking sockets »

Page tools: View Source, View Normal, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Fri Sep 12 01:10:43 2014
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.