What can go wrong with polling for writability on blocking sockets

September 13, 2014

Yesterday I wrote about how our performance problem with amandad was caused by amandad doing IO multiplexing wrong: it only polled for whether it could read from its input file descriptors and assumed it could always write to its network sockets. But let's ask a question: suppose that amandad was also polling for writability on those network sockets. Would it work fine?

The answer is no, not without even more code changes, because amandad's network sockets aren't set to be non-blocking. The problem here is what it really means when poll() reports that something is ready for write (or for that matter, for read). Let me put it this way:

That poll() says a file descriptor is ready for writes doesn't mean that you can write an arbitrary amount of data to it without blocking.

When I put it this way, of course it can't. Can I write a gigabyte to a network socket or a pipe without blocking? Pretty much any kernel is going to say 'hell no'. Network sockets and pipes can never instantly absorb arbitrary amounts of data; there's always a limit somewhere. What poll()'s readiness indicator more or less means is that you can now write some data without blocking. How much data is uncertain.

The importance of non-blocking sockets is due to an API decision that Unix has made. Given that you can't write an arbitrary amount of data to a socket or a pipe without blocking, Unix has decided that by default when you write 'too much' you get blocked instead of getting a short write return (where you try to write N bytes and get told you wrote less than that). In order to not get blocked if you try a too large write you must explicitly set your file descriptor to non-blocking mode; at this point you will either get a short write or, if there is no room at all, just an error (EAGAIN or EWOULDBLOCK).
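To make the short write and the 'no room at all' error concrete, here is a minimal Python sketch. It assumes a Unix system where select.poll() is available and uses a local socketpair() as a stand-in for a network socket; the try_write() helper and the 16 MiB payload size are made up for illustration and have nothing to do with amandad's actual code.

```python
import select
import socket


def try_write(sock, data):
    """Attempt one write on a non-blocking socket.

    Returns the number of bytes the kernel accepted, or 0 if there was
    no room at all (the EAGAIN/EWOULDBLOCK case).
    """
    try:
        return sock.send(data)
    except BlockingIOError:
        return 0


# A connected pair of local sockets; nothing ever reads from 'b',
# so whatever we write to 'a' just piles up in kernel buffers.
a, b = socket.socketpair()
a.setblocking(False)

# Ask poll() whether 'a' is ready for writing right now.
poller = select.poll()
poller.register(a, select.POLLOUT)
events = poller.poll(0)
writable = bool(events and events[0][1] & select.POLLOUT)

# Far more data than any socket buffer will take in one go.
payload = b"x" * (16 * 1024 * 1024)

# First write: poll() said 'ready', but we only get a short write.
first = try_write(a, payload)

# Keep writing until the kernel refuses to take anything more.
total = first
while True:
    n = try_write(a, payload)
    if n == 0:
        break
    total += n

print("poll() reported writable:", writable)
print("first write accepted", first, "of", len(payload), "bytes")
print("total accepted before there was no room:", total)

a.close()
b.close()
```

The exact numbers depend on the kernel and its socket buffer settings. The point is only that 'ready for write' turned into a partial write, and that with a blocking socket the same send() would simply have stalled until someone read from the other end.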

(This is a sensible API decision for reasons beyond the scope of this entry. And yes, it's not symmetric with reading from sockets and pipes.)

So if amandad just polled for writability but changed nothing else in its behavior, it would almost certainly still wind up blocking on writes to network sockets as it tried to stuff too much down them. At most it would wind up blocked somewhat less often because it would at least send some data immediately every time it tried to write to the network.
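As a rough illustration of what 'more code changes' means, the following Python sketch puts the socket in non-blocking mode, waits in poll() for writability, writes whatever the kernel will take, and keeps the remainder for the next round. It is a sketch under assumptions, not how amandad does or should do it: send_all_with_poll(), drain(), the 5 second timeout, and the 4 MiB payload are all hypothetical names and values, and a reader thread on a local socketpair() plays the part of the network peer.

```python
import select
import socket
import threading


def send_all_with_poll(sock, data, timeout_ms=5000):
    """Write all of 'data' to 'sock' without ever blocking inside send().

    Returns how many send() calls made progress. Raises TimeoutError if
    the socket does not become writable within timeout_ms.
    """
    sock.setblocking(False)
    poller = select.poll()
    poller.register(sock, select.POLLOUT)
    pending = memoryview(data)  # the part we have not managed to write yet
    writes = 0
    while len(pending) > 0:
        # Wait until the kernel says we can write *some* data.
        if not poller.poll(timeout_ms):
            raise TimeoutError("peer is not draining the socket")
        try:
            n = sock.send(pending)  # may well be a short write
        except BlockingIOError:
            continue  # no room after all; go back to poll()
        pending = pending[n:]
        writes += 1
    return writes


def drain(sock, counter):
    """Stand-in for the remote peer: read until EOF, counting bytes."""
    while True:
        chunk = sock.recv(65536)
        if not chunk:
            break
        counter[0] += len(chunk)


left, right = socket.socketpair()
received = [0]
reader = threading.Thread(target=drain, args=(right, received))
reader.start()

payload = b"x" * (4 * 1024 * 1024)
writes = send_all_with_poll(left, payload)

left.close()   # signals EOF to the reader
reader.join()
right.close()

print("sent", len(payload), "bytes in", writes, "partial writes")
print("peer received", received[0], "bytes")
```

In a real multiplexing program the pending data would live in a per-connection buffer and the POLLOUT check would sit in the same poll() loop as the read side, so that a slow receiver only delays its own connection instead of the whole process.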

(The pernicious side of this particular bug is that whether it bites you in any visible way depends on how much network IO you try to do and how fast. If you send to the network (or to pipes) at a sufficiently slow rate, perhaps because your source of data is slow, you won't stall visibly on writes because there's always enough capacity for the data you're sending. Only when your send rates start overwhelming the receiver will you actively block in writes.)

Sidebar: The value of serendipity (even if I was wrong)

Yesterday I mentioned that my realization about the core cause of our amandad problem was sparked by remembering an apparently unrelated thing. As it happens, it was my memory of reading Rusty Russell's 'POLLOUT doesn't mean write(2) won't block: Part II' that started me on this whole chain. A few rusty neurons woke up and said 'wait, poll() and then long write() waits? I was reading about that...' and off I went, even if my initial idea turned out to be wrong about the real cause. Had I not been reading Rusty Russell's blog I probably would have missed noticing the anomaly and as a result wasted a bunch of time at some point trying to figure out what the core problem was.

The write() issue is clearly in the air because Ewen McNeill also pointed it out in a comment on yesterday's entry. This is a good thing; the odd write behavior deserves to be better known so that it doesn't bite people.

Written on 13 September 2014.