2014-09-13
What can go wrong with polling for writability on blocking sockets
Yesterday I wrote about how our performance problem with amandad
was caused by amandad doing IO multiplexing wrong by only polling for
whether it could read from its input file descriptors and assuming
it could always write to its network sockets. But let's ask a question:
suppose that amandad was also polling for writability on those network
sockets. Would it work fine?
The answer is no, not without even more code changes, because
amandad's network sockets aren't set to be non-blocking. The
problem here is what it really means when poll() reports that
something is ready for write (or for that matter, for read).
Let me put it this way:
That poll() says a file descriptor is ready for writes doesn't mean that you can write an arbitrary amount of data to it without blocking.
When I put it this way, of course it can't. Can I write a gigabyte
to a network socket or a pipe without blocking? Pretty much any
kernel is going to say 'hell no'. Network sockets and pipes can
never instantly absorb arbitrary amounts of data; there's always a
limit somewhere. What poll()'s readiness indicator more or less
means is that you can now write some data without blocking. How
much data is uncertain.
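To make this concrete, here is a minimal Python sketch of poll() reporting writability and the kernel then accepting only part of a large write. The function name and the 16 MiB payload size are my own illustrative choices, and I put the socket in non-blocking mode purely so that the oversized send returns instead of hanging. It assumes a Unix system where select.poll is available.

```python
import select
import socket


def writable_but_short(payload_size=16 * 1024 * 1024):
    """Return (poll_said_writable, bytes_accepted) for one oversized send."""
    a, b = socket.socketpair()  # nothing ever reads from b, so a's buffer is finite
    try:
        # Non-blocking only so the demonstration returns instead of stalling.
        a.setblocking(False)
        poller = select.poll()
        poller.register(a, select.POLLOUT)
        events = poller.poll(1000)  # timeout is in milliseconds
        ready = any(ev & select.POLLOUT for _fd, ev in events)
        # poll() said "writable", but the kernel only takes what fits.
        sent = a.send(b"x" * payload_size)
        return ready, sent
    finally:
        a.close()
        b.close()


if __name__ == "__main__":
    ready, sent = writable_but_short()
    print("poll() reported writable:", ready)
    print("bytes accepted out of 16 MiB:", sent)
```

How many bytes are accepted depends on the kernel's socket buffer sizes, which is exactly the 'how much data is uncertain' part.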
The importance of non-blocking sockets is due to an API decision that Unix has made. Given that you can't write an arbitrary amount of data to a socket or a pipe without blocking, Unix has decided that by default when you write 'too much' you get blocked instead of getting a short write return (where you try to write N bytes and get told you wrote less than that). In order to not get blocked if you try a too-large write you must explicitly set your file descriptor to non-blocking mode; at this point you will either get a short write or just an error (EAGAIN or EWOULDBLOCK, if you're trying to write and there is no room at all).
(This is a sensible API decision for reasons beyond the scope of this entry. And yes, it's not symmetric with reading from sockets and pipes.)
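Both non-blocking outcomes can be seen with a pipe that nobody reads from. This is only a sketch under my own assumptions: the function name and the 1 MiB payload are hypothetical, and it relies on os.set_blocking, which is Unix-only. In Python the 'no room at all' error shows up as BlockingIOError.

```python
import os


def write_until_full(payload_size=1 << 20):
    """Write to an unread non-blocking pipe until the kernel refuses.

    Returns (counts, would_block): the byte count accepted by each
    successful write, and whether we stopped on EAGAIN/EWOULDBLOCK.
    """
    r, w = os.pipe()
    try:
        os.set_blocking(w, False)  # opt out of the default 'block me' behavior
        data = b"x" * payload_size
        counts = []
        while True:
            try:
                n = os.write(w, data)  # may be a short write
            except BlockingIOError:    # no room at all
                return counts, True
            counts.append(n)
    finally:
        os.close(r)
        os.close(w)


if __name__ == "__main__":
    counts, would_block = write_until_full()
    print("bytes accepted per write:", counts)
    print("stopped because the write would block:", would_block)
```

The first write is asked to take far more than a pipe buffer normally holds, so it should come back short, and a later write gets the error once the pipe is full.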
So if amandad just polled for writability but changed nothing
else in its behavior, it would almost certainly still wind up
blocking on writes to network sockets as it tried to stuff too
much down them. At most it would wind up blocked somewhat less
often because it would at least send some data immediately every
time it tried to write to the network.
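For illustration, here is roughly the shape that the 'even more code changes' take in general: non-blocking descriptors, a buffer of output the kernel hasn't accepted yet, and polling for writability only while that buffer is non-empty. This is a toy sketch and not amandad's actual code; the pump() function and its names are hypothetical, and a socketpair stands in for the real network connection so that one loop can play both sender and receiver.

```python
import select
import socket


def pump(payload):
    """Push payload through a socketpair with poll(); return the bytes received."""
    src, dst = socket.socketpair()
    src.setblocking(False)
    dst.setblocking(False)
    pending = memoryview(payload)  # output the kernel has not yet accepted
    received = bytearray()
    poller = select.poll()
    if len(pending) > 0:
        poller.register(src, select.POLLOUT)
    poller.register(dst, select.POLLIN)
    try:
        while len(received) < len(payload):
            for fd, ev in poller.poll():
                if fd == src.fileno() and ev & select.POLLOUT:
                    try:
                        n = src.send(pending)  # may accept only part of it
                    except BlockingIOError:
                        continue               # no room after all; poll again
                    pending = pending[n:]      # keep the unsent remainder
                    if len(pending) == 0:
                        # Nothing left to send: stop asking about writability.
                        poller.unregister(src)
                elif fd == dst.fileno() and ev & select.POLLIN:
                    try:
                        chunk = dst.recv(65536)
                    except BlockingIOError:
                        continue
                    received += chunk
        return bytes(received)
    finally:
        src.close()
        dst.close()


if __name__ == "__main__":
    data = bytes(range(256)) * 4096  # 1 MiB
    print("round trip intact:", pump(data) == data)
```

The important part is that a short write is treated as normal: whatever wasn't accepted stays in pending and goes out on a later POLLOUT, so the loop never sits inside a write while there is input waiting to be read.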
(The pernicious side of this particular bug is that whether it bites you in any visible way depends on how much network IO you try to do and how fast. If you send to the network (or to pipes) at a sufficiently slow rate, perhaps because your source of data is slow, you won't stall visibly on writes because there's always enough capacity for the data you're sending. Only when your send rate starts overwhelming the receiver will you actively block in writes.)
Sidebar: The value of serendipity (even if I was wrong)
Yesterday I mentioned that my realization about the core cause of
our amandad problem was sparked by remembering an apparently
unrelated thing. As it happens, it was my memory of reading Rusty
Russell's "POLLOUT doesn't mean write(2) won't block: Part II" that
started me on this whole chain. A few rusty neurons woke up and said
'wait, poll() and then long write() waits? I was reading about
that...' and off I went, even if my initial idea turned out to be
wrong about the real cause.
Had I not been reading Rusty Russell's blog I probably would have
missed noticing the anomaly and as a result wasted a bunch of time
at some point trying to figure out what the core problem was.
The write() issue is clearly in the air because Ewen McNeill also
pointed it out in a comment on yesterday's entry. This is a good thing; the odd write
behavior deserves to be better known so that it doesn't bite people.