Why a network connection becoming writable when it succeeds makes sense

January 19, 2020

When I talked about how Go deals with canceling network connection attempts, I mentioned that it's common for the underlying operating system to signal you that a TCP connection (or more generally a network connection) has been successfully made by letting it become writable. On the surface this sounds odd, and to some degree it is, but it also falls out of what the operating system knows about a network connection before and after it's made. Also, in practice there is a certain amount of history tied up in this particular interface.

If we start out thinking about being told about events, we can ask what events you would see when a TCP connection finishes the three way handshake and becomes established. The connection is now established (one event), and you can generally now send data to the remote end, but usually there's no data from the remote end to receive so you would not get an event for that. So we would expect a 'connection is established' event and a 'you can send data' event. If we want a more compact encoding of events, it's quite tempting to merge these two together into one event and say that a new TCP connection becoming writable is a sign that its three way handshake has now completed.

(And you certainly wouldn't expect to see a 'you can send data' event before the three way handshake finishes.)

The history is that a lot of the fundamental API of asynchronous network IO comes from BSD Unix and spread from there (even to non-Unix systems, for various reasons). BSD Unix did not use a more complex 'stream of events' API to communicate information from the kernel to your program; instead it used simple and easy to implement kernel APIs (because this was the early 1980s). The BSD Unix API was select(), which passes information back and forth using bitmaps; one bitmap for sending data, one bitmap for receiving data, and one bitmap for 'exceptions' (whatever they are). In this API, the simplest way for the kernel to tell programs that the three way handshake has finished is to set the relevant bit in the 'you can send data' bitmap. The kernel's got to set that bit anyway, and if it sets that bit and also sets a bit in the 'exceptions' bitmap it needs to do more work (and so will programs; in fact some of them will just rely on the writability signal, because it's simpler for them).
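To make the three bitmaps concrete, here is a minimal sketch in Python, whose select.select() takes the same three sets as the BSD call. The function name select_after_connect and the 5 second timeout are assumptions made up for this illustration. It starts a non-blocking connect and reports which of the three sets the kernel marks the socket in.

```python
import select
import socket


def select_after_connect(addr, timeout=5.0):
    """Start a non-blocking connect, then report which select() sets fire.

    Hypothetical helper for illustration only; it does not say whether
    the connection attempt actually succeeded.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        # Usually returns EINPROGRESS; the handshake continues in the kernel.
        sock.connect_ex(addr)
        # Ask about all three sets: receive, send, and 'exceptions'.
        r, w, x = select.select([sock], [sock], [sock], timeout)
        return {"readable": bool(r), "writable": bool(w), "exception": bool(x)}
    finally:
        sock.close()
```

Against a listening server that sends nothing on its own, you would expect only the 'writable' entry to be set once the handshake finishes.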

Once you're doing this for TCP connections, it generally makes sense for all connections regardless of type. There are likely to be very few stream connection types where it makes sense to signal that you can now send (more) data partway through the connection being established, and that's the only case where this use of signaling writability gets in the way.

Comments on this page:

By Joker_vD at 2020-01-20 07:13:38:

Ah, but the "on successful connection establishment, socket becomes writable" is not the whole story — how do you signal "failed to establish connection, socket is now junk"?

I don't know how the original BSD did this, but I know how modern Linux and Windows do it. Of course, they do it differently, and of course neither of them conforms to POSIX.

So, what does Linux do? When connect(3) fails, Linux marks the socket as both readable and writeable, but without any exception pending (i.e., it's reported in readfds and writefds, but not in errorfds). The rationale for that, I believe, is that reading from or writing to such a socket doesn't block but fails immediately; fair enough. But why is this socket not reported in errorfds? POSIX clearly states that a failed connect leaves the socket with a pending exception that must be reported by select(3) in errorfds... So anyway, having your socket in writefds only means that the connection attempt finished one way or another; it doesn't tell you whether it succeeded.

On Windows, the situation is entirely opposite: on connect's failure, the socket is marked as non-readable, non-writeable, with an exception pending (i.e., reported only in errorfds). I guess the rationale was that this would allow one to easily distinguish successful connection attempts from failed ones: if it's writeable, it's established; if it has an exception pending, it has failed; and these two conditions are mutually exclusive.

So how does one distinguish connection failure from success on Linux? One could try checking readability, but that's unreliable because the remote host could have sent you some data in the SYN-response, so the proper way is to use getsockopt(sock, SOL_SOCKET, SO_ERROR). In fact, the same call is required on Windows if you want to learn the precise error, so that's what you should do on both platforms anyway, probably without even bothering to check writefds/errorfds; it's guaranteed to return 0 after a successful connection attempt.
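A minimal sketch of that SO_ERROR check, assuming Python's socket module on a Unix-like system. The function name nonblocking_connect, the 5 second timeout, and the choice to report a timeout as ETIMEDOUT are all assumptions made for illustration. It waits on both the write and exception sets, so it should wake up under either the Linux or the Windows behaviour described above, then asks the socket itself how the attempt ended.

```python
import errno
import select
import socket


def nonblocking_connect(addr, timeout=5.0):
    """Return 0 if the connection succeeded, otherwise an errno value.

    Hypothetical helper for illustration; the socket is closed before
    returning, so this only reports the outcome of the attempt.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        rc = sock.connect_ex(addr)
        if rc == 0:
            return 0  # connected immediately
        if rc not in (errno.EINPROGRESS, errno.EWOULDBLOCK):
            return rc  # failed immediately
        # Wait until the attempt finishes one way or another.
        _, writable, errored = select.select([], [sock], [sock], timeout)
        if not writable and not errored:
            return errno.ETIMEDOUT  # our own choice for "still pending"
        # SO_ERROR is 0 on success, or the errno of the failed attempt.
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    finally:
        sock.close()
```

Called against a port with nothing listening, this would typically return errno.ECONNREFUSED rather than 0, even on a system where the socket shows up as writable in both cases.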

It seems that this piece of trivia is something that almost every implementor of a new-fangled multiplatform network library learns only after Windows users report "establishing connection to unavailable hosts hangs for a really bloody long time, please fix", and yet I've never seen a single blog post about it.

Written on 19 January 2020.

Last modified: Sun Jan 19 01:09:43 2020