Wandering Thoughts archives

2011-06-20

It's long since time for languages to provide sensible network IO

You may not have noticed, but stream-based network IO is famously not like regular file IO; well, mostly network read IO. A read() from a stream socket can return less data than you asked for even when more is on the way, and believing that it behaves like file IO is a common socket programming error.

A great deal of the time, programs and programmers do not want to care about this. When they say 'read ten bytes', they want to get back exactly ten bytes (assuming that the stream is not closed on them) and they do not care how long they have to wait for all ten bytes to show up. In most cases, if you return fewer than ten bytes all they will do is turn around and immediately try to read more bytes.

(Programmers who actively want short reads tend to set their network sockets to nonblocking mode and do other special things.)
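To make the re-read dance concrete, here is a minimal sketch of the loop that programmers keep rewriting by hand today, assuming a blocking socket (the function name is my own):

```python
import socket

def read_exactly(sock, n):
    # Loop until exactly n bytes have arrived; recv() is allowed
    # to hand back fewer bytes than we asked for.
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            # The stream was closed on us before we got everything.
            raise EOFError(f"connection closed after {len(buf)} of {n} bytes")
        buf += chunk
    return buf
```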

By now it is far too late to change the base behavior of systems. But if your language has a network IO package, it is high time that you provided not just a read() operation but also a readall() operation to go with it. In fact, I would go so far as to say that what you should really provide is read() and readshort(); readshort() would be the current 'read whatever the OS says is there', and read() would do 'read all' unless the socket was in non-blocking mode.

(If you have a moral objection to doing this in your basic network support layer, please provide it as a common mixin or simple additional layer that is easy to stack on top of a bare network socket.)
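A minimal sketch of what such a stacked-on layer could look like; the class and method names here are my own invention, not any existing API:

```python
import socket

class StreamSocket:
    """Hypothetical error-shielding layer over a blocking socket:
    read() loops until it has everything asked for, while
    readshort() is the traditional 'whatever the OS has' recv()."""

    def __init__(self, sock):
        self._sock = sock

    def readshort(self, n):
        # Read whatever is currently available, up to n bytes.
        return self._sock.recv(n)

    def read(self, n):
        # Read all n bytes, re-reading as needed; a short result
        # means the stream was closed early.
        buf = b""
        while len(buf) < n:
            chunk = self._sock.recv(n - len(buf))
            if not chunk:
                break
            buf += chunk
        return buf
```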

Such an interface would be an error-shielding interface. It would avoid a class of errors, or at least avoid requiring programmers to repeatedly write the same code to do network re-reading correctly. In almost all cases the overall code would get simpler; either the higher layer wouldn't have to worry about this any more or programmers using the higher layer on network streams wouldn't have to shim something to do re-reads in between the higher layer and the actual network socket.

It's quite possible that making this change would expose the fact that you need additional interfaces, for example a 'read until you see character X' interface for simple (text) line-based network protocols like SMTP.

(Simple servers for such protocols mostly work today because read() almost always returns a full line and only a single line.)
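A sketch of what a 'read until you see X' interface might look like; a real version would keep the leftover bytes buffered internally for the next call, which this hypothetical function pushes onto the caller:

```python
import socket

def read_until(sock, delim=b"\r\n", maxlen=64 * 1024):
    # Keep reading until the delimiter shows up, guarding against
    # a peer that never sends one.
    buf = b""
    while delim not in buf:
        if len(buf) > maxlen:
            raise ValueError("no delimiter within maxlen bytes")
        chunk = sock.recv(4096)
        if not chunk:
            raise EOFError("stream closed before delimiter")
        buf += chunk
    line, _, leftover = buf.partition(delim)
    # The caller must hold on to 'leftover'; it is the start of
    # the next line (or the next protocol message).
    return line + delim, leftover
```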

A similar thing can and should be done for write() on any platform where a write to a blocking network socket doesn't necessarily write out all of the data. (This is not all of them.)
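The write-side loop is the same shape as the read-side one; in Python, socket.sendall() already provides this, and a hand-written version looks like:

```python
import socket

def write_all(sock, data):
    # Keep calling send() until everything has been written;
    # send() may accept only part of the buffer.
    view = memoryview(data)
    while len(view):
        sent = sock.send(view)
        view = view[sent:]
```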

(This grump is brought to you by me having to deal with this issue yet again.)

By the way and as a side note, this means that a great many simple examples of network programming in high level languages are wrong. By ignoring this issue (either through ignorance or a desire for simplicity) they lure programmers into writing subtly erroneous network code, code that works most of the time but not always.

(That not ignoring this issue would make your simple network IO examples 'too complex' is in fact a sign that your language needs a simpler interface to this issue.)

programming/SensibleNetworkIO written at 13:18:08

What I need for stream decoding and encoding binary protocols

I've recently started poking a bit at my milter project. The first part of this is to write something that can talk the milter protocol and thus decode and encode its binary messages. Writing binary parsers and generators by hand is annoying; ideally you'd write a declarative description of all of the binary messages, feed it to a package or a library, and get a codec back. So I gave it a spin with a Python package nominally intended for this purpose. The experience was educational but not successful, ultimately because I was asking the package to do things it wasn't really designed to do.

To summarize my views, the package was primarily designed as a packet-oriented reader; by packet-oriented I mean that it could assume that it had the whole object to decode and nothing but the object. What I need is a codec that works on streams, ideally without trusting them to be well-formed. This calls for some capabilities that the Python package was missing, and based on my experience to date here is what I think they are.

A stream-capable decoder system needs to be able to do two additional things over a packet-based decoder, because when you are dealing with streams you can have either extra data (from the next structure) or not enough data. Thus your decoder must be able to tell you how much of the buffer it consumed in the process of a successful structure decode, and it must be able to say 'structure incomplete, read more data and try again'.

(Stream decoding also puts some requirements on the message structure, but I don't think a package needs to check this explicitly. You're a protocol codec package, not a protocol verifier; you're entitled to assume that the protocol itself works for its declared purpose.)
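As a concrete sketch of those two capabilities, here is a hand-written decoder for a simple length-prefixed framing; the framing, names, and exception here are hypothetical, not from any particular package:

```python
import struct

class Incomplete(Exception):
    """Not enough buffered data yet; read more and call again."""

def decode_message(buf):
    # Decode one message framed as: uint32 length, then that many
    # payload bytes. Returns (payload, bytes_consumed) so the caller
    # can trim its buffer; raises Incomplete on a partial message.
    if len(buf) < 4:
        raise Incomplete
    (mlen,) = struct.unpack_from(">I", buf, 0)
    if len(buf) < 4 + mlen:
        raise Incomplete
    return buf[4:4 + mlen], 4 + mlen
```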

What you need to be a genuine bidirectional codec instead of mostly a reader is some way to express two-way field dependencies. To make this concrete, I'll use the milter protocol itself. All milter messages have the following general form:

uint32   len
char     cmd
char     data[len-1]

When you decode this message, the data portion depends on len; the Python package could express this dependency and automatically determine how big data was when parsing the message. But when you encode this message the len portion depends on data, and the Python package could not express this dependency (as far as I could see); I would have had to compute it by hand and supply it explicitly when I asked the package to build the encoded message.

(The Ruby BinData package has support for this, as an example of how it can be done.)
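For the message layout above, a hand-written Python codec that handles the len/data dependency in both directions might look like this (the function names are mine):

```python
import struct

def encode_milter(cmd, data):
    # len covers cmd plus data, so at encode time it is derived
    # from data instead of being supplied by the caller.
    return struct.pack(">I", len(data) + 1) + cmd + data

def decode_milter(buf):
    # In the decode direction the dependency runs the other way:
    # len tells us how big data is.
    (mlen,) = struct.unpack_from(">I", buf, 0)
    cmd = buf[4:5]
    data = buf[5:4 + mlen]
    return cmd, data
```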

A good decoder checks that the structures are well formed. For example, the data contents of the milter message for an SMTP connection look like this:

char     hostname[]
char     family
uint16   port
char     address[]

(The hostname and address are null-terminated strings and may be empty.)

From this we can see that the minimum data size of a properly formed SMTP connection message is five bytes, and that there should not be any data left over after you have reached the null byte that ends the address field. You want the decoder to both check these and make it easy to express the constraint that for a particular type of message, the data field must contain exactly and only a certain structure.

(This again is a kind of dependency, this time between the uninterpreted data field and an interpreted version of it. In effect you want to be able to define both the size of a field and the structure of its contents. Although I did not dig deeply, the Python package seemed like it could naturally do only one or the other unless I did a two stage structure decode.)
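A hand-rolled sketch of those checks for the connect data block; a declarative codec package would generate the equivalent of this from the field description:

```python
import struct

def decode_connect(data):
    # Layout: NUL-terminated hostname, one family byte, uint16 port,
    # NUL-terminated address. The minimum well-formed size is five
    # bytes (two NULs plus family plus port), and nothing may be
    # left over after the address's terminating NUL.
    host, sep, rest = data.partition(b"\0")
    if not sep or len(rest) < 4:
        raise ValueError("malformed connect message")
    family = rest[0:1]
    (port,) = struct.unpack(">H", rest[1:3])
    addr, sep, leftover = rest[3:].partition(b"\0")
    if not sep or leftover:
        raise ValueError("trailing garbage in connect message")
    return host, family, port, addr
```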

programming/ProtocolCodecNeeds written at 01:07:02

