What I need for stream decoding and encoding binary protocols
I've recently started poking a bit at my milter project. The first part of this is to write something that can talk the milter protocol and thus decode and encode its binary messages. Writing binary parsers and generators by hand is annoying; ideally you'd write a declarative description of all of the binary messages, feed it to a package or a library, and get a codec back. So I gave it a spin with a Python package nominally intended for this purpose. The experience was educational but not successful, ultimately because I was asking the package to do things it wasn't really designed to do.
To summarize my views, the package was primarily designed as a packet-oriented reader; by packet-oriented I mean that it could assume that it had the whole object to decode and nothing but the object. What I need is a codec that works on streams, ideally without trusting them to be well-formed. This calls for some capabilities that the Python package was missing, and based on my experience to date, here is what I think they are.
A stream-capable decoder system needs to be able to do two additional things over a packet-based decoder, because when you are dealing with streams you can have either extra data (from the next structure) or not enough data. Thus your decoder must be able to tell you how much of the buffer it consumed in the process of a successful structure decode, and it must be able to say 'structure incomplete, read more data and try again'.
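To make that concrete, here is a rough sketch in Python of the shape of interface I mean; the names and the read loop are mine, not anything out of a particular package:

# A placeholder for whatever a stream-capable codec would generate for you.
class Incomplete(Exception):
    """Raised when the buffer ends before the structure does."""

def decode(buf):
    """Return (message, bytes_consumed) or raise Incomplete."""
    raise NotImplementedError

def read_messages(sock):
    """Drive decode() over a socket, coping with partial and extra data."""
    buf = b""
    while True:
        try:
            msg, used = decode(buf)
        except Incomplete:
            chunk = sock.recv(4096)
            if not chunk:
                return              # connection closed mid-stream
            buf += chunk
            continue
        buf = buf[used:]            # keep any bytes belonging to the next message
        yield msg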
(Stream decoding also puts some requirements on the message structure, but I don't think a package needs to check this explicitly. You're a protocol codec package, not a protocol verifier; you're entitled to assume that the protocol itself works for its declared purpose.)
What you need to be a genuine bidirectional codec instead of mostly a reader is some way to express two-way field dependencies. To make this concrete, I'll use the milter protocol itself. All milter messages have the following general form:
uint32 len char cmd char data[len-1]
When you decode this message, the data portion depends on len; the Python package could express this dependency and automatically determine how big data was when parsing the message. But when you encode this message the len portion depends on data, and the Python package could not express this dependency (as far as I could see); I would have had to compute it by hand and supply it explicitly when I asked the package to build the encoded message.
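To illustrate, here is roughly what handling this framing by hand looks like in Python (my own sketch, not anything the package generates). The point of a bidirectional codec is that you would declare the relationship between len and data once, instead of writing it in both directions yourself:

import struct

def decode_msg(buf):
    """Return ((cmd, data), bytes_consumed), or None if buf is incomplete."""
    if len(buf) < 4:
        return None                     # length field not all here yet
    (mlen,) = struct.unpack(">I", buf[:4])
    if len(buf) < 4 + mlen:
        return None                     # body not all here yet
    cmd = buf[4:5]
    data = buf[5:4 + mlen]              # data's size comes from len
    return (cmd, data), 4 + mlen

def encode_msg(cmd, data):
    """Build a message; here len has to be computed from data by hand."""
    mlen = 1 + len(data)                # the dependency in the other direction
    return struct.pack(">I", mlen) + cmd + data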
(The Ruby BinData package has support for this, as an example of how it can be done.)
A good decoder checks that the structures are well formed. For example,
the data contents of the milter message for an SMTP connection look like this:
char hostname[] char family uint16 port char address[]
(The hostname and address are null-terminated strings and may be empty.)
From this we can see that the minimum data size of a properly formed SMTP connection message is five bytes, and that there should not be any data left over after you have reached the null byte that ends the address field. You want the decoder to both check these and make it easy to express the constraint that for a particular type of message, the data field must contain exactly and only a certain structure.
(This again is a kind of dependency, this time between the uninterpreted data field and an interpreted version of it. In effect you want to be able to define both the size of a field and the structure of its contents. Although I did not dig deeply, the Python package seemed like it could naturally do only one or the other unless I did a two-stage structure decode.)
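As a hand-written illustration of the checking I want a codec to do for me (my own sketch; the exact field handling and the network byte order for the port are my reading of the milter protocol, not something the package supplies):

import struct

class MalformedMessage(Exception):
    pass

def decode_connect_data(data):
    """Decode and validate the data field of an SMTP connection message."""
    if len(data) < 5:
        raise MalformedMessage("connect data shorter than the minimum of 5 bytes")
    nul = data.find(b"\0")
    if nul == -1:
        raise MalformedMessage("hostname is not null-terminated")
    hostname = data[:nul]
    rest = data[nul + 1:]
    if len(rest) < 4:
        raise MalformedMessage("truncated after hostname")
    family = rest[0:1]
    (port,) = struct.unpack(">H", rest[1:3])   # assuming network byte order
    addr = rest[3:]
    # Exactly one null, at the very end: nothing may follow the address field.
    if not addr.endswith(b"\0") or b"\0" in addr[:-1]:
        raise MalformedMessage("data left over after the address field")
    address = addr[:-1]
    return hostname, family, port, address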