How I encode and decode the milter protocol (or, how to write a codec for a sane binary protocol)
After all of my worrying and investigation of modules to handle protocols for me, I wound up writing an encoder and a decoder for sendmail's milter protocol by hand because it was the simplest way. This is because the milter protocol is a sane binary protocol and it turns out that there's a straightforward way (at least in a dynamic language like Python) to write a codec for such a protocol.
As a sane binary protocol, the milter protocol starts with a packet format:
    uint32 len
    char   cmd
    char   data[len-1]
The cmd byte is the message type, which determines the structure of data. Each message has a fixed structure: a fixed number of fields, each of which is one of a small number of primitive field types. (All of this is what you'd expect for a sane binary protocol.)
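To make the packet format concrete, here is a minimal sketch of framing and unframing in Python 3. It is an illustration, not the actual implementation; the function names pack_packet and unpack_packet are hypothetical, and it assumes len is sent in network byte order and counts the cmd byte plus data.

```python
import struct

def pack_packet(cmd, data=b""):
    """Frame one message as: uint32 len, char cmd, char data[len-1]."""
    if len(cmd) != 1:
        raise ValueError("cmd must be exactly one byte")
    # Assumption: len is big-endian (network order) and covers cmd + data.
    return struct.pack("!I", 1 + len(data)) + cmd + data

def unpack_packet(buf):
    """Split one packet off the front of buf.

    Returns (cmd, data, rest), or None if buf does not hold a full packet yet.
    """
    if len(buf) < 4:
        return None
    (length,) = struct.unpack("!I", buf[:4])
    if length < 1:
        raise ValueError("packet length must at least cover the cmd byte")
    end = 4 + length
    if len(buf) < end:
        return None
    return buf[4:5], buf[5:end], buf[end:]
```

Note that unpack_packet needs no knowledge of any message's structure; the length prefix alone is enough to pull a whole packet off the wire.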
First you need an encoder and a decoder for every primitive field type and some way that you can find them given a field type. I wound up with eight types for the milter protocol, although you could do it with seven, and I used a simple mapping dict:
codectypes = {
    'buf': (encode_buf, decode_buf),
    'u16': (encode_u16, decode_u16),
    ....
}
(Many of these routines were slight variants on each other; with the right support routines, actual encoders and decoders were mostly two lines per type. In the end I opted not to play fancy tricks with namespaces, partly because I like having simple two-line functions.)
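As a rough sketch of what such two-line primitives can look like in Python 3 (not the actual code): the support routines pack_int and unpack_int are hypothetical names, and the sketch assumes that 'u16' is a big-endian 16-bit integer, that 'str' is a NUL-terminated byte string, and that 'buf' means "all of the remaining data". Each decoder returns the decoded value plus whatever data is left over.

```python
import struct

# Hypothetical support routines shared by the integer types.
def pack_int(fmt, val):
    return struct.pack(fmt, val)

def unpack_int(fmt, data):
    size = struct.calcsize(fmt)
    if len(data) < size:
        raise ValueError("ran out of data")
    return struct.unpack(fmt, data[:size])[0], data[size:]

# Assumption: 'u16' is a 16-bit unsigned integer in network byte order.
def encode_u16(val):
    return pack_int("!H", val)
def decode_u16(data):
    return unpack_int("!H", data)

# Assumption: 'str' is a NUL-terminated byte string.
def encode_str(val):
    if b"\0" in val:
        raise ValueError("embedded NUL in string")
    return val + b"\0"
def decode_str(data):
    val, sep, rest = data.partition(b"\0")
    if not sep:
        raise ValueError("unterminated string")
    return val, rest

# Assumption: 'buf' swallows everything that remains in the packet.
def encode_buf(val):
    return bytes(val)
def decode_buf(data):
    return data, b""

codectypes = {
    'buf': (encode_buf, decode_buf),
    'u16': (encode_u16, decode_u16),
    'str': (encode_str, decode_str),
}
```

Having every decoder return a (value, remaining data) pair is one simple convention that lets generic code chain decoders without knowing how many bytes each type consumes.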
With the field types defined, we can now define each message as a sequence of named fields, each with a type. Again there are lots of ways to encode this data and I used brute force:
messages = {
    SMFIC_HEADER: (('name', 'str'), ('value', 'str')),
    ....
}
To decode a message you first read the entire packet (which you can do without knowing anything about the message's structure), then look up the cmd in the messages table to determine the message structure. For each field in the message, you decode an item of the given primitive type and store it under the field name; at the end of decoding, you should have nothing left un-decoded in data (and you should not have run out). You return the cmd byte and a dictionary of all of the fields.
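The decoding loop described above can be sketched in a few lines of Python 3. This is an illustration under assumptions rather than the real code: decode_msg is a hypothetical name, the tables are cut down to a single message and a single field type, SMFIC_HEADER is assumed to be the byte 'L', and 'str' is assumed to be a NUL-terminated byte string.

```python
SMFIC_HEADER = b"L"   # assumption: the header command byte is 'L'

def encode_str(val):
    return val + b"\0"

def decode_str(data):
    val, sep, rest = data.partition(b"\0")
    if not sep:
        raise ValueError("ran out of data while decoding a string")
    return val, rest

# Cut-down versions of the two tables, enough for one message type.
codectypes = {'str': (encode_str, decode_str)}
messages = {SMFIC_HEADER: (('name', 'str'), ('value', 'str'))}

def decode_msg(cmd, data):
    """Decode the data of one already-unframed packet into a field dict."""
    if cmd not in messages:
        raise ValueError("unknown message type %r" % (cmd,))
    fields = {}
    for name, ftype in messages[cmd]:
        decoder = codectypes[ftype][1]
        # Each decoder returns the value and the not-yet-decoded remainder.
        fields[name], data = decoder(data)
    if data:
        raise ValueError("data left over after decoding all fields")
    return cmd, fields
```

For example, decode_msg(SMFIC_HEADER, b"Subject\0hello\0") gives back the cmd byte and a dictionary with name and value entries; trailing junk or a truncated field raises an error instead.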
Encoding is the inverse process. You are given the cmd byte and a message dictionary. You look up the message structure in messages, then walk the list; for each named field, you extract its value from the dictionary, encode it as the given primitive type, and concatenate the resulting raw bytes to your data. When the message is fully encoded, you determine len and wrap the whole thing up as a packet.
(My implementation of encoding took this a step further in laziness by using keyword arguments to the encoding function to create the message dictionary; you invoke it as encode_msg(SMFIC_HEADER, name="foo", value="bar").)
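A keyword-argument encoder along these lines might look like the following Python 3 sketch. It is not the actual implementation: the body of encode_msg is a guess at the approach, the tables are cut down to one message, SMFIC_HEADER is assumed to be the byte 'L', len is assumed to be big-endian, and field values are byte strings here (the call above uses Python 2 style strings).

```python
import struct

SMFIC_HEADER = b"L"   # assumption: the header command byte is 'L'

def encode_str(val):
    return val + b"\0"

def decode_str(data):
    val, sep, rest = data.partition(b"\0")
    if not sep:
        raise ValueError("unterminated string")
    return val, rest

# Cut-down versions of the two tables, enough for one message type.
codectypes = {'str': (encode_str, decode_str)}
messages = {SMFIC_HEADER: (('name', 'str'), ('value', 'str'))}

def encode_msg(cmd, **fields):
    """Encode a message, given as keyword arguments, into a full packet."""
    structure = messages[cmd]
    expected = set(name for name, _ in structure)
    if set(fields) != expected:
        raise ValueError("fields do not match the message structure")
    # Walk the structure in order so fields land in protocol order,
    # regardless of the order the keyword arguments were given in.
    data = b"".join(codectypes[ftype][0](fields[name])
                    for name, ftype in structure)
    # Assumption: len is big-endian and counts the cmd byte plus data.
    return struct.pack("!I", 1 + len(data)) + cmd + data
```

With this sketch, encode_msg(SMFIC_HEADER, name=b"foo", value=b"bar") produces a nine-byte payload (the cmd byte plus two four-byte NUL-terminated strings) behind its four-byte length prefix.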
This requires minimal code and the code it does need is mostly generic. The actual process of encoding and decoding is data-driven; the protocol itself is basically specified in the messages dictionary, and adding new messages is trivial as long as they use existing primitive field types. Repeated boilerplate code is basically completely eliminated.
(This requires a dynamic language partly because it heavily relies on polymorphic argument handling and the ability to ship values around without the intermediate generic encoding and decoding layers having to care what type they are. If you had to do the usual strict static typing, you'd probably need a separate encoding function for each message and I'm not sure how you'd handle decoding.)
On a side note, this means that I need to take back some of the nasty things I said about the milter protocol a year ago. In particular, it does not have messages with a variable number of message fields. (I misread that part of the specification earlier.)