Wandering Thoughts archives

2011-08-03

How I encode and decode the milter protocol (or, how to write a codec for a sane binary protocol)

After all of my worrying and investigation of modules to handle protocols for me, I wound up writing an encoder and a decoder for sendmail's milter protocol by hand because it was the simplest way. This is because the milter protocol is a sane binary protocol and it turns out that there's a straightforward way (at least in a dynamic language like Python) to write a codec for such a protocol.

As a sane binary protocol, the milter protocol starts with a packet format:

uint32  len
char    cmd
char    data[len-1]

The cmd byte is the message type, which determines the structure of data. Each message has a fixed structure: a fixed number of fields, each of which is one of a small number of primitive field types. (All of this is what you'd expect for a sane binary protocol.)
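As a rough illustration of the packet framing, here is a minimal Python 3 sketch. The function names pack_packet and unpack_packet are made up for this example, and it assumes that len is sent in network byte order and counts the cmd byte plus the data but not itself (which is what data[len-1] implies).

```python
import struct

def pack_packet(cmd, data=b""):
    """Wrap a one-byte cmd and its payload in the length-prefixed frame."""
    if len(cmd) != 1:
        raise ValueError("cmd must be exactly one byte")
    # len covers the cmd byte plus the payload, but not the length field.
    # Assumption: the length is an unsigned 32-bit value in network order.
    return struct.pack("!I", 1 + len(data)) + cmd + data

def unpack_packet(buf):
    """Split one packet off the front of buf.

    Returns (cmd, data, rest), or None if buf does not yet hold a
    complete packet and the caller needs to read more bytes.
    """
    if len(buf) < 4:
        return None
    (length,) = struct.unpack("!I", buf[:4])
    if length < 1:
        raise ValueError("packet has no cmd byte")
    if len(buf) < 4 + length:
        return None
    cmd = buf[4:5]
    data = buf[5:4 + length]
    return cmd, data, buf[4 + length:]
```

Note that nothing here needs to know what any particular cmd means; the framing is entirely separate from the message structure.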

First you need an encoder and a decoder for every primitive field type and some way that you can find them given a field type. I wound up with eight types for the milter protocol, although you could do it with seven, and I used a simple mapping dict:

codectypes = {
  'buf': (encode_buf, decode_buf),
  'u16': (encode_u16, decode_u16),
  ....
}

(Many of these routines were slight variants on each other; with the right support routines, actual encoders and decoders were mostly two lines per type. In the end I opted not to play fancy tricks with namespaces, partly because I like having simple two-line functions.)
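To give a feel for what such primitive codecs can look like, here is a hedged Python 3 sketch of two of them. The calling convention is my assumption for this example, not necessarily what the real code does: encoders take a value and return bytes, while decoders take the remaining bytes and return the decoded value plus whatever is left over. It also assumes that 'u16' is a big-endian unsigned 16-bit integer and that 'str' is a NUL-terminated byte string.

```python
import struct

def encode_u16(val):
    # Assumption: unsigned 16-bit integer in network byte order.
    return struct.pack("!H", val)

def decode_u16(data):
    if len(data) < 2:
        raise ValueError("ran out of data decoding a u16")
    return struct.unpack("!H", data[:2])[0], data[2:]

def encode_str(val):
    # Assumption: strings are byte strings terminated by a NUL byte,
    # so they cannot contain a NUL themselves.
    if b"\x00" in val:
        raise ValueError("string contains an embedded NUL")
    return val + b"\x00"

def decode_str(data):
    head, sep, rest = data.partition(b"\x00")
    if not sep:
        raise ValueError("unterminated string")
    return head, rest

# Only the types sketched above; a real table would have more entries.
codectypes = {
    'u16': (encode_u16, decode_u16),
    'str': (encode_str, decode_str),
}
```

Having every decoder return the unconsumed remainder is what lets a generic layer decode fields one after another without knowing anything about their sizes.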

With the field types defined, we can now define each message as a sequence of named fields, each with a type. Again there are lots of ways to encode this data and I used brute force:

messages = {
  SMFIC_HEADER: (('name', 'str'),
                 ('value', 'str')),
  ....
}

To decode a message you first read the entire packet (which you can do without knowing anything about the message's structure), then look up the cmd in the messages table to determine the message structure. For each field in the message, you decode an item of the given primitive type and store it under the field name; at the end of decoding, you should have nothing left un-decoded in data (and you should not have run out). You return the cmd byte and a dictionary of all of the fields.
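The decoding loop described above can be sketched roughly as follows in Python 3. This is an illustration and not the actual implementation: DecodeError, the decoders table, and the decode_msg signature are invented for the example, it assumes the packet has already been split into cmd and data, and it assumes that SMFIC_HEADER is the byte 'L' and that 'str' fields are NUL-terminated.

```python
class DecodeError(Exception):
    """Raised when data does not match the message structure."""

def decode_str(data):
    # Assumption: NUL-terminated byte string; returns (value, rest).
    head, sep, rest = data.partition(b"\x00")
    if not sep:
        raise DecodeError("ran out of data in a str field")
    return head, rest

SMFIC_HEADER = b"L"   # assumed value of the header command byte

# Just enough of the two tables to decode one message type.
decoders = {'str': decode_str}
messages = {
    SMFIC_HEADER: (('name', 'str'),
                   ('value', 'str')),
}

def decode_msg(cmd, data):
    """Decode the payload of an already-unframed packet."""
    if cmd not in messages:
        raise DecodeError("unknown cmd %r" % (cmd,))
    fields = {}
    for name, ftype in messages[cmd]:
        # Each decoder consumes its field and hands back the remainder.
        fields[name], data = decoders[ftype](data)
    if data:
        raise DecodeError("%d bytes left over after decoding" % len(data))
    return cmd, fields
```

Both failure cases from the description show up here: running out of data is caught inside the primitive decoder, and leftover data is caught once the last field has been decoded.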

Encoding is the inverse process. You are given the cmd byte and a message dictionary. You look up the message structure in messages, then walk the list; for each named field, you extract its value from the dictionary, encode it as the given primitive type, and concatenate the resulting raw bytes to your data. When the message is fully encoded, you determine len and wrap the whole thing up as a packet.

(My implementation of encoding took this a step further in laziness by using keyword arguments to the encoding function to create the message dictionary; you invoke it as encode_msg(SMFIC_HEADER, name="foo", value="bar").)
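A minimal Python 3 sketch of this style of encoder might look like the following. It is not the real code; the encoders table and the field checking are my additions, and it assumes SMFIC_HEADER is the byte 'L', that 'str' fields are NUL-terminated, and that len is a 32-bit network byte order count of the cmd byte plus the data. Since this is Python 3, the field values are bytes rather than plain strings.

```python
import struct

def encode_str(val):
    # Assumption: strings go on the wire as NUL-terminated byte strings.
    if b"\x00" in val:
        raise ValueError("string contains an embedded NUL")
    return val + b"\x00"

SMFIC_HEADER = b"L"   # assumed value of the header command byte

# Just enough of the two tables to encode one message type.
encoders = {'str': encode_str}
messages = {
    SMFIC_HEADER: (('name', 'str'),
                   ('value', 'str')),
}

def encode_msg(cmd, **fields):
    """Encode a message given as keyword arguments into a full packet."""
    structure = messages[cmd]
    expected = set(name for name, _ in structure)
    if set(fields) != expected:
        raise ValueError("fields do not match the message structure")
    # Walk the structure in order; the keyword arguments have no order
    # that we want to depend on.
    data = b"".join(encoders[ftype](fields[name])
                    for name, ftype in structure)
    # len covers the cmd byte plus the encoded fields.
    return struct.pack("!I", 1 + len(data)) + cmd + data
```

With this, encode_msg(SMFIC_HEADER, name=b"foo", value=b"bar") produces a nine-byte body (the cmd byte plus two four-byte strings) behind the four-byte length.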

This requires minimal code and the code it does need is mostly generic. The actual process of encoding and decoding is data-driven; the protocol itself is essentially specified in the messages dictionary, and adding new messages is trivial as long as they use existing primitive field types. Repeated boilerplate code is almost entirely eliminated.

(This requires a dynamic language partly because it heavily relies on polymorphic argument handling and the ability to ship values around without the intermediate generic encoding and decoding layers having to care what type they are. If you had to do the usual strict static typing, you'd probably need a separate encoding function for each message and I'm not sure how you'd handle decoding.)

On a side note, this means that I need to take back some of the nasty things I said about the milter protocol a year ago. In particular, it does not have messages with a variable number of message fields. (I misread that part of the specification earlier.)

python/HowMilterCodec written at 01:14:26


