== How I encode and decode the milter protocol (or, how to write a codec for a sane binary protocol)

After [[all of my worrying and investigation of modules StructBinaryWeakness]] to handle protocols for me, I wound up writing [[an encoder and a decoder for sendmail's milter protocol PyMilterTools]] by hand because it was the simplest way. This is because the milter protocol is [[a sane binary protocol ../tech/SaneBinaryProtocols]] and it turns out that there's a straightforward way (at least in a dynamic language like Python) to write a codec for such a protocol.

As a sane binary protocol, the milter protocol starts with a packet format:

> uint32 len
> char cmd
> char data[len-1]

The _cmd_ byte is the message type, which determines the structure of _data_. Each message has a fixed structure; there are some fixed number of fields, each of which is one of a small number of primitive field types. (All of this is what you'd expect for [[a sane binary protocol]].)

First you need an encoder and a decoder for every primitive field type and some way that you can find them given a field type. I wound up with eight types for the milter protocol, although you could do it with seven, and I used a simple mapping dict:

> codectypes = {
>   'buf': (encode_buf, decode_buf),
>   'u16': (encode_u16, decode_u16),
>   ....
> }

(Many of these routines were slight variants on each other; with the right support routines, actual encoders and decoders were mostly two lines per type. In the end I opted not to play [[fancy tricks NamespaceMetaclass]] with [[namespaces ClassesAsNamespaces]], partly because I like having simple two-line functions.)

With the field types defined, we can now define each message as a sequence of named fields, each with a type. Again there are lots of ways to encode this data and I used brute force:

> messages = {
>   SMFIC_HEADER: (('name', 'str'),
>                  ('value', 'str')),
>   ....
> }

To decode a message you first read the entire packet (which you can do without knowing anything about the message's structure), then look up the _cmd_ in the _messages_ table to determine the message structure. For each field in the message, you decode an item of the given primitive type and store it under the field name; at the end of decoding, you should have nothing left un-decoded in _data_ (and you should not have run out). You return the _cmd_ byte and a dictionary of all of the fields.

Encoding is the inverse process. You are given the _cmd_ byte and a message dictionary. You look up the message structure in _messages_, then walk the list; for each named field, you extract its value from the dictionary, encode it as the given primitive type, and concatenate the resulting raw bytes to your _data_. When the message is fully encoded, you determine _len_ and wrap the whole thing up as a packet.

(My implementation of encoding took this a step further in laziness by using keyword arguments to the encoding function to create the message dictionary; you invoke it as _``encode_msg(SMFIC_HEADER, name="foo", value="bar")''_.)

This requires minimal code and the code it does need is mostly generic. The actual process of encoding and decoding is data-driven; the protocol itself is basically specified in the _messages_ dictionary, and adding new messages is trivial as long as they use existing primitive field types. Repeated boilerplate code is basically completely eliminated.

(This requires a dynamic language partly because it heavily relies on polymorphic argument handling and the ability to ship values around without the intermediate generic encoding and decoding layers having to care what type they are. If you had to do the usual strict static typing, you'd probably need a separate encoding function for each message and I'm not sure how you'd handle decoding.)

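To make the table-driven approach concrete, here is a minimal Python 3 sketch of the scheme described above. It is an illustration under assumptions, not the actual implementation: it covers only two primitive types, _CodecError_ is a hypothetical name, _decode_msg_ is assumed to be handed one complete packet, strings are assumed to be NUL-terminated, integers are assumed to be in network byte order, and the _SMFIC_HEADER_ value of 'L' is assumed from libmilter's definitions.

```python
import struct

SMFIC_HEADER = "L"  # assumed command byte for the header message


class CodecError(Exception):
    """Hypothetical error type for malformed packets."""


# Primitive codecs. Encoders return bytes; decoders take the remaining
# data and return (value, rest_of_data).
def encode_str(val):
    # Assumption: strings are NUL-terminated. latin-1 is used only so
    # that arbitrary byte values survive a round trip in this sketch.
    return val.encode("latin-1") + b"\0"


def decode_str(data):
    end = data.find(b"\0")
    if end < 0:
        raise CodecError("unterminated string")
    return data[:end].decode("latin-1"), data[end + 1:]


def encode_u16(val):
    return struct.pack("!H", val)  # assumption: network byte order


def decode_u16(data):
    if len(data) < 2:
        raise CodecError("ran out of data")
    return struct.unpack("!H", data[:2])[0], data[2:]


codectypes = {
    "str": (encode_str, decode_str),
    "u16": (encode_u16, decode_u16),
}

# The protocol definition: each message is a sequence of (name, type).
messages = {
    SMFIC_HEADER: (("name", "str"), ("value", "str")),
}


def encode_msg(cmd, **fields):
    """Encode cmd plus keyword-argument fields into a full packet."""
    data = b""
    for name, ftype in messages[cmd]:
        data += codectypes[ftype][0](fields[name])
    body = cmd.encode("ascii") + data
    # len counts the cmd byte plus the data.
    return struct.pack("!I", len(body)) + body


def decode_msg(packet):
    """Decode one complete packet into (cmd, fields dictionary)."""
    if len(packet) < 5:
        raise CodecError("short packet")
    (length,) = struct.unpack("!I", packet[:4])
    if len(packet) != 4 + length:
        raise CodecError("length does not match packet size")
    cmd = packet[4:5].decode("latin-1")
    if cmd not in messages:
        raise CodecError("unknown message type")
    data = packet[5:]
    fields = {}
    for name, ftype in messages[cmd]:
        fields[name], data = codectypes[ftype][1](data)
    if data:
        raise CodecError("un-decoded data left over")
    return cmd, fields
```

In this sketch, _``decode_msg(encode_msg(SMFIC_HEADER, name="foo", value="bar"))''_ gives back the command byte and the original dictionary of fields. Supporting another message that uses only these primitive types would mean adding one more entry to _messages_ and nothing else.
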
On a side note, this means that I need to take back some of the nasty things I said about the milter protocol [[back a year ago ../spam/WhyMilters]]. In particular, it does not have messages with a variable number of message fields. (I misread that part of the specification earlier.)