Why Python's struct module isn't great for handling binary protocols

October 26, 2010

On the surface, the struct module looks like the right answer if you need to handle a binary protocol. In practice it is not the right answer, or at least it isn't the complete answer, because it has some weaknesses and limitations. The issues are primarily but not entirely in the area of decoding binary messages as opposed to creating them ('deserialization' versus 'serialization').

Note that I am not blaming the struct module for this. Its documentation is very straightforward about what it's for (translating between Python bytestrings and basically fixed-size C structs); pressing it into service for binary protocols is a classic case of hitting a problem with whatever tools the Python standard library has at hand.

That said, it has three major weaknesses that I see:

  • it doesn't support some common data types, most visibly null-terminated C strings; in general it has no support for delimited fields instead of counted ones. Delimited fields are quite common in binary protocols.

  • it can't deal with the common case of self-describing variable length structures. For example, consider the following protocol message:
    byte cmd
    uint16 len
    byte data[len]

    Here the message has a fixed size header that gives the length of the remaining data. Decoding this message with the struct module requires a two-step approach (where you have to decode the first two fields separately from the third), instead of a single-step one.

  • it has no support for the higher-level 'parsing' that's required to decide which particular protocol message you're seeing in the input stream. Protocol messages often have common starting fields that let you decode what message you're receiving and then variable fields after that; with the struct module you again need a two-stage decode process.

    (Some protocols are sufficiently perverse that they need a multi-stage decode process because there are sub-message variants of particular messages with different numbers of fields.)
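As a minimal sketch of the two-step decode described above, here is what the cmd/len/data message looks like with the struct module. The byte order (network order, via "!") and the function names are assumptions made for illustration; the message description itself doesn't specify them. A hand-rolled helper for null-terminated strings is included, since struct has no format code for them.

```python
import struct

# Assumed layout: 1-byte cmd, 2-byte big-endian len, then `len` data bytes.
HEADER = struct.Struct("!BH")

def decode_message(buf, offset=0):
    """Return (cmd, data, next_offset) for one message starting at offset."""
    # Step one: decode the fixed-size header.
    cmd, length = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    end = start + length
    if end > len(buf):
        raise ValueError("truncated message")
    # Step two: slice out the variable-length part, sized by the header.
    return cmd, bytes(buf[start:end]), end

def decode_cstring(buf, offset=0):
    """Return (string_bytes, next_offset) for a null-terminated field."""
    end = buf.index(b"\0", offset)  # raises ValueError if there is no terminator
    return bytes(buf[offset:end]), end + 1
```

Note that the structure of the message is now spread across a format string, some offset arithmetic, and a slice, which is exactly the sort of thing that makes the protocol hard to see in the code.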

I put 'parsing' in quotes above because you can't really handle binary protocols with a traditional parsing approach (and certainly not with most traditional parser generators). Traditional parsing assumes that you have context-independent lexing, and protocol decoding like this is all about context-dependent lexing; you only know what the next field is once you've figured out the message so far. What you want is less a traditional parser and more a decision tree that's generated from a description of all of your messages (in well-designed protocols, the decision tree is unambiguous).
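To make the decision tree idea concrete, here is a rough sketch of the simplest possible version: a one-level tree where a leading command byte picks the list of fields that follow. The two messages ("open" and "seek"), their fields, and the big-endian byte order are all invented for this example and do not come from any real protocol.

```python
import struct

# Hypothetical field decoders: each takes (buf, offset) and
# returns (value, next_offset).
def u8(buf, off):
    return buf[off], off + 1

def u16(buf, off):
    return struct.unpack_from("!H", buf, off)[0], off + 2

def cstr(buf, off):
    end = buf.index(b"\0", off)
    return buf[off:end], end + 1

# Hypothetical protocol description: the leading cmd byte decides
# which fields come next.
MESSAGES = {
    1: ("open", [("flags", u8), ("path", cstr)]),
    2: ("seek", [("handle", u16), ("position", u16)]),
}

def decode(buf, off=0):
    """Return (message_name, field_values, next_offset)."""
    cmd, off = u8(buf, off)
    try:
        name, fields = MESSAGES[cmd]
    except KeyError:
        raise ValueError("unknown command %d" % cmd)
    values = {}
    # What the next field is depends on what has been decoded so far.
    for field_name, decoder in fields:
        values[field_name], off = decoder(buf, off)
    return name, values, off
```

A protocol with sub-message variants would need deeper nesting than this flat table, but the principle is the same: the description of the messages drives the decoding, instead of the decoding logic being written out by hand for each message.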

(I think that the most natural looking parser for a binary protocol would be a recursive descent parser with no backtracking, but the parser code would not map well to the actual structure of the messages.)

You can more or less do this kind of protocol decoding with a combination of the struct module and regular expressions (for parsing delimited fields), but it's awkward. You're going to wind up with code that doesn't make the structure of the protocol messages very obvious, and in turn that opens up the chance for errors.
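Here is a sketch of what that combination can look like, assuming a made-up message of a uint16 id, a null-terminated name, and a uint32 size, all in network byte order. The message layout and the names are hypothetical.

```python
import re
import struct

# Matches everything up to and including the first null byte.
CSTRING = re.compile(rb"([^\x00]*)\x00")

def decode_entry(buf, offset=0):
    """Return (ident, name, size, next_offset) for the assumed layout."""
    (ident,) = struct.unpack_from("!H", buf, offset)
    offset += 2
    # The delimited field: struct can't do this, so use a regexp
    # anchored at the current offset.
    m = CSTRING.match(buf, offset)
    if m is None:
        raise ValueError("unterminated name")
    name = m.group(1)
    offset = m.end()
    (size,) = struct.unpack_from("!I", buf, offset)
    return ident, name, size, offset + 4
```

It works, but the three fields of the message are decoded by three separate pieces of code with manual offset tracking in between, and nothing in the code reads like a description of the message.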

Encoding binary messages is easier than decoding them, but the lack of support for delimited fields again makes you use a multi-step process.
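For illustration, encoding a hypothetical message made of a uint16 id, a null-terminated name, and a uint32 size (network byte order assumed) might look like the following sketch. The fixed-size fields go through struct, while the delimited field has to be assembled by hand.

```python
import struct

def encode_entry(ident, name, size):
    """Encode the assumed layout: uint16 id, name + null byte, uint32 size."""
    if b"\0" in name:
        # A delimited field can't contain its own delimiter.
        raise ValueError("name may not contain a null byte")
    return b"".join([
        struct.pack("!H", ident),
        name, b"\0",
        struct.pack("!I", size),
    ])
```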

(A really great solution to this problem would even handle the case of self-describing variable length structures, where you can only know the size of something once you've encoded its sub-pieces.)
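Done by hand, the nested case looks something like this sketch, which assumes a made-up record format of a tag byte, a uint16 length, and a payload, where a container's payload is itself a sequence of records. The inner records have to be encoded first so that the outer length is known.

```python
import struct

def encode_record(tag, payload):
    """Encode one record: tag byte, uint16 length, then the payload bytes."""
    if len(payload) > 0xFFFF:
        raise ValueError("payload too large for a uint16 length")
    return struct.pack("!BH", tag, len(payload)) + payload

def encode_container(tag, children):
    """Encode a record whose payload is a list of (tag, payload) records."""
    # The outer length can't be written until every child is encoded.
    body = b"".join(encode_record(t, p) for t, p in children)
    return encode_record(tag, body)
```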

Comments on this page:

From at 2010-10-26 03:48:14:

I'm pretty sure the construct library, at http://construct.wikispaces.com/ is the kind of thing you're looking for. That is, unless blazing performance is important to you :-(. A high-performance version of construct would be awesome.

-- Gary Capell

From at 2010-10-26 09:50:37:

I found the same issues when using Python. I should have written a smart parser for my C/C++ communication headers, but I ended up writing strict code for each communication application I approached. I plan on looking at www.gccxml.org next time I have to do a communication application in Python.

Written on 26 October 2010.