Why Python's struct module isn't great for handling binary protocols
October 26, 2010
On the surface, the struct module looks like the right answer if you need to handle a binary protocol. In practice it is not the right answer, or at least it isn't the complete answer, because it has some weaknesses and limitations. The issues are primarily but not entirely in the area of decoding binary messages as opposed to creating them ('deserialization' versus 'serialization').
Note that I am not blaming the struct module for this. Its documentation is very straightforward about what it's for (translating between Python bytestrings and basically fixed-size C structs); pressing it into service for binary protocols is a classic case of hitting a problem with whatever tools the Python standard library has at hand.
That said, it has three major weaknesses that I see:
I put 'parsing' in quotes above because you can't really handle binary protocols with a traditional parsing approach (and certainly not with most traditional parser generators). Traditional parsing assumes that you have context-independent lexing, and protocol decoding like this is all about context-dependent lexing; you only know what the next field is once you've figured out the message so far. What you want is less a traditional parser and more a decision tree that's generated from a description of all of your messages (in well designed protocols, the decision tree is unambiguous).
(I think that the most natural looking parser for a binary protocol would be a recursive descent parser with no backtracking, but the parser code would not map well to the actual structure of the messages.)
You can sort of do this sort of protocol decoding with a combination of the struct module and regular expressions (for parsing delimited fields), but it's awkward. You're going to wind up with code that doesn't make the structure of the protocol messages very obvious, and in turn that opens up the chance for errors.
Encoding binary messages is easier than decoding them, but the lack of support for delimited fields again makes you use a multi-step process.
(A really great solution to this problem would even handle the case of self-describing variable length structures, where you can only know the size of something once you've encoded its sub-pieces.)
Written on 26 October 2010.
* * *