Why Python's struct module isn't great for handling binary protocols
On the surface, the struct module looks like the right answer if you need to handle a binary protocol. In practice it is not the right answer, or at least it isn't the complete answer, because it has some weaknesses and limitations. The issues are primarily but not entirely in the area of decoding binary messages as opposed to creating them ('deserialization' versus 'serialization').
Note that I am not blaming the struct module for this. Its documentation is very straightforward about what it's for (translating between Python bytestrings and basically fixed-size C structs); pressing it into service for binary protocols is a classic case of hitting a problem with whatever tools the Python standard library has at hand.
That said, it has three major weaknesses that I see:
- it doesn't support some common data types, most visibly null-terminated
C strings; in general it has no support for delimited fields instead
of counted ones. Delimited fields are quite common in binary protocols.
- it can't deal with the common case of self-describing variable length
structures. For example, consider the following protocol message:
    byte cmd
    uint16 len
    byte data[len]
Here the message has a fixed size header that gives the length of the remaining data. Decoding this message with the struct module requires a two-step approach (where you have to decode the first two fields separately from the third), instead of a single-step one.
- it has no support for the higher-level 'parsing' that's required to
decide which particular protocol message you're seeing in the
input stream. Protocol messages often have common starting fields
that let you decode what message you're receiving and then variable
fields after that; with the struct module you again need a two-stage
decode process.
(Some protocols are sufficiently perverse that they need a multi-stage decode process because there are sub-message variants of particular messages with different numbers of fields.)
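To make the two-step decode concrete, here is a minimal sketch for the cmd/len/data message above. The byte order is my assumption (network order, no padding), since the message description doesn't specify one, and `decode_message` is just an illustrative name.

```python
import struct

# Assumed wire layout: network byte order, no padding.
# 'B' is the one-byte cmd, 'H' is the uint16 len.
HEADER = struct.Struct("!BH")

def decode_message(buf):
    """Return (cmd, data, rest), or None if buf doesn't hold a full message."""
    # Step one: decode the fixed-size header on its own.
    if len(buf) < HEADER.size:
        return None
    cmd, length = HEADER.unpack_from(buf, 0)

    # Step two: only now do we know how many bytes of data follow.
    end = HEADER.size + length
    if len(buf) < end:
        return None
    return cmd, bytes(buf[HEADER.size:end]), buf[end:]
```

The struct module does the header for you, but the length check, the slicing, and the bookkeeping for leftover bytes are all your own code.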
I put 'parsing' in quotes above because you can't really handle binary protocols with a traditional parsing approach (and certainly not with most traditional parser generators). Traditional parsing assumes that you have context-independent lexing, and protocol decoding like this is all about context-dependent lexing; you only know what the next field is once you've figured out the message so far. What you want is less a traditional parser and more a decision tree that's generated from a description of all of your messages (in well-designed protocols, the decision tree is unambiguous).
(I think that the most natural looking parser for a binary protocol would be a recursive descent parser with no backtracking, but the parser code would not map well to the actual structure of the messages.)
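A hand-written, one-level version of that decision tree is a dispatch table keyed on the common starting field. This is only a sketch under invented assumptions: the two message types, their command codes, and their body layouts are all hypothetical, and a real version would be generated from a message description rather than typed in.

```python
import struct

def decode_ping(body):
    # Hypothetical message: the body is a single uint32 sequence number.
    (seq,) = struct.unpack("!I", body)
    return {"type": "ping", "seq": seq}

def decode_name(body):
    # Hypothetical message: the body is one null-terminated string.
    name, sep, rest = body.partition(b"\0")
    if not sep or rest:
        raise ValueError("malformed name message")
    return {"type": "name", "name": name}

# The 'decision tree', one level deep: command byte -> body decoder.
DECODERS = {0x01: decode_ping, 0x02: decode_name}

def decode(buf):
    # First stage: the common header (assumed network byte order).
    # A buffer shorter than the header raises struct.error here.
    cmd, length = struct.unpack_from("!BH", buf, 0)
    body = buf[3:3 + length]
    if len(body) != length:
        raise ValueError("short message")
    # Second stage: pick the decoder based on what we've seen so far.
    try:
        decoder = DECODERS[cmd]
    except KeyError:
        raise ValueError("unknown command %#x" % cmd)
    return decoder(body)
```

Protocols with sub-message variants would need further tables (or further branching) inside the individual decoders.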
You can more or less do this kind of protocol decoding with a combination of the struct module and regular expressions (for parsing delimited fields), but it's awkward. You're going to wind up with code that doesn't make the structure of the protocol messages very obvious, and that in turn opens up the chance for errors.
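As an illustration of the struct plus regular expressions combination, here is a rough sketch that decodes two fixed fields followed by a null-terminated string. The record layout (a type byte, a uint16 of flags, then the string) is made up for the example, as are the names.

```python
import re
import struct

# Hypothetical record: a type byte and uint16 flags (network byte order),
# followed by a null-terminated string.
FIXED = struct.Struct("!BH")
CSTRING = re.compile(rb"([^\x00]*)\x00")

def decode_record(buf):
    """Return (kind, flags, name, offset just past the record)."""
    # struct handles the fixed-size part...
    kind, flags = FIXED.unpack_from(buf, 0)
    # ...and a bytes regexp, anchored right after it, handles the delimited part.
    m = CSTRING.match(buf, FIXED.size)
    if m is None:
        raise ValueError("unterminated string")
    return kind, flags, m.group(1), m.end()
```

Notice that nothing in this code says "type, flags, name" in one place; the shape of the record is spread across a format string, a regexp, and the glue between them.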
Encoding binary messages is easier than decoding them, but the lack of support for delimited fields again makes you use a multi-step process.
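The encoding side of a delimited field might look like the following sketch, which uses a made-up record of a type byte, a uint16 of flags, and a null-terminated string (network byte order is assumed). The fixed part goes through struct and the string is bolted on separately.

```python
import struct

def encode_record(kind, flags, name):
    """Encode a hypothetical record: type byte, uint16 flags, C string."""
    # A delimited field can't contain its own delimiter, and struct
    # won't check that for us.
    if b"\0" in name:
        raise ValueError("name contains a null byte")
    # Step one: the fixed-size fields. Step two: the string and terminator.
    return struct.pack("!BH", kind, flags) + name + b"\0"
```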
(A really great solution to this problem would even handle the case of self-describing variable length structures, where you can only know the size of something once you've encoded its sub-pieces.)
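Done by hand, this means encoding from the inside out. The sketch below reuses the cmd/len/data layout from earlier (with network byte order assumed) and invents a container message whose data is a series of complete sub-messages; the function names are illustrative only.

```python
import struct

def encode_message(cmd, data):
    # The length field is filled in from data that already exists.
    # struct.pack raises struct.error if len(data) doesn't fit in a uint16.
    return struct.pack("!BH", cmd, len(data)) + data

def encode_container(cmd, items):
    """Encode a hypothetical container of (cmd, data) sub-messages."""
    # The sub-pieces must be encoded first; only then is the outer
    # length known.
    body = b"".join(encode_message(c, d) for c, d in items)
    return encode_message(cmd, body)
```

For example, `encode_container(9, [(1, b"ab"), (2, b"")])` produces an outer header with a length of 8, which is the combined size of the two encoded sub-messages.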