Wandering Thoughts archives

2010-10-26

Why Python's struct module isn't great for handling binary protocols

On the surface, the struct module looks like the right answer if you need to handle a binary protocol. In practice it is not the right answer, or at least it isn't the complete answer, because it has some weaknesses and limitations. The issues are primarily but not entirely in the area of decoding binary messages as opposed to creating them ('deserialization' versus 'serialization').

Note that I am not blaming the struct module for this. Its documentation is very straightforward about what it's for (translating between Python bytestrings and basically fixed-size C structs); pressing it into service for binary protocols is a classic case of hitting a problem with whatever tools the Python standard library has at hand.

That said, it has three major weaknesses that I see:

  • it doesn't support some common data types, most visibly null-terminated C strings; in general it has no support for delimited fields, as opposed to counted ones. Delimited fields are quite common in binary protocols.

  • it can't deal with the common case of self-describing variable length structures. For example, consider the following protocol message:
    byte cmd
    uint16 len
    byte data[len]

    Here the message has a fixed-size header that gives the length of the remaining data. Decoding this message with the struct module requires a two-step approach (where you have to decode the first two fields separately from the third) instead of a single-step one; see the sketch after this list.

  • it has no support for the higher-level 'parsing' that's required to decide which particular protocol message you're seeing in the input stream. Protocol messages often have common starting fields that let you decode what message you're receiving and then variable fields after that; with the struct module you again need a two-stage decode process.

    (Some protocols are sufficiently perverse that they need a multi-stage decode process because there are sub-message variants of particular messages with different numbers of fields.)
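
As a concrete sketch of what this looks like in practice, here is roughly how you wind up decoding the example message above with the struct module (the command codes and body layouts here are made up purely for illustration):

import struct

# Illustrative two-step decode of the 'cmd, len, data[len]' message;
# struct can't express 'data[len]' directly, so we slice by hand.
HDR = struct.Struct('>BH')          # byte cmd, uint16 len (big-endian)

def decode_msg(buf):
    # step one: the fixed-size header tells us how long the rest is
    cmd, dlen = HDR.unpack(buf[:HDR.size])
    data = buf[HDR.size:HDR.size + dlen]
    if len(data) != dlen:
        raise ValueError("truncated message")
    # step two: only now can we interpret the data portion, based on
    # the cmd field (the higher-level 'parsing' step)
    if cmd == 0x01:                 # made-up message with a fixed body
        return cmd, struct.unpack('>HH', data)
    elif cmd == 0x02:               # made-up message with free-form data
        return cmd, data
    raise ValueError("unknown command %d" % cmd)

Even this simple case needs two separate unpacks plus manual buffer slicing.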

I put 'parsing' in quotes above because you can't really handle binary protocols with a traditional parsing approach (and certainly not with most traditional parser generators). Traditional parsing assumes that you have context-independent lexing, and protocol decoding like this is all about context-dependent lexing; you only know what the next field is once you've figured out the message so far. What you want is less a traditional parser and more a decision tree that's generated from a description of all of your messages (in well designed protocols, the decision tree is unambiguous).

(I think that the most natural looking parser for a binary protocol would be a recursive descent parser with no backtracking, but the parser code would not map well to the actual structure of the messages.)

You can more or less do this sort of protocol decoding with a combination of the struct module and regular expressions (for the delimited fields), but it's awkward. You're going to wind up with code that doesn't make the structure of the protocol messages very obvious, which in turn opens up the chance for errors.
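
For instance, pulling a NUL-terminated string out of a message winds up looking something like this sketch (the message layout is again made up): struct for the fixed fields, plus a manual search or a regular expression for the delimiter.

import struct

# Made-up layout: uint16 code, NUL-terminated name, uint32 flags.
# struct can't express the NUL-terminated part, so we have to find
# the delimiter ourselves (a regexp would do just as well here).
def decode_named(buf):
    (code,) = struct.unpack('>H', buf[:2])
    nul = buf.index('\0', 2)
    name = buf[2:nul]
    (flags,) = struct.unpack('>I', buf[nul+1:nul+5])
    return code, name, flags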

Encoding binary messages is easier than decoding them, but the lack of support for delimited fields again makes you use a multi-step process.

(A really great solution to this problem would even handle the case of self-describing variable length structures, where you can only know the size of something once you've encoded its sub-pieces.)
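
As a small sketch of that multi-step encoding for the earlier example message (using the same made-up layout as before), the header's length field can only be filled in once the data portion already exists:

import struct

# Made-up encoder for the 'cmd, len, data[len]' message; the len field
# is computed from the already-built data portion.
def encode_msg(cmd, data):
    return struct.pack('>BH', cmd, len(data)) + data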

StructBinaryWeakness written at 00:01:13

2010-10-21

The problems with testing for round-trip capable codecs

In theory, testing a Python codec to see if it's round-trip capable is pretty simple:

def willtrip(cname):
    # Python 2: check that every byte value decodes to Unicode and then
    # encodes back to exactly the same byte under the codec 'cname'.
    c1 = [chr(x) for x in range(0, 256)]
    try:
        uc = [x.decode(cname) for x in c1]
        c2 = [x.encode(cname) for x in uc]
    except UnicodeError:
        return False
    return c1 == c2

(Since it's been a while, I had to consult my notes to remember which of .decode() and .encode() I wanted to use when.)

When I started to think about this code more, I realized that it had a flaw; it assumed that the codec was relatively sane, or specifically that the codec was what I'll call a 'context-independent conversion'.

You can only use the results of willtrip() if you assume that the results of converting single bytes back and forth are the same as converting multi-byte strings. This is only true if two things are the case. First, that the codec always errors out if it sees an incomplete multi-byte sequence, and second, that it always converts the resulting Unicode code points back to the same byte values regardless of the surrounding Unicode code points.

(Well, regardless of the surrounding Unicode code points assuming that all of them are from upconverting bytes. But that's an assumption you have to make anyways when you're round-tripping stuff this way; you can never assume that inserting other Unicode codepoints is harmless.)

As far as I can tell, the codecs module does not formally require that codecs have either of these properties, so in theory you could have a sufficiently perverse character encoding and a Python codec for it. In practice I believe that all of the round-trip capable codecs are context independent in this way.

(I believe that some non-roundtrippable codecs are in fact context dependent this way, for example the ISO 2022 series.)

Sidebar: more on codec sanity

Out of curiosity, I put together some code to look for odd results in codec transformations. The results were disappointingly boring and sensible, with only a few odd things turning up:

  • a couple of codecs had non-reversible transformations, where a byte -> Unicode -> byte round trip mapped back to a different byte. cp875 is the big example; seven different bytes are all mapped to U+1A.

    (This appears to be because all of them are unused in cp875.)

  • iso2022_kr decodes two bytes (14 and 15) to the empty string. I believe that these are used as markers to shift to other character encodings within a string.

    (This doesn't happen in other ISO 2022 based codecs.)

I'd sort of hoped to find at least one codec that did something strange like mapping a single byte to several Unicode codepoints, but no such luck.

(In addition to the above checks, what I looked for was a byte expanding to multiple Unicode codepoints, and a byte to Unicode codepoint either not mapping back, mapping to nothing, or mapping back to multiple bytes.)
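
As a rough illustration of the idea (this is a sketch, not the actual code), the per-byte check looks something like this:

def oddities(cname):
    # sketch: walk every byte through the codec and note anything that
    # doesn't decode to exactly one codepoint and encode back unchanged
    odd = []
    for i in range(256):
        b = chr(i)
        try:
            u = b.decode(cname)
        except UnicodeError:
            continue                    # byte simply isn't in this encoding
        if len(u) != 1:
            odd.append((i, 'decodes to %d codepoints' % len(u)))
            continue
        try:
            b2 = u.encode(cname)
        except UnicodeError:
            odd.append((i, 'does not map back'))
            continue
        if b2 != b:
            odd.append((i, 'maps back to %r' % b2))
    return odd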

RoundtripCodecTesting written at 00:11:21

2010-10-20

Round-trip capable character encodings in Python

Suppose (mostly hypothetically) that you have some bytes in an unknown encoding (or possibly in no encoding because they represent some binary data), but you need to pass them through some routine that only accepts Unicode strings. What you need is a character encoding that can round-trip arbitrary byte values through Unicode, one that can uniquely represent and reproduce all byte values from 0 to 255.

(This is not true of all character encodings. The 'ascii' codec is only defined on some bytes, for example, and a number of multi-byte encodings, most visibly UTF-8, only accept a subset of the possible byte sequences.)

For pure round-tripping we don't really care how the byte values are represented as Unicode code points, but it's often convenient if the mapping represents printable ASCII characters as their Unicode equivalent. Most convenient is a codec that represents each byte as the same 'byte' (ordinal codepoint) in Unicode, because it simplifies doing any debugging involving the Unicode string version; you can easily map between the bytes and the Unicode string. Let us call these two options a perfect mapping and a good mapping.

Somewhat to my surprise, Python has quite a number of character encodings that support round tripping and the vast majority of them are good mappings. There is only one perfect mapping, which is latin1.

(Since there's a perfect mapping I'm not going to bother listing all of the other ones.)
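
As a quick Python 2 sketch of what 'perfect' means here in concrete terms:

allbytes = ''.join(chr(i) for i in range(256))
asuni = allbytes.decode('latin1')
# every byte decodes to the codepoint with the same ordinal value...
assert [ord(c) for c in asuni] == range(256)
# ...and encodes back to exactly the same bytes
assert asuni.encode('latin1') == allbytes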

This was a bit of a surprise to me. Before I did this exercise I naively expected latin1 to be the only round-trippable character encoding (and I didn't expect it to be a perfect mapping), and I certainly didn't expect most of the other ones to be 'good' (ASCII-preserving) mappings. In retrospect this was silly; in the era of lots of 256-character encodings, they usually used all of the byte values and the printable ASCII characters were common to a lot of them for relatively obvious reasons.

On a side note, one of the surprising things that I discovered in going through this is that Python has no way of introspecting what codecs are available in a given Python install. I wound up copying the list from the codecs module documentation, which I sort of hope is generated automatically through some build system magic.

(That lack of introspection is part of why I am not putting my quick Python code that checks these codec properties into a sidebar; it's not really self-contained without a convenient list of codecs.)

RoundtripCodec written at 00:54:38

2010-10-18

A module that I want in the standard library: Parsing

Given what I've written before, it will probably not surprise you that I have a little list of modules that I think should be in the standard library. The first one of them is a decent parsing or parser generator module.

There are two reasons in combination for this. First off, various sorts of parsing are an important part of the work that a decent number of programs do. Full-bore language parsers are somewhat uncommon, but (at least in my work) it's reasonably common to be parsing (or at least looking at) less complex things, including the output of other programs. The Python standard library currently has a number of specialized parsers (for XML, for certain standard sorts of configuration files, and so on), but it has no general parser framework.

But wait, you might be saying, Python has a general parser framework in the form of regular expressions. That is the second reason that a real parser module is needed; to quote Jamie Zawinski, using regular expressions for parsing means that you now have two problems. Trying to do parsing with regular expressions is the classic example of using a bad solution to the problem. Regular expressions do not help you much if you want to do a real job of parsing something, but they make it very easy to assemble an incomplete, difficult to maintain, and fragile 'parser' that is likely to mishandle any input that it doesn't expect (and there have been lots of demonstrations of this). But as long as the standard library does have a regular expression module and doesn't have a parser module, people will continue building parsers using regular expressions and then blowing their feet off.

(Regular expressions are useful for lexing, but that is only a part of parsing things. My ideal parsing module would also have good support for building lexers, possibly integrated into the grammar specification.)

A good parsing module for the standard library would make it easier to build a real parser than to parse things with regular expressions, thereby encouraging people to solve their problems in the good, right way. (I would suggest that things like good error checking and error recovery would also be attractive, but people who are happy with regular expression based parsers are unlikely to be missing them now.)

Python has a number of (third party) parsing modules, none of which I've tried out. Ned Batchelder's comprehensive index and summary is the best resource on this that I know of.

(Since I have tried none of the existing modules, I have no opinion on which one should be in the standard library, if any. I just want some decent parsing module to be in there.)

StandardParsing written at 22:00:42

