The problems with testing for round-trip capable codecs

October 21, 2010

In theory, testing a Python codec to see if it's round-trip capable is pretty simple:

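def willtrip(cname):
    # Python 2: chr(x) is a one-byte str, so c1 holds every possible byte.
    c1 = [chr(x) for x in range(0, 256)]
    try:
        # bytes -> Unicode, then Unicode -> bytes again
        uc = [x.decode(cname) for x in c1]
        c2 = [x.encode(cname) for x in uc]
    except UnicodeError:
        # some byte has no mapping in one direction or the other
        return False
    return c1 == c2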

(Since it's been a while, I had to consult my notes to remember which of .decode() and .encode() I wanted to use when.)
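The code above is written for Python 2, where chr(x) gives you a one-byte string. As a rough sketch of the same check for Python 3, where bytes and str are separate types, something like the following should work. The name willtrip3 is made up for this illustration, and the codec is assumed to be an ordinary bytes-to-text one.

```python
def willtrip3(cname):
    """Return True if every single byte survives bytes -> str -> bytes."""
    # One-byte bytes objects for every value 0..255.
    c1 = [bytes([x]) for x in range(256)]
    try:
        uc = [b.decode(cname) for b in c1]   # bytes -> str
        c2 = [u.encode(cname) for u in uc]   # str -> bytes
    except UnicodeError:
        # Some byte failed to decode, or some code point failed to encode.
        return False
    return c1 == c2


if __name__ == "__main__":
    for name in ("latin-1", "utf-8", "ascii"):
        print(name, willtrip3(name))
```

Here latin-1 passes, while utf-8 and ascii fail because bytes 128 and up do not decode on their own.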

When I started to think about this code more, I realized that it had a flaw; it assumed that the codec was relatively sane, or specifically that the codec was what I'll call a 'context-independent conversion'.

You can only use the results of willtrip() if you assume that the results of converting single bytes back and forth are the same as converting multi-byte strings. This is only true if two things are the case. First, that the codec always errors out if it sees an incomplete multi-byte sequence, and second, that it always converts the resulting Unicode code points back to the same byte values regardless of the surrounding Unicode code points.

(Well, regardless of the surrounding Unicode code points assuming that all of them are from upconverting bytes. But that's an assumption you have to make anyways when you're round-tripping stuff this way; you can never assume that inserting other Unicode codepoints is harmless.)
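One way to gather some evidence about context independence is to compare whole-string conversions against what the single-byte results predict. The sketch below is written for Python 3 and does this for randomly chosen byte strings; the function name, the trial count, and the length limit are all arbitrary choices made for illustration. Passing it is only evidence, not proof, since it samples a tiny part of the input space.

```python
import random


def context_independent(cname, trials=500, maxlen=8, seed=0):
    """Spot-check that multi-byte strings convert like their single bytes."""
    try:
        # What each byte decodes to when it is converted on its own.
        table = [bytes([x]).decode(cname) for x in range(256)]
    except UnicodeError:
        return False  # the codec already fails the single-byte test

    rng = random.Random(seed)
    for _ in range(trials):
        length = rng.randint(2, maxlen)
        data = bytes(rng.randrange(256) for _ in range(length))
        expected = "".join(table[b] for b in data)
        try:
            # Decoding the whole string must match the per-byte results...
            if data.decode(cname) != expected:
                return False
            # ...and encoding the result must give back the original bytes.
            if expected.encode(cname) != data:
                return False
        except UnicodeError:
            return False
    return True


if __name__ == "__main__":
    for name in ("latin-1", "utf-8"):
        print(name, context_independent(name))
```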

As far as I can tell, the codecs module does not formally require that codecs have either of these properties, so in theory you could have a sufficiently perverse character encoding and Python codec for it. In practice I believe that all of the round-trip capable codecs are context independent in this way.

(I believe that some non-round-trippable codecs are in fact context dependent this way, for example the ISO 2022 series.)
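As a small illustration of what context dependence looks like, the Python 3 sketch below encodes two Japanese characters with iso2022_jp, once as a single string and once character by character. The choice of codec and test character is mine, not something from the original experiment. Because the encoding switches character sets with escape sequences, the two byte strings come out different even though both decode to the same text.

```python
# Two copies of HIRAGANA LETTER A, a character outside ASCII.
text = "\u3042\u3042"

# Encode the string in one go.
whole = text.encode("iso2022_jp")

# Encode each character separately and glue the results together.
pieces = b"".join(ch.encode("iso2022_jp") for ch in text)

print(repr(whole))
print(repr(pieces))
print("same bytes:", whole == pieces)
print("same text: ", whole.decode("iso2022_jp") == pieces.decode("iso2022_jp"))
```

The bytes produced for a character depend on what came before it, which is exactly what the single-character test cannot see.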

Sidebar: more on codec sanity

Out of curiosity, I put together some code to look for odd results in codec transformations. The results were disappointingly boring and sensible, with only a few odd things turning up:

  • a couple of codecs had non-reversible transformations, where a byte -> Unicode -> byte round trip mapped back to a different byte. cp875 is the big example; seven different bytes are all mapped to U+1A.

    (This appears to be because all of them are unused in cp875.)

  • iso2022_kr decodes two bytes (14 and 15) to the empty string. I believe that these are used as markers to shift to other character encodings within a string.

    (This doesn't happen in other ISO 2022 based codecs.)

I'd sort of hoped to find at least one codec that did something strange like mapping a single byte to several Unicode codepoints, but no such luck.

(In addition to the above checks, what I looked for was a byte expanding to multiple Unicode codepoints, and a byte to Unicode codepoint either not mapping back, mapping to nothing, or mapping back to multiple bytes.)
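The checks described in this sidebar could be sketched roughly as follows for Python 3. This is not the code I actually ran; the function name and the report categories are invented for this illustration, and it lumps every kind of failed reverse mapping into one bucket.

```python
def scan_codec(cname):
    """Classify each single byte by how it behaves under the codec."""
    report = {
        "undecodable": [],   # byte does not decode at all
        "empty": [],         # byte decodes to the empty string
        "multi": [],         # byte decodes to several code points
        "not_reversed": [],  # decoded text does not encode back to the byte
    }
    for x in range(256):
        b = bytes([x])
        try:
            u = b.decode(cname)
        except UnicodeError:
            report["undecodable"].append(x)
            continue
        if len(u) == 0:
            report["empty"].append(x)
            continue
        if len(u) > 1:
            report["multi"].append(x)
        try:
            back = u.encode(cname)
        except UnicodeError:
            back = None  # the code point(s) cannot be encoded at all
        if back != b:
            report["not_reversed"].append(x)
    return report


if __name__ == "__main__":
    for name in ("latin-1", "utf-8", "cp875", "iso2022_kr"):
        result = scan_codec(name)
        print(name, {k: len(v) for k, v in result.items()})
```

The exact counts for codecs like cp875 and iso2022_kr may differ between Python versions, since the mapping tables have been revised over time.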

Written on 21 October 2010.