The problems with testing for round-trip capable codecs
In theory, testing a Python codec to see if it's round-trip capable is pretty simple:
    def willtrip(cname):
        c1 = [chr(x) for x in range(0, 256)]
        try:
            uc = [x.decode(cname) for x in c1]
            c2 = [x.encode(cname) for x in uc]
        except UnicodeError:
            return False
        return c1 == c2
(Since it's been a while, I had to consult my notes to remember which of .decode() and .encode() I wanted to use when.)
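The function above relies on Python 2 byte strings, where chr(x) gives a one-byte str that has a .decode() method. For anyone on Python 3, a rough equivalent might look like the sketch below. The name willtrip3 is made up for illustration, and the code assumes cname names a text codec (bytes to str); it is a port of the idea, not the code I originally ran.

```python
def willtrip3(cname):
    """Python 3 sketch: do all 256 single bytes survive bytes -> str -> bytes?"""
    # In Python 3, bytes([x]) builds the one-byte string that chr(x) gave in Python 2.
    c1 = [bytes([x]) for x in range(256)]
    try:
        # bytes.decode() goes up to Unicode; str.encode() comes back down.
        uc = [b.decode(cname) for b in c1]
        c2 = [u.encode(cname) for u in uc]
    except UnicodeError:
        # Some byte could not be decoded, or some code point could not be encoded.
        return False
    return c1 == c2


if __name__ == "__main__":
    for name in ("latin-1", "ascii", "utf-8"):
        print(name, willtrip3(name))
```

The direction is the part that is easy to forget: .decode() turns bytes into Unicode, and .encode() turns Unicode back into bytes.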
When I started to think about this code more, I realized that it had a flaw; it assumed that the codec was relatively sane, or specifically that the codec was what I'll call a 'context-independent conversion'.
You can only use the results of willtrip() if you assume that converting single bytes back and forth gives the same results as converting multi-byte strings. This is only true if two things hold. First, the codec must always raise an error when it sees an incomplete multi-byte sequence; second, it must always convert the resulting Unicode code points back to the same byte values regardless of the surrounding Unicode code points.
(Well, regardless of the surrounding Unicode code points assuming that all of them are from upconverting bytes. But that's an assumption you have to make anyways when you're round-tripping stuff this way; you can never assume that inserting other Unicode codepoints is harmless.)
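One way to gain a little more confidence in this assumption is to test every two-byte string against the single-byte results. The following is only a sketch in Python 3; the function name pairs_agree is hypothetical, and checking pairs obviously says nothing definite about longer strings.

```python
def pairs_agree(cname):
    """Sketch: do all two-byte strings convert the same way as their single bytes?"""
    try:
        # Decode each byte on its own; any failure means the codec is not
        # round-trip capable for single bytes in the first place.
        singles = [bytes([x]).decode(cname) for x in range(256)]
        for a in range(256):
            for b in range(256):
                pair = bytes([a, b])
                text = pair.decode(cname)
                # Decoding must not depend on the neighbouring byte.
                if text != singles[a] + singles[b]:
                    return False
                # Encoding must not depend on the neighbouring code point.
                if text.encode(cname) != pair:
                    return False
    except UnicodeError:
        return False
    return True


if __name__ == "__main__":
    for name in ("latin-1", "ascii", "utf-8"):
        print(name, pairs_agree(name))
```

This is 65,536 conversions per codec, which is cheap enough to run across every codec that passes the single-byte test.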
As far as I can tell, the codecs module does not formally require that codecs have either of these properties, so in theory you could have a sufficiently perverse character encoding, and a Python codec for it, that breaks them. In practice I believe that all of the round-trip capable codecs are context independent in this way.
(I believe that some non-round-trippable codecs are in fact context dependent this way, for example the ISO 2022 series.)
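To illustrate what context dependence looks like, here is a small Python 3 sketch using the iso2022_jp codec. The sample character (HIRAGANA LETTER A) is an arbitrary choice of mine, and decode_bytewise is a hypothetical helper; the point is only that the same bytes decode differently depending on the escape sequences around them.

```python
def decode_bytewise(data, cname):
    """Decode each byte in isolation and join the results (hypothetical helper)."""
    pieces = []
    for b in data:
        try:
            pieces.append(bytes([b]).decode(cname))
        except UnicodeError:
            # This byte cannot be decoded on its own; contribute nothing.
            pieces.append("")
    return "".join(pieces)


text = "\u3042"  # HIRAGANA LETTER A, chosen arbitrarily
encoded = text.encode("iso2022_jp")

# Decoding the whole byte string honours the escape sequences in it.
whole = encoded.decode("iso2022_jp")
# Decoding byte by byte loses that context.
bytewise = decode_bytewise(encoded, "iso2022_jp")

if __name__ == "__main__":
    print(repr(encoded))
    print(repr(whole))
    print(repr(bytewise))
```

Every byte in the encoded form is a 7-bit value, so without the surrounding escape sequences there is nothing to tell the decoder that some of those bytes are halves of a two-byte character.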
Sidebar: more on codec sanity
Out of curiosity, I put together some code to look for odd results in codec transformations. The results were disappointingly boring and sensible, with only a few odd things turning up:
- a couple of codecs had non-reversible transformations, where
a byte -> Unicode -> byte round trip mapped back to a different
byte. cp875 is the big example; seven different bytes are all
mapped to U+001A.
(This appears to be because all of them are unused in cp875.)
- iso2022_kr decodes two bytes (14 and 15) to the empty string.
I believe that these are used as markers to shift to other character
encodings within a string.
(This doesn't happen in other ISO 2022 based codecs.)
I'd sort of hoped to find at least one codec that did something strange like mapping a single byte to several Unicode codepoints, but no such luck.
(In addition to the above checks, what I looked for was a byte expanding to multiple Unicode codepoints, and a byte to Unicode codepoint either not mapping back, mapping to nothing, or mapping back to multiple bytes.)
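The checks described above can be sketched roughly as follows. This is a Python 3 reconstruction rather than the code I actually used; the function name scan_codec and the category names in the report are made up for illustration, and it assumes cname is a text codec.

```python
def scan_codec(cname):
    """Sketch: classify each of the 256 single bytes by how it round-trips."""
    report = {
        "undecodable": [],      # byte does not decode at all
        "empty_decode": [],     # byte decodes to the empty string
        "multi_codepoint": [],  # byte decodes to several code points
        "no_encode": [],        # decoded text cannot be encoded back
        "maps_to_nothing": [],  # decoded text encodes to zero bytes
        "maps_to_many": [],     # decoded text encodes to several bytes
        "different_byte": [],   # decoded text encodes to a different single byte
    }
    for x in range(256):
        b = bytes([x])
        try:
            u = b.decode(cname)
        except UnicodeError:
            report["undecodable"].append(x)
            continue
        if u == "":
            report["empty_decode"].append(x)
            continue
        if len(u) > 1:
            report["multi_codepoint"].append(x)
        try:
            back = u.encode(cname)
        except UnicodeError:
            report["no_encode"].append(x)
            continue
        if back == b:
            continue
        if len(back) == 0:
            report["maps_to_nothing"].append(x)
        elif len(back) > 1:
            report["maps_to_many"].append(x)
        else:
            report["different_byte"].append(x)
    return report


if __name__ == "__main__":
    for name in ("latin-1", "cp875", "iso2022_kr"):
        odd = {k: v for k, v in scan_codec(name).items() if v}
        print(name, odd)
```

Going by the results above, I would expect cp875 to show up under different_byte and iso2022_kr under empty_decode, but the exact output may vary between Python versions.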