Wandering Thoughts archives


Round-trip capable character encodings in Python

Suppose (mostly hypothetically) that you have some bytes in an unknown encoding (or possibly in no encoding because they represent some binary data), but you need to pass them through some routine that only accepts Unicode strings. What you need is a character encoding that can round-trip arbitrary byte values through Unicode, one that can uniquely represent and reproduce all byte values from 0 to 255.

(This is not true of all character encodings. The 'ascii' codec is only defined on some bytes, for example, and a number of multi-byte encodings, most visibly UTF-8, only accept a subset of the possible byte sequences.)
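As a small illustration of that, here is a sketch that checks whether some bytes decode at all under a given codec. The helper name `can_decode` is made up for this example; it only wraps the standard `bytes.decode()`.

```python
def can_decode(data, codec):
    """Return True if `data` decodes cleanly under `codec` (hypothetical helper)."""
    try:
        data.decode(codec)
    except UnicodeDecodeError:
        return False
    return True

# 'ascii' is only defined for byte values 0 through 127.
print(can_decode(b"abc", "ascii"))
print(can_decode(b"\x80", "ascii"))

# UTF-8 accepts well-formed multi-byte sequences but not every byte sequence.
print(can_decode(b"\xc3\xa9", "utf-8"))
print(can_decode(b"\xff", "utf-8"))
```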

For pure round-tripping we don't really care how the byte values are represented as Unicode code points, but it's often convenient if the mapping represents printable ASCII characters as their Unicode equivalents. Most convenient is a codec that represents each byte as the Unicode code point with the same ordinal value, because it simplifies any debugging involving the Unicode string version; you can easily map between the bytes and the Unicode string. Let us call the first option a good mapping and the second a perfect mapping.
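To pin the terms down, here is a rough sketch of how one might test a codec, taking 'good' to mean ASCII-preserving and 'perfect' to mean that every byte becomes the code point with the same number. This is not my original checking code; the names `decode_table` and `classify` are invented for the example, and it only looks at single bytes in isolation, which is a reasonable test for single-byte codecs but not a proof about stateful or multi-byte ones.

```python
def decode_table(codec):
    """Return the 256 characters the codec assigns to bytes 0..255.

    Returns None if any single byte fails to decode to exactly one
    character, or if that character does not encode back to the same byte.
    """
    table = []
    for b in range(256):
        raw = bytes([b])
        try:
            ch = raw.decode(codec)
            if len(ch) != 1 or ch.encode(codec) != raw:
                return None
        except (UnicodeError, LookupError):
            # LookupError covers unknown names and non-text codecs.
            return None
        table.append(ch)
    return table

def classify(codec):
    """Label a codec 'perfect', 'good', 'round trip only' or 'no round trip'."""
    table = decode_table(codec)
    if table is None:
        return "no round trip"
    if all(ord(ch) == b for b, ch in enumerate(table)):
        return "perfect"
    # Printable ASCII is byte values 0x20 through 0x7e.
    if all(ord(table[b]) == b for b in range(0x20, 0x7F)):
        return "good"
    return "round trip only"

# A short hand-picked list, since there is no convenient full list to loop over.
for name in ("latin1", "cp437", "cp1252", "ascii", "utf-8"):
    print(name, classify(name))
```

With this sketch I would expect latin1 to come out as perfect, cp437 as good, and cp1252, ascii and utf-8 as not round-tripping (cp1252 leaves a few byte values undefined).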

Somewhat to my surprise, Python has quite a number of character encodings that support round tripping and the vast majority of them are good mappings. There is only one perfect mapping, which is latin1.

(Since there's a perfect mapping I'm not going to bother listing all of the other ones.)
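In practice, using latin1 this way looks something like the following sketch. The wrapper names `to_text` and `to_bytes` are hypothetical; the only real API involved is `bytes.decode()` and `str.encode()`.

```python
def to_text(data):
    """Smuggle arbitrary bytes through a str, one code point per byte."""
    return data.decode("latin1")

def to_bytes(text):
    """Recover the original bytes from a str produced by to_text()."""
    return text.encode("latin1")

all_bytes = bytes(range(256))
text = to_text(all_bytes)

# Every byte survives the trip through Unicode...
assert to_bytes(text) == all_bytes
# ...and each character's code point equals the original byte value.
assert [ord(ch) for ch in text] == list(range(256))
```

Note that `to_bytes` only works on strings whose code points are all below 256; if the Unicode-only routine adds other characters, the encode step will raise UnicodeEncodeError.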

This was a bit of a surprise to me. Before I did this exercise I naively expected latin1 to be the only round-trippable character encoding (and I didn't expect it to be a perfect mapping), and I certainly didn't expect most of the other ones to be 'good' (ASCII-preserving) mappings. In retrospect this was silly; in the era of lots of 256-character encodings, they usually used all of the byte values and the printable ASCII characters were common to a lot of them for relatively obvious reasons.

On a side note, one of the surprising things that I discovered in going through this is that Python has no way of introspecting what codecs are available in a given Python install. I wound up copying the list from the codecs module documentation, which I sort of hope is generated automatically through some build system magic.

(That lack of introspection is part of why I am not putting my quick Python code that checks these codec properties into a sidebar; it's not really self-contained without a convenient list of codecs.)
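What you can do is probe for a specific codec by name, which is not the same as listing them all. A minimal sketch, assuming you already have candidate names from somewhere such as the documentation (`has_codec` is a made-up helper around the real `codecs.lookup()`):

```python
import codecs

def has_codec(name):
    """Return True if this Python install can look up a codec called `name`."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True

print(has_codec("latin1"))
print(has_codec("no_such_codec_xyz"))
```

This tells you whether one name resolves, but it still leaves you needing a list of names to try.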

python/RoundtripCodec written at 00:54:38


This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.