My general issue with Unicode in Python 3
July 15, 2012
I've written a number of things that amount to grumbling that Python 3 forces Unicode handling down people's throats, where before I didn't need to deal with encoding issues. To start with, it's worth explaining why I could say this with a straight face.
I could get away with this ignorance because my programs almost invariably work in a particular way. What they process is a mixture of ASCII (for keywords, directives, and so on, all of the things the program has to interpret) and uninterpreted bytestrings, which are simply regurgitated to the user as-is in appropriate situations. Since the bytestrings are repeated verbatim, my code doesn't need to know or care what encoding they're in; in fact, attempting to decode the bytestrings to Unicode and then re-encode them for output introduces two new failure points.
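As a minimal sketch of this pattern in Python 3 terms (the `echo` and `greet` directives and the `handle_line` function are made up for illustration, not taken from any real program of mine): only the ASCII keyword is interpreted, and the rest of the line is passed through as raw bytes without ever being decoded.

```python
def handle_line(line: bytes) -> bytes:
    """Interpret only the leading ASCII keyword; pass the payload through untouched."""
    keyword, _, payload = line.rstrip(b"\n").partition(b" ")
    if keyword == b"echo":
        # Hypothetical directive: repeat the payload verbatim.
        return payload + b"\n"
    if keyword == b"greet":
        # Hypothetical directive: wrap the payload in some ASCII text.
        return b"hello, " + payload + b"\n"
    raise ValueError("unknown directive: %r" % keyword)


# The payload here is Latin-1 encoded, but the code never needs to know that.
raw = b"greet Ren\xe9\n"
out = handle_line(raw)
```

Whatever encoding the payload bytes were in on input, they are in the same encoding on output, because nothing ever touched them.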
(Related to this is the pattern where a Unix program doesn't care what encoding its command line arguments are in because once again they're being used as uninterpreted bytestrings for, e.g., filenames. Forcing the program or runtime environment to decode these to Unicode then adds heartburn and a potential failure point for no gain.)
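To illustrate what this looks like in practice, here is a small sketch, assuming a Unix system and Python 3.2 or later. There, `sys.argv` arrives already decoded (using the `surrogateescape` error handler for bytes that don't decode), and `os.fsencode()` reverses that decoding. The helper name `argv_as_bytes` is my own invention for this example.

```python
import os


def argv_as_bytes(argv):
    """Recover the original byte strings behind already-decoded arguments.

    Assumes Unix, where os.fsencode() undoes the surrogateescape decoding
    that Python 3 applies to command line arguments.
    """
    return [os.fsencode(arg) for arg in argv]
```

A program would call this as `argv_as_bytes(sys.argv[1:])` and then hand the resulting bytes straight to things like `open()`, which accepts byte-string filenames on Unix.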
The writing and attitudes around early versions of Python 3 made it clear that you weren't supposed to do this any more. The new Pythonic way to operate on 'strings' was to decode them to Unicode immediately and then work in Unicode, re-encoding on output. Where operating on plain un-decoded strings was even possible it was made at least somewhat annoying, partly by limiting how much you could do with such strings. Current versions of Python 3 seem to have relaxed a little bit but all sorts of things still push you in the 'decode immediately' direction.
All of this change in focus to 'decode immediately' was (and for that matter still is) irritating to me. If you decode, you have to deal with encoding issues, which means that now my programs could blow up parsing their configuration files or the like. This struck me as a lot like the experience of strict error handling in XHTML (where if anything anywhere went wrong you got nothing).
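Here is a rough sketch of the difference, using a made-up `key = value` configuration format and assuming UTF-8 as the encoding that a decode-first program would pick. Both function names are hypothetical.

```python
def parse_decoded(data: bytes) -> dict:
    """Decode-first style: the entire input must be valid UTF-8 or nothing parses."""
    settings = {}
    for line in data.decode("utf-8").splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def parse_bytes(data: bytes) -> dict:
    """Bytes style: only the ASCII keys are interpreted; values stay raw bytes."""
    settings = {}
    for line in data.splitlines():
        key, sep, value = line.partition(b"=")
        if sep:
            settings[key.strip().decode("ascii")] = value.strip()
    return settings


# One value contains a Latin-1 byte, so the file as a whole is not valid UTF-8.
config = b"user = cks\nbanner = caf\xe9 open\n"

settings = parse_bytes(config)  # works; the banner value stays as raw bytes

try:
    parse_decoded(config)
    decode_failed = False
except UnicodeDecodeError:
    # A single stray byte means no settings at all come back.
    decode_failed = True
```

The bytes version happily returns both settings, while the decode-first version gets nothing out of the file, not even the perfectly good `user` line.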
(Forced decoding turns out to cause all sorts of bad problems on Unix, because fundamental Unix interfaces are byte-string interfaces.)
What I've written here may sound seductive and reasonable but there's a gotcha to it, an inobvious bit of arrogance and blindness that I've only slowly woken up to recently. Since this entry is already long enough, that's a topic for the next entry.
Sidebar: about encoding mistakes in those uninterpreted bytestrings
My view of potential encoding mistakes in those uninterpreted bytestrings is a pragmatic one: if I'm just repeating them verbatim, any garbage output that results is not my problem and not my fault. My program is simply doing what you told it to do.