Why not doing Unicode is easier than doing Unicode in Python

November 12, 2008

Okay, I have a confession: I continue to write Python programs that deal with text (to some degree) but which are not Unicode-aware in the approved manner. One of the reasons that I do this is that I can get away with it, but another reason is that not doing Unicode is easier than doing Unicode.

I don't mean this in the sense that there's less code if you don't do Unicode. I mean that doing Unicode confronts you with more decisions. When you do Unicode, you must convert between Unicode and encoded strings, which means that you must decide what to do when a conversion to or from Unicode fails. If you do not do Unicode, if you just slop plain strings around, at least your program will not explode, or mangle input (too much), or sprinkle ?'s all over things.
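To make the decision concrete, here is a minimal sketch, assuming Python 3 (where encoded strings are `bytes`) and an invented sample input. It shows the three standard error policies applied to the same badly encoded bytes; the helper name `try_decode` is hypothetical.

```python
# Latin-1 encoded bytes; the 0xE9 byte is not valid UTF-8 here.
raw = b"caf\xe9 au lait"

def try_decode(data, errors):
    """Decode as UTF-8 with the given error policy, reporting failures."""
    try:
        return data.decode("utf-8", errors)
    except UnicodeDecodeError as exc:
        return "error: %s" % exc.reason

# 'strict' explodes, 'replace' sprinkles U+FFFD, 'ignore' silently drops input.
results = {policy: try_decode(raw, policy)
           for policy in ("strict", "replace", "ignore")}

# Going the other way, 'replace' is where the ?'s come from.
question_marks = "caf\u00e9".encode("ascii", "replace")

# Not decoding at all leaves the bytes exactly as they arrived.
passed_through = raw
```

None of the three decoding outcomes is obviously right, which is the decision that plain strings let you avoid.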

(Having to make these decisions is especially irritating if you are just passing the text through unaltered, which is a common case for me. And yes, in theory you'll only be dealing with well-encoded text and you can ignore all of this, but I am cautious enough to expect that sooner or later my programs will be fed text that is in a different encoding than the system locale or is otherwise mangled.)

It does not help that it is relatively hard to control what Python does when a Unicode error happens. For example, there is no handy 'call this function to handle unconvertible things' argument for .encode(); its errors argument only takes the name of an error handler. While you can register your own error handler function under a name with the codecs module, the interfaces involved make it pretty clear that this is not intended as a simple, casual thing.
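For a sense of what registering a handler involves, here is a rough sketch, assuming Python 3. The handler name "hexmarker" and the function `hex_marker` are made up for illustration; `codecs.register_error` and the exception attributes (`object`, `start`, `end`) are the standard interface.

```python
import codecs

def hex_marker(exc):
    """Replace each unencodable character with a visible <U+XXXX> marker."""
    if isinstance(exc, UnicodeEncodeError):
        bad = exc.object[exc.start:exc.end]
        replacement = "".join("<U+%04X>" % ord(ch) for ch in bad)
        # Return the replacement text and the position to resume encoding at.
        return replacement, exc.end
    # This sketch only handles encoding errors; re-raise anything else.
    raise exc

# Handlers are registered globally under a name, then selected by that name.
codecs.register_error("hexmarker", hex_marker)

encoded = "na\u00efve caf\u00e9".encode("ascii", "hexmarker")
```

The handler has to inspect an exception object and hand back a (replacement, resume position) tuple, and it lives in a process-wide registry, which is a fair amount of ceremony for a one-off conversion.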

(There are decent standard options for converting from Unicode back to encoded strings, in the "xmlcharrefreplace" and "backslashreplace" error handlers, but the standard options for converting from random mangled text to Unicode seem sparser and less satisfying; I would at least like to know what was replaced.)
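As a sketch of both halves of that complaint, assuming Python 3: the first part shows what the two standard encoding handlers produce, and the second is one possible way to find out what was replaced when decoding. The handler name "record-replace" and the `replaced` list are hypothetical, not anything the standard library provides.

```python
import codecs

text = "caf\u00e9 \u2603"

# Both standard handlers keep the information in a recoverable form.
as_xml = text.encode("ascii", "xmlcharrefreplace")
as_backslash = text.encode("ascii", "backslashreplace")

# Decoding: remember each bad byte sequence and where it was found.
replaced = []

def record_and_replace(exc):
    """Like 'replace', but log the offset and the undecodable bytes."""
    if isinstance(exc, UnicodeDecodeError):
        replaced.append((exc.start, exc.object[exc.start:exc.end]))
        return "\ufffd", exc.end
    raise exc

codecs.register_error("record-replace", record_and_replace)

decoded = b"caf\xe9 au lait".decode("utf-8", "record-replace")
```

Because the log is a module-level list, this is only suitable for simple single-threaded use; anything more careful needs more machinery still.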
