Why not doing Unicode is easier than doing Unicode in Python
Okay, I have a confession: I continue to write Python programs that deal with text (to some degree) but which are not Unicode-aware in the approved manner. One of the reasons that I do this is that I can get away with it, but another reason is that not doing Unicode is easier than doing Unicode.
I don't mean this in the sense that there's less code if you don't do Unicode. I mean that doing Unicode confronts you with more decisions. When you do Unicode, you must convert between Unicode and encoded strings, which means that you must decide what to do when a conversion to or from Unicode fails. If you do not do Unicode, if you just slop plain strings around, at least your program will not explode, or mangle input (too much), or sprinkle ?'s all over things.
(Having to make these decisions is especially irritating if you are just passing the text through unaltered, which is a common case for me. And yes, in theory you'll only be dealing with well-encoded text and you can ignore all of this, but I am cautious enough to expect that sooner or later my programs will be fed text that is in a different encoding than the system locale or is otherwise mangled.)
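To make those decisions concrete, here is a minimal sketch of what each choice does to some mis-encoded input. It assumes Python 3, where the encoded form is `bytes`; the sample data and the `try_decode` helper are made up for illustration.

```python
# Latin-1 encoded bytes that are not valid UTF-8 (0xE9 is a lone lead byte).
raw = b"caf\xe9 menu"

def try_decode(data, errors):
    """Hypothetical helper: decode as UTF-8, reporting a failure as text."""
    try:
        return data.decode("utf-8", errors)
    except UnicodeDecodeError as exc:
        return "failed: %s" % exc.reason

strict = try_decode(raw, "strict")     # the program "explodes"
replaced = try_decode(raw, "replace")  # the bad byte becomes U+FFFD
ignored = try_decode(raw, "ignore")    # the bad byte silently vanishes

# Going the other way, "replace" is what sprinkles ?'s over the output.
question_marks = "caf\u00e9".encode("ascii", "replace")
```

None of the three outcomes preserves the original bytes, which is why passing plain strings through untouched is tempting.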
It does not help that it is relatively hard to control what Python does when a Unicode error happens. For example, there is no handy 'call this function to handle unconvertible things' argument for .encode(). While you can register an error handler function with the codecs module, the interfaces involved make it pretty clear that this is not intended as a simple, casual thing.
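For reference, registering a handler through the codecs module looks roughly like the sketch below. It assumes Python 3; the handler name `hexplaceholder` and the function `hex_placeholder` are invented for this example, and the `<U+XXXX>` output format is an arbitrary choice.

```python
import codecs

def hex_placeholder(exc):
    """Hypothetical handler: replace unencodable characters with <U+XXXX>."""
    if not isinstance(exc, UnicodeEncodeError):
        # This sketch only deals with encoding failures.
        raise exc
    # exc.object is the whole input; start/end bound the offending run.
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("<U+%04X>" % ord(ch) for ch in bad)
    # Return the replacement text and the position at which to resume.
    return replacement, exc.end

# Handlers are registered globally under a name, then selected by that name.
codecs.register_error("hexplaceholder", hex_placeholder)

encoded = "na\u00efve caf\u00e9".encode("ascii", "hexplaceholder")
```

The handler has to be registered under a process-wide name and has to inspect an exception object to find out what went wrong, which is a fair amount of ceremony compared to passing a function directly to `.encode()`.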
(There are decent standard options for converting from Unicode back to text, in the "xmlcharrefreplace" and "backslashreplace" conversion options, but the standard options for converting from random mangled text to Unicode seem sparser and less satisfying; I would at least like to know what was replaced.)
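As a rough illustration of both halves of this, the sketch below shows the two standard encoding options and then one possible way to find out what was replaced when decoding. It assumes Python 3; the `recordingreplace` handler name, the `recording_replace` function, and the module-level `replaced_log` list are all made up for the example, and a shared global list like this is not safe across threads.

```python
import codecs

# The two standard options keep the information in an ASCII-safe form.
as_xml = "caf\u00e9".encode("ascii", "xmlcharrefreplace")
as_backslash = "caf\u00e9".encode("ascii", "backslashreplace")

# Hypothetical log of (offset, bad bytes) pairs seen while decoding.
replaced_log = []

def recording_replace(exc):
    """Hypothetical handler: like 'replace', but remember what was dropped."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replaced_log.append((exc.start, bytes(bad)))
    # Substitute U+FFFD and resume decoding after the bad bytes.
    return "\ufffd", exc.end

codecs.register_error("recordingreplace", recording_replace)

text = b"caf\xe9 \xff!".decode("utf-8", "recordingreplace")
```

After the decode, `replaced_log` holds the offsets and raw bytes that were turned into U+FFFD, so the caller can at least report or inspect them.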