Python's dangerous automatic Unicode conversions

August 29, 2005

I'm a little bit late to the Unicode in Python debate party; Ian Bicking and others were writing about it back in early August, for example here and here. By now, lots of people around the Python blog community have weighed in on the high-level issues. My grumbles about Python's Unicode support are more low-level, like the problem with Unicode in hashes that Ian Bicking found.

Specifically, I think that Python is too helpful with automatic conversions between Unicode strings and byte strings, and it would be easier to get fully reliable programs if there were fewer such automatic conversions.

The problem with Python's automatic conversions now is that they work most of the time but not all of the time. This in turn creates programs with subtle, hard-to-find bugs that may not crop up for some time, especially because some of the bugs are consequences of sensible or necessary decisions made in the standard library.

(All of the following examples are drawn from real code that people approached the #python IRC channel for help with.)

Our first contestant is a classic:

import os
flist = os.listdir(u".")  # Unicode path in, (mostly) Unicode names out
flist.sort()

If you give os.listdir a Unicode string as the directory to list, it tries to return a list of Unicode strings. However, if there is a filename in the directory that cannot be decoded into Unicode, it is returned as-is as a bytestring. Calling .sort() on such a mixed list makes Python try to convert the bytestring into Unicode so that it can be compared with its neighbours, and this conversion fails.

Result: our friend UnicodeDecodeError. But only if the directory has a mix of (usually) plain ASCII and encoded filenames. All plain ASCII or all encoded, and you're fine.
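One way to avoid the mixed list is to convert every name to a single type yourself before sorting. The following is a minimal sketch, not a standard library facility: the helper names to_text and sorted_names are made up for illustration, and the choice of UTF-8 with "replace" error handling is an assumption you would adjust for your own filesystem. (It relies on the bytes name, which older Pythons spell str.)

```python
def to_text(name, encoding="utf-8"):
    """Return name as a Unicode string, decoding bytestrings explicitly.

    The encoding is an assumption; "replace" turns undecodable bytes
    into U+FFFD instead of raising UnicodeDecodeError.
    """
    if isinstance(name, bytes):
        return name.decode(encoding, "replace")
    return name


def sorted_names(names, encoding="utf-8"):
    """Sort a possibly mixed list of Unicode strings and bytestrings."""
    # Every element is Unicode by the time the sort compares anything,
    # so no implicit conversion can happen during the comparison.
    return sorted(to_text(n, encoding) for n in names)
```

You would call this as sorted_names(os.listdir(u".")). The price is that an undecodable filename no longer round-trips back to the real name on disk, so this only suits code that displays names rather than reopening them.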

The second example cropped up in someone's web site package, roughly as:

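from time import localtime, strftime

headers = []
headers.append(u"Something: Here.")  # a Unicode string
...
# strftime returns a bytestring in the locale's encoding
tstr = strftime("Last modified: %c %Z", localtime())
headers.append(tstr)
# joining mixed types forces an implicit decode of tstr
print "\r\n".join(headers)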

The problem: in some locales, the names of some days of the week and months include non-ASCII characters, and strftime hands them back in a bytestring. Of course when you .join Unicode strings and bytestrings, the entire result has to be up-converted to Unicode, and the up-conversion fails in this case. Result: UnicodeDecodeError every so often, when a day (or a month) has a non-ASCII character.
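A sketch of the explicit alternative: decode the strftime result at the point where it enters the program, so that the join only ever sees Unicode. The helper names decode_at_boundary and build_headers are hypothetical, and using locale.getpreferredencoding() as the encoding of strftime's output is an assumption that may not hold on every platform.

```python
import locale
from time import localtime, strftime


def decode_at_boundary(value, encoding):
    """Turn a bytestring into Unicode explicitly; pass Unicode through.

    The decode is strict on purpose: a wrong encoding guess fails here,
    next to the data's source, rather than later inside a join.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def build_headers():
    # Assumption: strftime's output uses the locale's preferred encoding.
    encoding = locale.getpreferredencoding()
    headers = [u"Something: Here."]
    tstr = strftime("Last modified: %c %Z", localtime())
    headers.append(decode_at_boundary(tstr, encoding))
    # Every element is Unicode now, so join performs no hidden decoding.
    return u"\r\n".join(headers)
```

The result still has to be encoded once, explicitly, when it is written out, for example with build_headers().encode("utf-8").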

In both cases, Python's automatic conversion 'worked' in an ASCII environment, shielding the creators of the software from having to think about character set issues. But as we can see in these examples, you absolutely do have to think about character set issues; you cannot brush them under the table, because sooner or later they won't fit.

(This is not a novel idea. Various people's strong advice is to always consider encoding issues and to put up fairly strong walls between raw byte data and Unicode. DWiki deals with the whole issue by always using raw byte data, never actually creating any Unicode strings, and letting the user tell DWiki what character set all of the byte data should be claimed to be in when talking to outside parties.)
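To make the bytes-only approach concrete, here is a rough sketch. It is not DWiki's actual code: build_response, the header layout, and the default charset are all invented for illustration. The idea is that nothing is ever decoded, and the character set is just a label attached on the way out.

```python
def build_response(body, charset=b"utf-8"):
    """Assemble a response from raw bytes, declaring (not converting) the charset.

    Hypothetical helper: both arguments must already be bytestrings.
    """
    for part in (body, charset):
        if not isinstance(part, bytes):
            # Refuse Unicode outright rather than convert it silently.
            raise TypeError("expected a bytestring, got %r" % type(part))
    headers = [b"Content-Type: text/html; charset=" + charset]
    # A blank line separates the headers from the body.
    return b"\r\n".join(headers + [b"", body])
```

Because a Unicode string is rejected with a TypeError at the door, a stray one shows up immediately in testing instead of only when non-ASCII data arrives.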
