2005-08-29
Python's dangerous automatic Unicode conversions
I'm a little bit late to the Unicode in Python debate party; Ian Bicking and others were writing about it back in early August, for example here and here. By now, lots of people around the Python blog community have weighed in on the high-level issues. My grumbles about Python's Unicode support are more low-level, like the problem with Unicode in hashes that Ian Bicking found.
Specifically, I think that Python is too helpful with automatic conversions between Unicode strings and byte strings, and it would be easier to get fully reliable programs if there were fewer such automatic conversions.
The problem with Python's automatic conversions now is that they work most of the time but not all of the time. This in turn creates programs with subtle, hard-to-find bugs that may not crop up for some time, especially because some of the bugs are consequences of sensible or necessary decisions made in the standard library.
(All of the following examples are drawn from real code that people approached the #python IRC channel for help with.)
Our first contestant is a classic:
import os
flist = os.listdir(u".")
flist.sort()
If you give os.listdir a Unicode string as the directory to list, it
tries to return a list of Unicode strings. However, if there is a
filename in the directory that cannot be decoded to Unicode, it is
returned as-is as a bytestring. .sort() on such a mixed list has to
compare the bytestring against Unicode strings, so it attempts to
convert the bytestring to Unicode, and this fails.
Result: our friend UnicodeDecodeError. But only if the directory has
a mix of (usually) plain ASCII and encoded filenames. All plain ASCII
or all encoded, and you're fine.
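One way to avoid the mixed list is to convert every name yourself before sorting. The following is a minimal sketch in present-day Python, not code from the original problem report: the to_text helper, the UTF-8 default and the sample names are all made up for illustration.

```python
def to_text(name, encoding="utf-8"):
    """Return name as a text string, decoding bytestrings explicitly."""
    if isinstance(name, bytes):
        # errors="replace" turns undecodable bytes into U+FFFD
        # instead of raising UnicodeDecodeError.
        return name.decode(encoding, errors="replace")
    return name

# Hypothetical listing: two decoded names plus one Latin-1 encoded
# name that is not valid UTF-8 and so came back as raw bytes.
mixed = [u"notes.txt", b"caf\xe9.txt", u"alpha.txt"]

# Every element is text by the time the sort compares anything.
flist = sorted(to_text(n) for n in mixed)
```

The price is that the undecodable name no longer matches what is on disk, so real code would probably want to keep the original bytestring around as well.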
The second example cropped up in someone's web site package, roughly as:
headers = []
headers.append(u"Something: Here.")
...
from time import localtime, strftime
tstr = strftime("Last modified: %c %Z", localtime())
headers.append(tstr)
print "\r\n".join(headers)
The problem: in some locales, the names of some days of the week and
months include non-ASCII characters. Of course when you .join
Unicode strings and bytestrings, the entire result has to be
up-converted to Unicode. Python does the up-conversion with its default
ASCII codec, so it fails on any bytestring with non-ASCII bytes in it.
Result: UnicodeDecodeError every so often, when a day (or a
month) has a non-ASCII character.
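The explicit alternative is to decode the locale-dependent bytes yourself, with a character set you have actually chosen, before they meet any Unicode strings. Here is a rough sketch in present-day Python. The bytestring stands in for what strftime might have returned in a French Latin-1 locale; it and both encodings are assumptions for illustration, not values from the original program.

```python
headers = [u"Something: Here."]

# Hypothetical strftime() result in a Latin-1 French locale:
# the month "août" carries the non-ASCII byte 0xFB.
tstr = b"Last modified: sam. 27 ao\xfbt 2005"

# Decode at the boundary with a known encoding, so the list
# only ever holds text strings.
headers.append(tstr.decode("latin-1"))

body = u"\r\n".join(headers)

# Encode once, explicitly, on the way out (UTF-8 is an assumption).
out = body.encode("utf-8")
```

Nothing here relies on an implicit conversion, so a non-ASCII month name behaves the same way as an ASCII one.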
In both cases, Python's automatic conversion 'worked' in an ASCII environment, shielding the creators of the software from having to think about character set issues. But as we can see in these examples, you absolutely do have to think about character set issues; you cannot brush them under the table, because sooner or later they won't fit.
(This is not a novel idea. Various people's strong advice is to always consider encoding issues and to put up fairly strong walls between raw byte data and Unicode. DWiki deals with the whole issue by always using raw byte data, never actually creating any Unicode strings, and letting the user tell DWiki what character set all of the byte data should be claimed to be in when talking to outside parties.)
2005-08-01
Multilevel list comprehensions in Python
Python has recently (at least for some values of recently) grown
'list comprehensions',
which let you easily iterate over a list to transform or select
entries (or both). List comprehensions can be thought of as syntactic
sugar for map and filter operations, but they're actually more
powerful.
One reason is that you can write a multilevel list comprehension, which effectively iterates over multiple levels of lists. Take the case where you have a list within a list and want to return all of the low-level elements as a list:
l = []
for rr in qa:
    for s in rr.strings:
        l.append(s)
This can be rewritten as a two-level list comprehension:
l = [s for rr in qa for s in rr.strings]
This can't easily be done via map. (We would probably have to roll
in a reduce to flatten the list of lists that map would give us
into a single-level list.)
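For comparison, here is roughly what the map plus reduce version might look like next to the comprehension. This is only a sketch: the RR namedtuple and the sample data are stand-ins invented for the example, not the real DNS answer objects.

```python
from collections import namedtuple
from functools import reduce

# Hypothetical stand-in for the records in qa; only .strings matters here.
RR = namedtuple("RR", ["strings"])
qa = [RR(["a", "b"]), RR([]), RR(["c"])]

# The two-level list comprehension.
flat_lc = [s for rr in qa for s in rr.strings]

# map() yields one list per record; reduce() concatenates them,
# starting from an empty list.
flat_mr = reduce(lambda acc, strs: acc + strs,
                 map(lambda rr: rr.strings, qa),
                 [])
```

Both produce the same flat list, but the reduce version builds a new intermediate list at every step and is harder to read.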
Multilevel list comprehensions work left to right; the leftmost 'for X in Y' is the outermost one, and then we step inwards as we move right. You can also use if conditions, so the correct version of the list comprehension I wrote, in context and with error checking, would be:
from dns.rdatatype import TXT
l = [s for rr in qa if rr.rdtype == TXT
     for s in rr.strings]
What impresses me about Python is that this works just the way I thought it would: both of these examples worked the first time, just as I wrote them, and needed no debugging. (The first version actually got used in a scratch program.)
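The left-to-right rule can be checked by writing the filtered comprehension out as nested loops. The sketch below uses made-up records so that it runs without the DNS library; RR and the sample data are hypothetical, and TXT is set to 16, the DNS type number for TXT records, in place of the dns.rdatatype import.

```python
from collections import namedtuple

TXT = 16  # DNS type number for TXT; stands in for dns.rdatatype.TXT

# Hypothetical stand-in for the answer records.
RR = namedtuple("RR", ["rdtype", "strings"])
qa = [
    RR(TXT, ["v=spf1", "-all"]),
    RR(1, ["ignored"]),          # type 1 is an A record, so it is skipped
    RR(TXT, ["hello"]),
]

l = [s for rr in qa if rr.rdtype == TXT
     for s in rr.strings]

# The same clauses, read left to right, as nested statements.
expanded = []
for rr in qa:
    if rr.rdtype == TXT:
        for s in rr.strings:
            expanded.append(s)
```

Each for and if clause in the comprehension becomes one more level of nesting, in the order written.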