2014-11-12
A wish: setting Python 3 to do no implicit Unicode conversions
In light of the lurking Unicode conversion issues in my DWiki port to Python 3, one of the things I've realized I would like in Python 3 is some way to turn off all of the implicit conversions to and from Unicode that Python 3 currently does when it talks to the outside world.
The goal here is the obvious one: since any implicit conversion is a place where I need to consider how to handle errors, character encodings, and so on, making them either raise errors or produce bytestrings would let me find them all (and force me to handle things explicitly). Right now many implicit conversions can sail quietly past because they only have to deal with valid input or simple output, only to blow up in my face later.
(Yes, in a greenfield project you would be paying close attention to all places where you deal with the outside world. Except of course for the ones that you overlook because you don't think about them and they just work. DWiki is not in any way a greenfield project and in Python 2 it arrogantly doesn't use Unicode at all.)
It's possible that you can fake this by setting your (Unix) character encoding to either an existing encoding that is going to blow up on utf-8 input and output (including plain ASCII) or to a new Python encoding that always errors out. However this gets me down into the swamps of default Python encodings and how to change them, which I'm not sure I want to venture into. I'd like either an officially supported feature or an easy hack. I suspect that I'm dreaming on the former.
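(As a rough sketch of the 'new Python encoding that always errors out' idea: the standard codecs module lets you register your own codec by name. The codec name alwaysfail and the ImplicitConversionError class are made up for illustration, and this only defines the codec. It does not change any of Python's default encodings, which is the swampy part.)

```python
import codecs


class ImplicitConversionError(UnicodeError):
    """Hypothetical marker error for conversions we did not ask for."""


def _fail_encode(text, errors="strict"):
    # Refuse every str -> bytes conversion that uses this codec.
    raise ImplicitConversionError("unexpected str -> bytes conversion")


def _fail_decode(data, errors="strict"):
    # Refuse every bytes -> str conversion that uses this codec.
    raise ImplicitConversionError("unexpected bytes -> str conversion")


def _search(name):
    # Codec search functions get the normalized codec name and return
    # a CodecInfo for names they handle, or None otherwise.
    if name == "alwaysfail":
        return codecs.CodecInfo(
            name="alwaysfail", encode=_fail_encode, decode=_fail_decode
        )
    return None


codecs.register(_search)
```

With this registered, anything that explicitly names the codec, such as "abc".encode("alwaysfail"), raises a UnicodeError subclass instead of quietly succeeding.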
(I suspect that there are currently places in Python 3 that both always perform a conversion and don't provide an API to set the character encoding for the conversion. Such places are an obvious problem for an official 'conversion always produces errors' setting.)
2014-11-10
What it took to get DWiki running under Python 3
For quixotic reasons I recently decided to see how far I could get with porting DWiki (the code behind this blog) to Python 3 before I ran out of either patience or enthusiasm. I've gotten much further than I expected; at this point I'm far enough that it can handle this entire site when running under Python's builtin basic HTTP server, rendering the HTML exactly the same as the Python 2 version does.
Getting this far basically took three steps. The largest step was
updating the code to modern Python 2,
because Python 3 doesn't accept various bits of old syntax. After
I'd done that, I ran 2to3 over the codebase to do a bunch of
mechanical substitutions, mostly rewriting print statements
and standard modules that had gotten reorganized in the transition.
The final necessary step was some Unicode conversion and mangling
(and with it reading some files in binary mode).
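(To illustrate the binary mode part: the general shape is to read raw bytes and then decode them yourself, so that the encoding and the error policy are visible in the code. The function name read_page, the UTF-8 default, and the choice of errors="replace" are all assumptions for this sketch, not what DWiki does.)

```python
def read_page(path, encoding="utf-8"):
    # Open in binary mode so Python does no implicit decoding for us.
    with open(path, "rb") as f:
        data = f.read()
    # Decode explicitly; errors="replace" turns undecodable bytes into
    # U+FFFD instead of raising UnicodeDecodeError.
    return data.decode(encoding, errors="replace")
```

Using errors="strict" instead would raise on bad bytes, which is the better choice if you want to find problems rather than paper over them.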
All of this sounds great, but the reality is that DWiki is only limping along under Python 3 and this is exactly because of the Unicode issue. Closely related to this is that I have not revised my WSGI code for any changes in the Python 3 version of WSGI (I'm sure there must be some, just because of character encoding issues). Doing a real Python 3 port of DWiki would require dealing with this, which means going through everywhere that DWiki talks to the outside world (for file IO, for logging, and for reading and replying to HTTP requests), figuring out where the conversion boundary is between Unicode and bytestrings, what character encoding I need to use and how to recognize this, and finally what to do about encoding and decoding errors.
Complicating this is that some of these encoding boundaries are further upstream than you might think. Two closely related cases I've run into so far are that DWiki computes the ETag and Content-Length for the HTTP reply itself, and for obvious reasons both of these must be calculated against the encoded bytestring version of the content body instead of its original Unicode version. This happens relatively far inside my code, not right at the boundary between WSGI and me.
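(A minimal sketch of why the encoded body matters here. The function name body_headers, the UTF-8 default, and using a SHA-1 hex digest as the ETag are assumptions for illustration; how DWiki actually computes its ETag isn't covered in this entry.)

```python
import hashlib


def body_headers(text, encoding="utf-8"):
    # Encode first: both headers describe the bytes that go on the
    # wire, not the Unicode string we started with.
    body = text.encode(encoding)
    etag = '"%s"' % hashlib.sha1(body).hexdigest()
    headers = [
        ("Content-Length", str(len(body))),
        ("ETag", etag),
    ]
    return body, headers
```

For a body like "naïve", len() of the string is 5 but the UTF-8 encoded body is 6 bytes, so computing Content-Length from the Unicode version would be wrong.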
(Another interesting case is encoding URLs that have non-ASCII characters in them, for example from a page with a name that has Unicode characters in it. Such URLs can get encoded both in HTML and in the headers of redirects, and need to be decoded at some point on the way in, where I probably need to %-decode to a bytestring and then decode that bytestring to a Unicode string.)
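(The two-step decode can be sketched with the standard urllib.parse module. The function name decode_path, the assumption that URLs are UTF-8, and returning None on bad input are illustrative choices only.)

```python
from urllib.parse import quote, unquote_to_bytes


def decode_path(raw_path, encoding="utf-8"):
    # Step 1: undo %-escapes, which yields a bytestring.
    raw = unquote_to_bytes(raw_path)
    # Step 2: decode those bytes, handling bad input explicitly.
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


def encode_path(path, encoding="utf-8"):
    # Going the other way: encode to bytes, then %-escape them.
    return quote(path, encoding=encoding)
```

What to do in the None case (a 404, a 400, or something else) is exactly the sort of decision that implicit conversions let you avoid making.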
Handling encoding and decoding errors is a real concern of mine
for a production quality version of DWiki in Python 3. The problem
is that most input these days is well behaved, so you can go quite
a while before someone sends you illegal UTF-8 in headers, URLs,
or POST bodies (or for that matter sends you something in another
character set). This handily disguises failures to handle encoding
and decoding problems, since things work almost all the time. And
Python 3 has a lot of places with implicit conversions.
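(One concrete place this shows up is WSGI. Under PEP 3333, a Python 3 server hands you environ values as strings whose characters stand for raw bytes via ISO-8859-1, so recovering the real text means re-encoding and then decoding yourself. The helper name wsgi_text, the UTF-8 default, and returning None on failure are assumptions for this sketch.)

```python
def wsgi_text(environ, key, encoding="utf-8"):
    # PEP 3333 'native strings' carry bytes as ISO-8859-1 code points,
    # so encoding back to ISO-8859-1 recovers the original bytes.
    raw = environ.get(key, "").encode("iso-8859-1")
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Illegal UTF-8 (or another character set); let the caller
        # decide how to respond instead of failing somewhere later.
        return None
```

This assumes a compliant server; a value with characters outside ISO-8859-1 would make the encode step itself raise.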
That these Unicode issues exist doesn't surprise me. Rather the reverse; dealing with Unicode has always been the thing that I thought would be hardest about any DWiki port to Python 3. I am pleasantly surprised by how few code changes were required to get to this point, as I was expecting many more code changes (and for them to be much more difficult to make, I think because at some point I'd got the impression that 2to3 wasn't very well regarded).
Given the depths of the Unicode swamps here, I'm not sure that I'll go much further with a Python 3 version of DWiki than I already have. But, as mentioned, it is both nice and surprising to me that I could get this far with this little effort. The basics of porting to Python 3 are clearly a lot less work than I was afraid of.
2014-11-07
Porting to Python 3 by updating to modern Python 2
For quixotic reasons I decided to take a shot at porting DWiki to Python 3 just to see how difficult and annoying it would be and how far I could get. One of the surprising things about the process has been that a great deal of porting to Python 3 has been less about porting the code and more about modernizing it to current Python 2 standards.
DWiki is what is now a pretty old codebase (as you might guess) and even when it was new it wasn't written with the latest Python idioms for various reasons, including that I started with Python back in the Python 1.5 era. As a result it contained a number of long obsolete idioms that are very much not supported in Python 3 and had to be changed. Once the dust settled it turned out that modernizing these idioms was most (although not all) of what was needed to make DWiki at least start up under Python 3.
At this point you might be wondering just what ancient idioms I was still using. I'm glad you asked. DWiki was doing all of these:
- 'raise EXCEPTION, STR' instead of 'raise E(STR)'. I have no real excuse here; I'm sure this was considered obsolete even when I started writing DWiki.
- 'except CLS, VAR:' instead of 'except CLS as VAR:', which I think is at least less ancient than my raise usage.
- using comparison functions in .sort() instead of key=... and reverse=True. Switching made things clearer.
- dividing two integers with '/' and expecting the result to be an integer. In Python 3 this is an exact float instead, which caused an interesting bug when I used the result as an (integer) counter. Using '//' explicitly is better and is needed in Python 3.
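(For illustration, here are the modern forms of all four idioms in one small sketch, with the old Python 2 spelling in comments. The names PageError, load, safe_load, and the sample data are invented for this example and are not DWiki code.)

```python
class PageError(Exception):
    pass


def load(name):
    if not name:
        # old: raise PageError, "empty page name"
        raise PageError("empty page name")
    return name


def safe_load(name):
    try:
        return load(name)
    # old: except PageError, err:
    except PageError as err:
        return "error: %s" % err


pages = [("b", 3), ("a", 10), ("c", 1)]
# old: pages.sort(lambda x, y: cmp(y[1], x[1]))
by_count = sorted(pages, key=lambda p: p[1], reverse=True)

# old: 7 / 2 gave 3 in Python 2; '//' gives 3 in both versions.
half = 7 // 2
```

All of the modern forms also work in current Python 2, which is why this could be done before ever running the code under Python 3.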
I consider this modernization of the Python 2 codebase to be a good thing. Even if I never do anything with a Python 3 version of DWiki, updating to the current best practice idioms is an improvement of the code (especially since it's public and I'd like it to not be too embarrassing). I'm glad that trying out a Python 3 port has pushed me into doing this; it really has been overdue.
(Another gotcha that Python 3 exposed is that in at least one place
I was assuming that 'None > 0' was a valid comparison to make and
would be False. This works in Python 2 but it's not exactly a
good idea and fixing the code to explicitly check for None is a
good cleanup. Since this sort of stuff can only really be checked
dynamically there may be other spots that do this.)
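(A minimal sketch of the explicit check, with the hypothetical name is_positive. In Python 3, evaluating None > 0 raises TypeError instead of quietly being False.)

```python
def is_positive(value):
    # Check for None first so the comparison never sees it;
    # 'and' short-circuits before 'value > 0' is evaluated.
    return value is not None and value > 0
```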