What it took to get DWiki running under Python 3
For quixotic reasons I recently decided to see how far I could get with porting DWiki (the code behind this blog) to Python 3 before I ran out of either patience or enthusiasm. I've gotten much further than I expected; at this point I'm far enough that it can handle this entire site when running under Python's builtin basic HTTP server, rendering the HTML exactly the same as the Python 2 version does.
Getting this far basically took three steps. The largest step was
updating the code to modern Python 2,
because Python 3 doesn't accept various bits of old syntax. After
I'd done that, I ran
2to3 over the codebase to do a bunch of
mechanical substitutions, mostly rewriting
All of this sounds great, but the reality is that DWiki is only limping along under Python 3 and this is exactly because of the Unicode issue. Closely related to this is that I have not revised my WSGI code for any changes in the Python 3 version of WSGI (I'm sure there must be some, just because of character encoding issues). Doing a real Python 3 port of DWiki would require dealing with this, which means going through everywhere that DWiki talks to the outside world (for file IO, for logging, and for reading and replying to HTTP requests), figuring out where the conversion boundary is between Unicode and bytestrings, what character encoding I need to use and how to recognize this, and finally what to do about encoding and decoding errors. Complicating this is that some of these encoding boundaries are further upstream than you might think. Two closely related cases I've run into so far is that DWiki computes the ETag and Content-Length for the HTTP reply itself, and for obvious reasons both of these must be calculated against the encoded bytestring version of the content body instead of its original Unicode version. This happens relatively far inside my code, not right at the boundary between WSGI and me.
(Another interesting case is encoding URLs that have non-ASCII characters in them, for example from a page with a name that has Unicode characters in it. Such URLs can get encoded both in HTML and in the headers of redirects, and need to be decoded at some point on the way in, where I probably need to %-decode to a bytestring and then decode that bytestring to a Unicode string.)
Handling encoding and decoding errors are a real concern of mine
for a production quality version of DWiki in Python 3. The problem
is that most input these days is well behaved, so you can go quite
a while before someone sends you illegal UTF-8 in headers, URLs,
POST bodies (or for that matter sends you something in another
character set). This handily disguises failures to handle encoding
and decoding problems, since things work almost all the time. And
Python 3 has a lot of places with implicit conversions.
That these Unicode issues exist doesn't surprise me. Rather the reverse; dealing with Unicode has always been the thing that I thought would be hardest about any DWiki port to Python 3. I am pleasantly surprised by how few code changes were required to get to this point, as I was expecting much more code changes (and for them to be much more difficult to make, I think because at some point I'd got the impression that 2to3 wasn't very well regarded).
Given the depths of the Unicode swamps here, I'm not sure that I'll go much further with a Python 3 version of DWiki than I already have. But, as mentioned, it is both nice and surprising to me that I could get this far with this little effort. The basics of porting to Python 3 are clearly a lot less work than I was afraid of.