Thinking about DWiki's Python 3 Unicode issues
DWiki (the code behind this blog) is currently Python 2, and it has to move to Python 3 someday, even if I'm in no hurry to make that move. The end of 2018, with only a year of official Python 2 support remaining, seems like a good time to take stock of what I expect to be the biggest aspect of that move, which is character set and Unicode issues (this is also the big issue I ignored when I got DWiki tentatively running under Python 3 a few years ago).
The current Python 2 version of DWiki basically ignores encoding issues. It allows you to specify the character set that the HTML will declare, but it pretty much treats everything as bytes and makes no attempt to validate that your content is actually valid in the character set you've claimed. This is not viable in Python 3 for various reasons, including that it's not how the Python 3 version of WSGI works (as covered in PEP 3333). Considering Unicode issues for a Python 3 version of DWiki means thinking about everywhere that DWiki reads data from and writes data to, and deciding what encoding that data is in (and then properly inserting error checks to handle when that data is not actually properly encoded).
The primary source of text data for DWiki is the text of pages and comments. Here in 2018, the only sensible encoding for these is UTF-8, and I should probably just hardcode that assumption into reading them from the filesystem (and writing comments out to the filesystem). Relying on Python's system encoding setting, whatever it is, does not seem like a good idea, and I don't think this should be settable in DWiki's configuration file. UTF-8 also has the advantage for writing things out that it's a universal encoder; you can encode any Unicode str to UTF-8, which isn't true of all encodings.
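As a rough sketch of what hard-coding UTF-8 could look like in Python 3 (the names PAGE_ENCODING, read_page and write_comment are made up for illustration and are not DWiki's actual code), the key point is passing an explicit encoding to open() instead of relying on the system default:

```python
# Hypothetical helpers; DWiki's real storage layer will look different.
PAGE_ENCODING = "utf-8"  # hard-coded on purpose, not a configuration option


def read_page(path):
    """Return the page text as str; raises UnicodeDecodeError on bad bytes."""
    # newline="" disables newline translation so the text round-trips exactly.
    with open(path, "r", encoding=PAGE_ENCODING, newline="") as f:
        return f.read()


def write_comment(path, text):
    """Write comment text out as UTF-8, whatever the system encoding is."""
    with open(path, "w", encoding=PAGE_ENCODING, newline="") as f:
        f.write(text)
```

With the default strict error handling, a page that is not valid UTF-8 fails loudly at read time, which is where the error check mentioned earlier would go.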
Another source of text data is the names of files and directories in the directory hierarchy that DWiki serves content from; these will generally appear in links and various other places. Again, I think the only sensible decision in 2018 is to declare that all filenames have to be UTF-8 and undefined things happen if they aren't. DWiki will do its best to do something sensible, but it can only do so much. Since these names propagate through to links and so on, I will have to make sure that UTF-8 in links is properly encoded.
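For the link side of this, one plausible approach is to percent-encode the UTF-8 bytes of each name with the standard library's urllib.parse.quote and then HTML-escape the result. This is only a sketch; url_path and link_html are hypothetical names, not functions DWiki has:

```python
from html import escape
from urllib.parse import quote


def url_path(name):
    """Percent-encode a str path as UTF-8, leaving '/' separators alone."""
    return quote(name, safe="/")


def link_html(path, text):
    """Build an <a> element with an encoded href and escaped link text."""
    href = escape(url_path(path), quote=True)
    return '<a href="%s">%s</a>' % (href, escape(text))
```

The resulting href is pure ASCII, so it no longer matters what encoding the surrounding page or header is in.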
(In general, I probably want to use the 'backslashreplace' error handling option when decoding to Unicode, because that's the option that both produces correct results and preserves as much information as possible. Since this introduces extra backslashes, I'll have to make sure they're all handled properly.)
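To make the backslash concern concrete, here is a small sketch assuming the option in question is Python's 'backslashreplace' error handler (which works for decoding as of Python 3.5). The function name decode_lossy is hypothetical:

```python
def decode_lossy(raw):
    """Decode bytes as UTF-8, turning each undecodable byte into a \\xNN escape."""
    return raw.decode("utf-8", errors="backslashreplace")


# A stray 0xff byte survives as visible text instead of raising or vanishing.
bad = decode_lossy(b"a\xffb")

# But input that already contains a literal backslash sequence decodes to the
# very same string, so the escapes are not unambiguous on their own.
literal = decode_lossy(b"a\\xffb")
```

Here bad and literal end up as the same six-character string, which is one reason the extra backslashes need careful handling downstream.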
For HTML output, once again the only sensible encoding is UTF-8. I'll take out the current configuration file option and just hard-code it, so the internal Unicode HTML content that's produced by rendering DWikiText to HTML will be encoded to UTF-8 bytestrings. I'll have to make sure that I consistently calculate my ETag values from the same version of the content, probably the bytestring version (the current code calculates the ETag hash very late in the process).
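A minimal sketch of computing the ETag from the bytestring version, so that the hash always matches what actually goes out on the wire. The function name finish_page and the choice of SHA-1 are assumptions for illustration; the current DWiki code may hash differently:

```python
import hashlib


def finish_page(html):
    """Encode rendered HTML to UTF-8 and derive the ETag from those bytes."""
    body = html.encode("utf-8")
    # Hash the encoded bytes, never the str, so there is one canonical input.
    etag = '"%s"' % hashlib.sha1(body).hexdigest()
    return body, etag
```

Doing the encode and the hash in one place makes it hard to accidentally hash one version of the content and serve another.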
DWiki interacts with the HTTP world through WSGI, although it's all my own WSGI implementation in a normal setup. PEP 3333 clarifies WSGI for Python 3, and it specifies two sides of things here: what types are used where, and some information on header encoding. For output, generally my header values will be in ISO-8859-1; however, for some redirections, the Location: header might include UTF-8 derived from filenames, and I'll need to encode it properly. Handling incoming HTTP headers and bodies is going to be more annoying and perhaps more challenging; people and programs may well send me incorrectly formed headers that aren't properly encoded, and for POST requests (for example, for comments) there may be various encodings in use and also the possibility that the data is not correctly encoded (eg it claims to be UTF-8 but doesn't decode properly). In theory I might be able to force people to use UTF-8 on comment submissions, and probably most browsers would accept that.
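Under PEP 3333, header and environ strings are native str objects whose code points must all fit in ISO-8859-1, so each character really stands for one byte. A sketch of how that could be handled (wsgi_to_text and header_value_ok are hypothetical helper names):

```python
def wsgi_to_text(value):
    """Recover real text from a PEP 3333 'bytes as Latin-1' string.

    Raises UnicodeDecodeError if the underlying bytes are not valid UTF-8.
    """
    return value.encode("iso-8859-1").decode("utf-8")


def header_value_ok(value):
    """Check that an outgoing header value is representable in ISO-8859-1."""
    try:
        value.encode("iso-8859-1")
    except UnicodeEncodeError:
        return False
    return True
```

A Location: value built from a non-ASCII filename would fail the second check unless it has been percent-encoded first, which is the 'encode it properly' step.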
Since I don't actually know what happens in the wild here, probably a sensible first pass Python 3 implementation should log and reject with an HTTP error any comment submission that is not in UTF-8, or any HTTP request with headers that don't properly decode. If I see any significant quantity of them that appears legitimate, I can add code that tries to handle the situation.
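A first pass at 'log and reject' might look something like the following sketch. It assumes a form-urlencoded POST body and uses urllib.parse.parse_qs with strict error handling; read_comment_form, comment_app and the logger name are all made up for illustration:

```python
import logging
from urllib.parse import parse_qs

log = logging.getLogger("dwiki.sketch")  # hypothetical logger name


def read_comment_form(environ):
    """Return decoded form fields, or None if the body is not valid UTF-8."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    raw = environ["wsgi.input"].read(length)
    try:
        # A urlencoded body should be ASCII; percent-escapes must be UTF-8.
        return parse_qs(raw.decode("ascii"), encoding="utf-8", errors="strict")
    except UnicodeError as e:
        log.warning("rejecting badly encoded comment POST: %s", e)
        return None


def comment_app(environ, start_response):
    """Tiny WSGI app that answers 400 for undecodable submissions."""
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    if read_comment_form(environ) is None:
        start_response("400 Bad Request", headers)
        return [b"Comment was not valid UTF-8.\n"]
    start_response("200 OK", headers)
    return [b"ok\n"]
```

This is deliberately strict. Raw non-ASCII bytes in the body are rejected too, which may turn out to be too harsh once there is real data on what clients send.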
(Possibly I should start by adding code to the current Python 2 version of DWiki that looks for this situation and logs information about it. That would give me a year or two of data at a minimum. I should also add an accept-charset attribute to the current comment form.)
DWiki has on-disk caches of data created with Python's pickle module. I'll have to make sure that the code reads and writes these objects using bytestrings and in binary mode, without trying to encode or decode them (in my current code, I read and write the pickled data myself, not through the pickle module).
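In Python 3 terms, that means going through pickle.dumps() and pickle.loads() and only ever touching the file in binary mode. A minimal sketch, with hypothetical function names:

```python
import pickle


def write_cache(path, obj):
    """Pickle obj to bytes ourselves, then write those bytes in binary mode."""
    data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "wb") as f:
        f.write(data)


def read_cache(path):
    """Read the raw bytes back in binary mode and unpickle them."""
    with open(path, "rb") as f:
        data = f.read()
    return pickle.loads(data)
```

Opening the file in text mode instead would make Python try to decode the pickle bytes, which either fails outright or corrupts the data.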
The current DWiki code does some escaping of bad characters in text, because at one point control characters kept creeping in and blowing up my Atom feeds. This escaping should stay in a Python 3 Unicode world, where it will become more correct and reliable (currently it really operates on bytes, which has various issues).
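I haven't reproduced DWiki's actual escaping rules here, but as an illustration of what this looks like on str rather than bytes, the following sketch replaces the control characters that XML 1.0 does not allow (everything below U+0020 except tab, newline and carriage return, plus DEL as an extra precaution). The function name and the escape format are assumptions:

```python
import re

# Control characters that tend to break Atom/XML consumers.
_BAD_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def escape_controls(text):
    """Replace each bad control character with a visible \\xNN escape."""
    return _BAD_CHARS.sub(lambda m: "\\x%02x" % ord(m.group()), text)
```

Because this works on code points, it cannot accidentally match a byte in the middle of a multi-byte UTF-8 sequence, which is one of the problems with doing it on bytes.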
Since in real life most things are properly encoded and even mostly ASCII, mistakes in all of this might lurk undetected for some time. To deal with this, I should set up two torture test environments for DWiki, one where there is UTF-8 everywhere I can think of (including in file and directory names) and one where there is incorrectly encoded UTF-8 everywhere I can think of (or things just not encoded as UTF-8, but instead Latin-1 or something). Running DWiki against both of these would smoke out many problems and areas I've missed. I should also put together some HTTP tests with badly encoded headers and comment POST bodies and so on, although I'm not sure what tools are available to create deliberately incorrect HTTP requests like that.
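One option that needs no special tools is to build the request as raw bytes and send it over a plain socket, since nothing in between will then validate or re-encode it. This is only a sketch of the idea; build_raw_request and send_raw are hypothetical names, and the host and port are whatever the test server happens to use:

```python
import socket


def build_raw_request(method, target, headers, body=b""):
    """Assemble an HTTP/1.1 request from bytes, with no validation at all."""
    lines = [method + b" " + target + b" HTTP/1.1"]
    for name, value in headers:
        lines.append(name + b": " + value)
    lines.append(b"Content-Length: " + str(len(body)).encode("ascii"))
    lines.append(b"Connection: close")  # so the server closes when done
    return b"\r\n".join(lines) + b"\r\n\r\n" + body


def send_raw(host, port, data, timeout=5.0):
    """Send the bytes as-is and return everything the server sends back."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(data)
        while True:
            chunk = s.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

For example, build_raw_request(b"POST", b"/blog/x", [(b"Host", b"localhost"), (b"X-Test", b"caf\xe9")], b"comment=\xff\xfe") produces a request with a Latin-1 byte in a header and invalid UTF-8 in the body.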
All of this is clearly going to be a long term project and I've probably missed some areas, but at least I'm starting to think about it a bit. Also, I now have some preliminary steps I can take while DWiki is still a Python 2 program (although whether I'll get around to them is another question, as it always is these days with work on DWiki's code).
PS: Rereading my old entry has also reminded me that there's DWiki's logging messages as well. I'll just declare those to be UTF-8 and be done with it, since I can turn any Unicode into UTF-8. The rest of the log file may or may not be UTF-8, but I really don't care. Fortunately DWiki doesn't use syslog (although I've already wrestled with that issue).
Sidebar: DWiki's rendering templates and static file serving
DWiki has an entire home-grown template system that's used as part of the processing model. These templates should be declared to be UTF-8 and loaded as such, with it being a fatal internal error if they fail to decode properly.
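A sketch of the 'fatal internal error' idea, assuming a dedicated exception type so that a broken template can't be mistaken for a badly encoded page. TemplateEncodingError and load_template are hypothetical names:

```python
class TemplateEncodingError(Exception):
    """Raised when a template file is not valid UTF-8 (an internal error)."""


def load_template(path):
    """Load a template as str, failing hard if it does not decode."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as e:
        msg = "%s: not valid UTF-8 at byte offset %d" % (path, e.start)
        raise TemplateEncodingError(msg) from e
```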
DWiki can also be configured to serve static files. In Python 3, these static files should be loaded uninterpreted as (binary mode) bytestrings and served back out that way, especially since they can be used for things like images (which are binary data to start with). Unfortunately this is going to require some code changes in DWiki's storage layer, because right now these static files are loaded from disk with the same code that is also used to load DWikiText pages, which have to be decoded to Unicode as they're loaded.
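One way the storage layer split might look is a single raw loader that returns bytes, with decoding layered on top only for DWikiText pages. The names below are invented for illustration, and using mimetypes to guess the Content-Type is my assumption rather than anything DWiki necessarily does:

```python
import mimetypes


def load_raw(path):
    """Read a file as uninterpreted bytes; shared by pages and static files."""
    with open(path, "rb") as f:
        return f.read()


def load_page_text(path):
    """DWikiText pages get decoded to str on top of the raw loader."""
    return load_raw(path).decode("utf-8")


def load_static(path):
    """Static files stay as bytes; only the Content-Type is guessed."""
    ctype = mimetypes.guess_type(path)[0] or "application/octet-stream"
    return ctype, load_raw(path)
```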