Thinking about DWiki's Python 3 Unicode issues

December 31, 2018

DWiki (the code behind this blog) is currently Python 2, and it has to move to Python 3 someday, even if I'm in no hurry to make that move. The end of 2018, with only a year of official Python 2 support remaining, seems like a good time to take stock of what I expect to be the biggest aspect of that move, which is character set and Unicode issues (this is also the big issue I ignored when I got DWiki tentatively running under Python 3 a few years ago).

The current Python 2 version of DWiki basically ignores encoding issues. It allows you to specify the character set the HTML will say, but it pretty much treats everything as bytes and makes no attempts to validate that your content is actually valid in the character set you've claimed. This is not viable in Python 3 for various reasons, including that it's not how the Python 3 version of WSGI works (as covered in PEP 3333). Considering Unicode issues for a Python 3 version of DWiki means thinking about everywhere that DWiki reads and writes data from, and deciding what encoding that data is in (and then properly inserting error checks to handle when that data is not actually properly encoded).

The primary source of text data for DWiki is the text of pages and comments. Here in 2018, the only sensible encoding for these is UTF-8, and I should probably just hardcode that assumption into reading them from the filesystem (and writing comments out to the filesystem). Relying on Python's system encoding setting, whatever it is, seems not like a good idea, and I don't think this should be settable in DWiki's configuration file. UTF-8 also has the advantage for writing things out that it's a universal encoder; you can encode any Unicode str to UTF-8, which isn't true of all character encoding.

Another source of text data is the names of files and directories in the directory hierarchy that DWiki serves content from; these will generally appear in links and various other places. Again, I think the only sensible decision in 2018 is to declare that all filenames have to be UTF-8 and undefined things happen if they aren't. DWiki will do its best to do something sensible, but it can only do so much. Since these names propagate through to links and so on, I will have to make sure that UTF-8 in links is properly encoded.

(In general, I probably want to use the 'backslashreplace' error handling option when decoding to Unicode, because that's the option that both produces correct results and preserves as much information as possible. Since this introduces extra backslashes, I'll have to make sure they're all handled properly.)

For HTML output, once again the only sensible encoding is UTF-8. I'll take out the current configuration file option and just hard-code it, so the internal Unicode HTML content that's produced by rendering DWikiText to HTML will be encoded to UTF-8 bytestrings. I'll have to make sure that I consistently calculate my ETag values from the same version of the content, probably the bytestring version (the current code calculates the ETag hash very late in the process).

DWiki interacts with the HTTP world through WSGI, although it's all my own WSGI implementation in a normal setup. PEP 3333 clarifies WSGI for Python 3, and it specifies two sides of things here; what types are used where, and some information on header encoding. For output, generally my header values will be in ISO-8859-1; however, for some redirections, the Location: header might include UTF-8 derived from filenames, and I'll need to encode it properly. Handling incoming HTTP headers and bodies is going to be more annoying and perhaps more challenging; people and programs may well send me incorrectly formed headers that aren't properly encoded, and for POST requests (for example, for comments) there may be various encodings in use and also the possibility that the data is not correctly encoded (eg it claims to be UTF-8 but doesn't decode properly). In theory I might be able to force people to use UTF-8 on comment submissions, and probably most browsers would accept that.

Since I don't actually know what happens in the wild here, probably a sensible first pass Python 3 implementation should log and reject with a HTTP error any comment submission that is not in UTF-8, or any HTTP request with headers that don't properly decode. If I see any significant quantity of them that appears legitimate, I can add code that tries to handle the situation.

(Possibly I should start by adding code to the current Python 2 version of DWiki that looks for this situation and logs information about it. That would give me a year or two of data at a minimum. I should also add an accept-charset attribute to the current comment form.)

DWiki has on-disk caches of data created with Python's pickle module. I'll have to make sure that the code reads and writes these objects using bytestrings and in binary mode, without trying to encode or decode it (in my current code, I read and write the pickled data myself, not through the pickle module).

The current DWiki code does some escaping of bad characters in text, because at one point control characters kept creeping in and blowing up my Atom feeds. This escaping should stay in a Python 3 Unicode world, where it will become more correct and reliable (currently it really operates on bytes, which has various issues).

Since in real life most things are properly encoded and even mostly ASCII, mistakes in all of this might lurk undetected for some time. To deal with this, I should set up two torture test environments for DWiki, one where there is UTF-8 everywhere I can think of (including in file and directory names) and one where there is incorrectly encoded UTF-8 everywhere I can think of (or things just not encoded as UTF-8, but instead Latin-1 or something). Running DWiki against both of these would smoke out many problems and areas I've missed. I should also put together some HTTP tests with badly encoded headers and comment POST bodies and so on, although I'm not sure what tools are available to create deliberately incorrect HTTP requests like that.

All of this is clearly going to be a long term project and I've probably missed some areas, but at least I'm starting to think about it a bit. Also, I now have some preliminary steps I can take while DWiki is still a Python 2 program (although whether I'll get around to them is another question, as it always is these days with work on DWiki's code).

PS: Rereading my old entry has also reminded me that there's DWiki's logging messages as well. I'll just declare those to be UTF-8 and be done with it, since I can turn any Unicode into UTF-8. The rest of the log file may or may not be UTF-8, but I really don't care. Fortunately DWiki doesn't use syslog (although I've already wrestled with that issue).

Sidebar: DWiki's rendering templates and static file serving

DWiki has an entire home-grown template system that's used as part of the processing model. These templates should be declared to be UTF-8 and loaded as such, with it being a fatal internal error if they fail to decode properly.

DWiki can also be configured to serve static files. In Python 3, these static files should be loaded uninterpreted as (binary mode) bytestrings and served back out that way, especially since they can be used for things like images (which are binary data to start with). Unfortunately this is going to require some code changes in DWiki's storage layer, because right now these static files are loaded from disk with the same code that is also used to load DWikiText pages, which have to be decoded to Unicode as they're loaded.


Comments on this page:

By tanzer@swing.co.at at 2018-12-31 07:05:57:

Beware: If your pickles contains datetime instances, you won't be able to read pickles created by Py2 in Py3 (see https://bugs.python.org/issue22005).

Knowing whether the form was submitted in UTF-8 might just be enough in today’s world (though, who knows), but if you feel like overbuilding, there is a simple solution for detecting the charset of a submitted form.

(I’ve long found it odd that this technique doesn’t appear to have found any adoption. I read the proposal right when it was published and was convinced that this would quickly become a par-for-the-course feature for any web framework to implement as infrastructure, allowing app developers to completely forget the whole issue. Instead it just came and went…)

By cks at 2018-12-31 18:52:52:

My current plan to deal with any Python 2/3 incompatibility in pickles is just to clear my caches when I upgrade (I clear them periodically by hand anyway, and DWiki is designed to deal with that). I expect there to be some issues in general, so re-starting from scratch is the cleanest way.

In a world where browsers generally implement the accept-charset form attribute or default to the HTML page's encoding (which will be UTF-8), I'm not sure I want to go through lots of effort to sniff out the actual encoding and then decode it. Of course I need to test that we actually live in this world, but if we do, assuming correctly behaving browsers is much simpler.

(Since I can shim this behavior into the current Python 2 code I can at least check and log that people seem to be submitting UTF-8 before I have to actually care about it.)

By Sean A. at 2019-01-26 02:58:49:

Out of curiosity, why use backslashreplace instead of surrogateescap? (I ask because it seems to me that surrogateescape also loses no information, is guaranteed to work with any binary input, and is designed for reading unknown encodings.)

By Sean A. at 2019-01-26 03:00:37:

Oh. And is trivial to convert back into the original binary data.

Written on 31 December 2018.
« Our third generation ZFS fileservers and their setup
Our Linux ZFS fileservers work much like to our OmniOS ones »

Page tools: View Source, View Normal, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Mon Dec 31 01:01:29 2018
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.