== Thinking about DWiki's Python 3 Unicode issues

DWiki (the code behind [[this blog /blog]]) is currently Python 2, and [[it has to move to Python 3 someday DWikiPython3Someday]], even if I'm in no hurry to make that move. The end of 2018, with only a year of official Python 2 support remaining, seems like a good time to take stock of what I expect to be the biggest aspect of that move, which is character set and Unicode issues (this is also the big issue I ignored when [[I got DWiki tentatively running under Python 3 a few years ago DWikiAndPython3]]).

The current Python 2 version of DWiki basically ignores encoding issues. It allows you to specify the character set that the HTML will declare, but it pretty much treats everything as bytes and makes no attempt to validate that your content is actually valid in the character set you've claimed. This is not viable in Python 3 for various reasons, including that it's not how the Python 3 version of WSGI works (as covered in [[PEP 3333 https://www.python.org/dev/peps/pep-3333/]]). Considering Unicode issues for a Python 3 version of DWiki means thinking about everywhere that DWiki reads and writes data, and deciding what encoding that data is in (and then properly inserting error checks to handle data that isn't actually properly encoded).

The primary source of text data for DWiki is the text of pages and comments. Here in 2018, the only sensible encoding for these is UTF-8, and I should probably just hardcode that assumption into reading them from the filesystem (and writing comments out to the filesystem); there's a sketch of what I mean below. Relying on Python's system encoding setting, whatever it happens to be, doesn't seem like a good idea, and I don't think this should be settable in DWiki's configuration file. UTF-8 also has the advantage for writing things out that it's a universal encoder; you can encode any Unicode _str_ to UTF-8, which isn't true of all character encodings.

Another source of text data is the names of files and directories in the directory hierarchy that DWiki serves content from; these will generally appear in links and various other places. Again, I think the only sensible decision in 2018 is to declare that all filenames have to be UTF-8, with undefined things happening if they aren't. DWiki will do its best to do something sensible, but it can only do so much. Since these names propagate through to links and so on, I will have to make sure that UTF-8 in links is properly encoded.

(In general, I probably want to use the '_backslashreplace_' error handling option when decoding to Unicode, because that's the option that both produces correct results and preserves as much information as possible. Since this introduces extra backslashes, I'll have to make sure they're all handled properly.)

For HTML output, once again the only sensible encoding is UTF-8. I'll take out the current configuration file option and just hard-code it, so the internal Unicode HTML content that's produced by rendering DWikiText to HTML will be encoded to UTF-8 bytestrings. I'll have to make sure that I consistently calculate my [[ETag http://en.wikipedia.org/wiki/HTTP_ETag]] values from the same version of the content, probably the bytestring version (the current code calculates the ETag hash very late in the process); again, there's a sketch of this below.
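As a concrete sketch of the hardcoded-UTF-8 approach (the function names here are made up, and decoding with _backslashreplace_ needs Python 3.5 or later):

    import os

    # Page and comment I/O with the encoding hardcoded, rather than
    # taken from the locale or from DWiki's configuration file.
    def read_page(path):
        with open(path, encoding='utf-8') as f:
            return f.read()

    def write_comment(path, text):
        with open(path, 'w', encoding='utf-8') as f:
            f.write(text)

    # Asking os.listdir() for bytes gets bytes names back; decoding
    # with 'backslashreplace' never fails and preserves the raw byte
    # values of any name that isn't valid UTF-8.
    def list_names(dirpath):
        return [n.decode('utf-8', errors='backslashreplace')
                for n in os.listdir(os.fsencode(dirpath))]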
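And a sketch of calculating the ETag from the final encoded output; the particular hash function doesn't matter much here and is just for illustration:

    import hashlib

    # Hashing the UTF-8 bytestring means the ETag always corresponds
    # exactly to the bytes that get sent over the wire.
    def etag_for(body):
        return '"%s"' % hashlib.sha1(body).hexdigest()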
DWiki interacts with the HTTP world through WSGI, although in a normal setup it's all my own WSGI implementation. [[PEP 3333]] clarifies WSGI for Python 3, and it specifies two sides of things here: [[what types are used where https://www.python.org/dev/peps/pep-3333/#a-note-on-string-types]], and [[some information on header encoding https://www.python.org/dev/peps/pep-3333/#unicode-issues]]. For output, my header values will generally be in ISO-8859-1; however, for some redirections, the _Location:_ header might include UTF-8 derived from filenames, and I'll need to encode it properly (there's a sketch of this below).

Handling incoming HTTP headers and bodies is going to be more annoying and perhaps more challenging; people and programs may well send me headers that aren't properly encoded, and for _POST_ requests (for example, for comments) there may be various encodings in use, as well as the possibility that the data is not correctly encoded (eg it claims to be UTF-8 but doesn't decode properly). In theory [[I might be able to force people to use UTF-8 on comment submissions ../web/FormCharsets]], and probably most browsers would accept that. Since I don't actually know what happens in the wild here, a sensible first pass Python 3 implementation should probably log and reject with an HTTP error any comment submission that is not in UTF-8, or any HTTP request with headers that don't properly decode (again, a sketch is below). If I see any significant quantity of rejections that appear legitimate, I can add code that tries to handle the situation.

(Possibly I should start by adding code to the current Python 2 version of DWiki that looks for this situation and logs information about it. That would give me a year or two of data at a minimum. I should also add an _accept-charset_ attribute to the current comment form.)

DWiki has [[on-disk caches /dwiki/Caching]] of data created with Python's [[pickle module https://docs.python.org/3/library/pickle.html]]. I'll have to make sure that the code reads and writes these objects as bytestrings and in binary mode, without trying to encode or decode the data (in my current code, I read and write the pickled data myself, not through the pickle module); there's a sketch of this below.

The current DWiki code does some escaping of bad characters in text, because at one point control characters kept creeping in and blowing up my Atom feeds. This escaping should stay in a Python 3 Unicode world, where it will become more correct and reliable (currently it really operates on bytes, which has various issues); a sketch of a Unicode version is below.

Since in real life most things are properly encoded and even mostly ASCII, mistakes in all of this might lurk undetected for some time. To deal with this, I should set up two torture test environments for DWiki: one where there is UTF-8 everywhere I can think of (including in file and directory names), and one where there is incorrectly encoded UTF-8 everywhere I can think of (or things simply not encoded as UTF-8, but instead in Latin-1 or something); a sketch of creating such filenames is below. Running DWiki against both of these would smoke out many problems and areas I've missed. I should also put together some HTTP tests with badly encoded headers and comment _POST_ bodies and so on, although I'm not sure what tools are available to create deliberately malformed HTTP requests like that.
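For the _Location:_ header, percent-encoding the UTF-8 form of the path keeps the header value pure ASCII, which trivially satisfies PEP 3333's rule that header strings must be limited to Latin-1 code points. A sketch (the host name is a placeholder):

    from urllib.parse import quote

    # quote() encodes a str as UTF-8 and then percent-escapes it,
    # so the resulting header value is plain ASCII.
    def location_for(path):
        return 'https://example.org' + quote(path, safe='/')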
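A sketch of the 'log and reject' first pass for comment bodies; _log_request_problem_ is a hypothetical logging helper, and the caller is assumed to turn _None_ into a 400 Bad Request response:

    def decode_comment_body(raw):
        # raw is the bytes of the POST body; anything that doesn't
        # decode as UTF-8 gets logged and rejected outright.
        try:
            return raw.decode('utf-8')
        except UnicodeDecodeError as e:
            log_request_problem('comment body is not valid UTF-8: %s' % e)
            return None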
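Since DWiki handles the file I/O for its caches itself, the Python 3 version just has to stay in binary mode from end to end; something like:

    import pickle

    # Pickled data is bytes and must never pass through an encode
    # or decode step; all file I/O is in binary mode.
    def write_cache_file(path, obj):
        blob = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
        with open(path, 'wb') as f:
            f.write(blob)

    def read_cache_file(path):
        with open(path, 'rb') as f:
            return pickle.loads(f.read())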
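A sketch of what the escaping might look like once it operates on Unicode; XML 1.0 (and thus Atom) forbids the C0 control characters other than tab, newline, and carriage return, and the exact replacement representation here is just for illustration:

    import re

    # Match the control characters that XML 1.0 disallows (plus DEL).
    _CTRL_RE = re.compile('[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]')

    def escape_controls(text):
        # Replace each control character with a visible \xNN escape.
        return _CTRL_RE.sub(lambda m: '\\x%02x' % ord(m.group()), text)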
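Creating the badly encoded side of the torture test is straightforward on a POSIX system, because bytes paths go to the filesystem untranslated. A sketch (the directory layout is made up):

    import os

    os.makedirs(b'torture/pages', exist_ok=True)

    # A name and content that are valid UTF-8:
    with open(b'torture/pages/caf\xc3\xa9.wt', 'wb') as f:
        f.write('a valid UTF-8 caf\u00e9 page\n'.encode('utf-8'))

    # A Latin-1 name (a bare 0xe9 byte) that is not valid UTF-8:
    with open(b'torture/pages/caf\xe9.wt', 'wb') as f:
        f.write('a Latin-1 caf\xe9 page\n'.encode('latin-1'))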
All of this is clearly going to be a long term project, and I've probably missed some areas, but at least I'm starting to think about it a bit. Also, I now have some preliminary steps I can take while DWiki is still a Python 2 program (although whether I'll get around to them is another question, as it always is these days with work on DWiki's code).

PS: Rereading [[my old entry DWikiAndPython3]] has also reminded me that there are DWiki's logging messages as well. I'll just declare those to be UTF-8 and be done with it, since I can turn any Unicode into UTF-8. The rest of the log file may or may not be UTF-8, but I really don't care. Fortunately DWiki doesn't use syslog (although [[I've already wrestled with that issue Python3SyslogEncoding]]).

=== Sidebar: DWiki's rendering templates and static file serving

DWiki has [[an entire home-grown template system /dwiki/TemplateSyntax]] that's used as part of [[the processing model /dwiki/ProcessingModel]]. These templates should be declared to be UTF-8 and loaded as such, with a failure to decode properly being a fatal internal error.

DWiki can also be configured to serve static files. In Python 3, these static files should be loaded uninterpreted as (binary mode) bytestrings and served back out that way, especially since they can be used for things like images (which are binary data to start with). Unfortunately this is going to require some code changes in DWiki's storage layer, because right now these static files are loaded from disk with the same code that is also used to load DWikiText pages, which have to be decoded to Unicode as they're loaded. The split might look something like the sketch below.
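A sketch of that split, with a hypothetical _InternalError_ exception standing in for whatever DWiki actually uses:

    class InternalError(Exception):
        # Stand-in for DWiki's real internal-error exception.
        pass

    # Templates must decode as UTF-8; a failure is a fatal internal
    # error. Static files are never decoded at all.
    def load_template(path):
        with open(path, 'rb') as f:
            raw = f.read()
        try:
            return raw.decode('utf-8')
        except UnicodeDecodeError as e:
            raise InternalError('template %r is not valid UTF-8: %s' % (path, e))

    def load_static_file(path):
        with open(path, 'rb') as f:
            return f.read()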