Wandering Thoughts archives

2018-12-31

Thinking about DWiki's Python 3 Unicode issues

DWiki (the code behind this blog) is currently Python 2, and it has to move to Python 3 someday, even if I'm in no hurry to make that move. The end of 2018, with only a year of official Python 2 support remaining, seems like a good time to take stock of what I expect to be the biggest aspect of that move, which is character set and Unicode issues (this is also the big issue I ignored when I got DWiki tentatively running under Python 3 a few years ago).

The current Python 2 version of DWiki basically ignores encoding issues. It allows you to specify the character set that the HTML will declare, but it pretty much treats everything as bytes and makes no attempt to validate that your content is actually valid in the character set you've claimed. This is not viable in Python 3 for various reasons, including that it's not how the Python 3 version of WSGI works (as covered in PEP 3333). Considering Unicode issues for a Python 3 version of DWiki means thinking about everywhere that DWiki reads data from and writes data to, and deciding what encoding that data is in (and then properly inserting error checks to handle when that data is not actually properly encoded).

The primary source of text data for DWiki is the text of pages and comments. Here in 2018, the only sensible encoding for these is UTF-8, and I should probably just hardcode that assumption into reading them from the filesystem (and writing comments out to the filesystem). Relying on Python's system encoding setting, whatever it happens to be, does not seem like a good idea, and I don't think this should be settable in DWiki's configuration file. UTF-8 also has the advantage for writing things out that it's a universal encoder; you can encode any Unicode str to UTF-8, which isn't true of all character encodings.
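
To make that concrete, here is a minimal sketch of what hard-coding UTF-8 might look like; the function names are hypothetical and DWiki's real storage layer is organized differently:

    # Hypothetical helpers, not DWiki's actual storage code.
    def read_page(path):
        # Always decode page text as UTF-8, never as the locale's encoding.
        with open(path, "r", encoding="utf-8") as fp:
            return fp.read()

    def write_comment(path, text):
        # Any Unicode str can be encoded to UTF-8, so the encoding
        # side of this cannot fail.
        with open(path, "w", encoding="utf-8") as fp:
            fp.write(text)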

Another source of text data is the names of files and directories in the directory hierarchy that DWiki serves content from; these will generally appear in links and various other places. Again, I think the only sensible decision in 2018 is to declare that all filenames have to be UTF-8 and undefined things happen if they aren't. DWiki will do its best to do something sensible, but it can only do so much. Since these names propagate through to links and so on, I will have to make sure that UTF-8 in links is properly encoded.
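
For the link side, the standard library can do the percent-encoding; this is a sketch, with a made-up helper name, of turning a Unicode file name into a URL path component:

    from urllib.parse import quote

    # Hypothetical helper: percent-encode the UTF-8 bytes of a file
    # name so it can safely appear in a link.
    def name_to_url_component(name):
        return quote(name, safe="")

    # quote("café-notes", safe="") -> 'caf%C3%A9-notes'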

(In general, I probably want to use the 'backslashreplace' error handling option when decoding to Unicode, because that's the option that both produces correct results and preserves as much information as possible. Since this introduces extra backslashes, I'll have to make sure they're all handled properly.)
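
As a small illustration of the behaviour I want (this needs Python 3.5 or later, where 'backslashreplace' works on decoding as well as encoding):

    # Bytes that aren't valid UTF-8 survive as visible escape sequences
    # instead of raising UnicodeDecodeError or turning into U+FFFD.
    data = b"caf\xe9 and \xff"
    text = data.decode("utf-8", "backslashreplace")
    # text is now 'caf\\xe9 and \\xff', which prints as: caf\xe9 and \xff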

For HTML output, once again the only sensible encoding is UTF-8. I'll take out the current configuration file option and just hard-code it, so the internal Unicode HTML content that's produced by rendering DWikiText to HTML will be encoded to UTF-8 bytestrings. I'll have to make sure that I consistently calculate my ETag values from the same version of the content, probably the bytestring version (the current code calculates the ETag hash very late in the process).
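
A rough sketch of what I mean, hashing the encoded bytes that will actually be sent (the function name and the choice of hash are just illustrative):

    import hashlib

    def etag_and_body(html):
        # Encode once, then hash exactly those bytes, so the ETag
        # always matches the body the client receives.
        body = html.encode("utf-8")
        return '"%s"' % hashlib.sha1(body).hexdigest(), body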

DWiki interacts with the HTTP world through WSGI, although in a normal setup it's all my own WSGI implementation. PEP 3333 clarifies WSGI for Python 3, and it specifies two sides of things here: what types are used where, and some information on header encoding. For output, my header values will generally be in ISO-8859-1; however, for some redirections the Location: header might include UTF-8 derived from filenames, and I'll need to encode it properly. Handling incoming HTTP headers and bodies is going to be more annoying and perhaps more challenging. People and programs may well send me malformed headers that aren't properly encoded, and for POST requests (for example, for comments) there may be various encodings in use, along with the possibility that the data is not correctly encoded (e.g. it claims to be UTF-8 but doesn't decode properly). In theory I might be able to force people to use UTF-8 on comment submissions, and probably most browsers would accept that.
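
For the Location: header specifically, the simplest approach is probably to percent-encode the UTF-8 form of the path, which leaves the header value as plain ASCII and thus safely within PEP 3333's Latin-1 rules; a sketch, with a hypothetical host name:

    from urllib.parse import quote

    def redirect_location(path):
        # quote() percent-encodes the UTF-8 bytes of any non-ASCII
        # characters, leaving an ASCII-only native string for WSGI.
        return "https://example.org" + quote(path)

    # redirect_location("/blog/café") -> 'https://example.org/blog/caf%C3%A9'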

Since I don't actually know what happens in the wild here, a sensible first pass Python 3 implementation should probably log and reject with an HTTP error any comment submission that is not in UTF-8, and any HTTP request with headers that don't decode properly. If I see any significant quantity of these that appear legitimate, I can add code that tries to handle the situation.
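
In sketch form (the logging and the way the rejection turns into an HTTP error are hypothetical stand-ins for DWiki's real plumbing):

    import logging

    def check_comment_body(raw_bytes):
        # Reject anything that isn't valid UTF-8, but log it so I can
        # see whether legitimate submissions are being turned away.
        try:
            return raw_bytes.decode("utf-8")
        except UnicodeDecodeError as exc:
            logging.warning("rejected non-UTF-8 comment POST: %s", exc)
            return None    # the caller turns this into an HTTP error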

(Possibly I should start by adding code to the current Python 2 version of DWiki that looks for this situation and logs information about it. That would give me a year or two of data at a minimum. I should also add an accept-charset attribute to the current comment form.)

DWiki has on-disk caches of data created with Python's pickle module. I'll have to make sure that the code reads and writes these objects as bytestrings and in binary mode, without trying to encode or decode the data (in my current code, I read and write the pickled data myself, not through the pickle module).
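
A sketch of the binary-only handling (the function names are made up, and this ignores whatever else the real cache code has to do):

    import pickle

    def save_cache(path, obj):
        # pickle data is bytes; write it in binary mode, untouched.
        with open(path, "wb") as fp:
            fp.write(pickle.dumps(obj))

    def load_cache(path):
        with open(path, "rb") as fp:
            return pickle.loads(fp.read())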

The current DWiki code does some escaping of bad characters in text, because at one point control characters kept creeping in and blowing up my Atom feeds. This escaping should stay in a Python 3 Unicode world, where it will become more correct and reliable (currently it really operates on bytes, which has various issues).
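
Something along these lines, operating on a Unicode str instead of bytes (this is my sketch of the idea, not DWiki's actual escaping code):

    import re

    # Tab, newline, and carriage return are fine; the rest of the C0
    # controls (plus DEL) are what tend to break Atom feeds.
    _CONTROLS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

    def escape_controls(text):
        return _CONTROLS.sub(lambda m: "\\x%02x" % ord(m.group()), text)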

Since in real life most things are properly encoded and even mostly ASCII, mistakes in all of this might lurk undetected for some time. To deal with this, I should set up two torture test environments for DWiki, one where there is UTF-8 everywhere I can think of (including in file and directory names) and one where there is incorrectly encoded UTF-8 everywhere I can think of (or things just not encoded as UTF-8, but instead Latin-1 or something). Running DWiki against both of these would smoke out many problems and areas I've missed. I should also put together some HTTP tests with badly encoded headers and comment POST bodies and so on, although I'm not sure what tools are available to create deliberately incorrect HTTP requests like that.
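
One low-tech option is to skip HTTP libraries entirely and write the bad requests by hand over a socket, since a raw socket will happily send anything; a sketch (the target host is hypothetical):

    import socket

    def send_bad_request(host, port=80):
        # A header value that is deliberately not valid UTF-8.
        req = (b"GET / HTTP/1.1\r\n"
               b"Host: " + host.encode("ascii") + b"\r\n"
               b"X-Test: caf\xe9\xff\r\n"
               b"Connection: close\r\n\r\n")
        with socket.create_connection((host, port)) as sock:
            sock.sendall(req)
            return sock.recv(65536)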

All of this is clearly going to be a long term project and I've probably missed some areas, but at least I'm starting to think about it a bit. Also, I now have some preliminary steps I can take while DWiki is still a Python 2 program (although whether I'll get around to them is another question, as it always is these days with work on DWiki's code).

PS: Rereading my old entry has also reminded me that there are DWiki's logging messages as well. I'll just declare those to be UTF-8 and be done with it, since I can turn any Unicode into UTF-8. The rest of the log file may or may not be UTF-8, but I really don't care. Fortunately DWiki doesn't use syslog (although I've already wrestled with that issue).

Sidebar: DWiki's rendering templates and static file serving

DWiki has an entire home-grown template system that's used as part of the processing model. These templates should be declared to be UTF-8 and loaded as such, with it being a fatal internal error if they fail to decode properly.
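
A sketch of what 'fatal internal error' might look like here (the exception class and loader name are hypothetical):

    class TemplateError(Exception):
        pass

    def load_template(path):
        with open(path, "rb") as fp:
            raw = fp.read()
        try:
            # Strict decoding: a bad template is a configuration
            # problem, not something to paper over.
            return raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            raise TemplateError("template %s is not valid UTF-8: %s" % (path, exc))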

DWiki can also be configured to serve static files. In Python 3, these static files should be loaded uninterpreted as (binary mode) bytestrings and served back out that way, especially since they can be used for things like images (which are binary data to start with). Unfortunately this is going to require some code changes in DWiki's storage layer, because right now these static files are loaded from disk with the same code that is also used to load DWikiText pages, which have to be decoded to Unicode as they're loaded.
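
In sketch form, the static file path itself becomes trivial once it is separated out:

    def load_static(path):
        # Raw bytes in, raw bytes out; never decoded, suitable for
        # handing straight to WSGI as the response body.
        with open(path, "rb") as fp:
            return fp.read()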

DWikiPython3UnicodeIssues written at 01:01:29

2018-12-15

Python 3's approach to filenames and arguments is pragmatically right

A while back I read John Goerzen's The Python Unicode Mess, which decries the Python 3 mess of dealing with filenames and command line arguments on Unix that are not encoded in the program's assumed encoding. As Goerzen notes:

So if you want to actually handle Unix filenames properly in Python, you:

  • Must have a processing path that fully avoids Python strings.
  • Must use sys.{stdin,stdout}.buffer instead of just sys.stdin/stdout
  • Must supply filenames as bytes to various functions. See PEP 0471 for this comment: “Like the other functions in the os module, scandir() accepts either a bytes or str object for the path parameter, and returns the DirEntry.name and DirEntry.path attributes with the same type as path. However, it is strongly recommended to use the str type, as this ensures cross-platform support for Unicode filenames. (On Windows, bytes filenames have been deprecated since Python 3.3).” So if you want to be cross-platform, it’s even worse, because you can’t use str on Unix nor bytes on Windows.
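
As a hedged illustration of the all-bytes path Goerzen is describing (the function name here is mine, not his):

    import os
    import sys

    def list_names_raw(dirpath=b"."):
        # A bytes path in means bytes names out; nothing is ever a
        # str, so nothing can fail to decode.
        for entry in os.scandir(dirpath):
            sys.stdout.buffer.write(entry.name + b"\n")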

Back when it was new, Python 3 was very determined that Unix was Unicode/UTF-8. Years ago this was a big reason that I said you should avoid it from the perspective of a Unix sysadmin. These days things are better; we have things like os.environb and a relatively well defined way of handling sys.argv. This ultimately comes from PEP 383, which gave us the 'surrogateescape' error handler (see the codecs module).

All of this is irritating and unpleasant. Unfortunately, it's also the pragmatically right answer for reasons that PEP 383 alludes to, although it doesn't describe them the way that I would. PEP 383 says:

On the other hand, Microsoft Windows NT has corrected the original design limitation of Unix, and made it explicit in its system interfaces that these data (file names, environment variables, command line arguments) are indeed character data, by providing a Unicode-based API [...]

Let me translate this: filenames, command line arguments, and so on are no longer portable abstractions. They fundamentally mean different things on Unix and on Windows. On Windows, they are 'Unicode' (actually UTF-16) and may include characters not representable as single bytes, while on Unix they are and remain bytes and may include any byte value or sequence except 0. These are two incompatible types, especially once people start encoding non-ASCII filenames or command line arguments on Unix and want their programs to understand the decoded forms in Unicode.

(Or, if you prefer to flip this around, when people start using non-ASCII filenames and command line arguments and so on on Windows and want their programs to understand those as Unicode strings and characters.)

This is a hard problem and modern Python 3 has made the pragmatic choice that it's not going to pretend that things are portable when they aren't (early Python 3 tried to some extent and that blew up in its face). If you are working in the happy path on Unix where you're dealing with properly encoded data, you can ignore this by letting Python 3 automatically decode things to Unicode strs; otherwise, you must work with the raw (Unix) values, and Python 3 will provide them if you ask (and will surface at least some of them by default).
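
The surrogateescape round trip is the key mechanism here; a small demonstration:

    import os

    raw = b"caf\xe9"                  # Latin-1 bytes, not valid UTF-8
    name = os.fsdecode(raw)           # under a UTF-8 locale: 'caf\udce9'
    assert os.fsencode(name) == raw   # the original bytes come back intact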

(There are other possible answers but I think that they're all worse than Python 3's current ones for reasons beyond the scope of this entry. For instance, I think that having os.listdir() return a different type on Windows than on Unix would be a bad mistake.)

I'll note that Python 2 is not magically better than Python 3 here. It's just that Python 2 chose to implicitly prioritize Unix over Windows by deciding that filenames, command line arguments, and so on were bytestrings instead of Unicode strings. I rather suspect that this caused Windows people using Python a certain amount of heartburn; we probably just didn't hear as much from them for various reasons.

(You can argue about whether or not Python 3 should have made Unicode the fundamental string type, but that decision was never a pragmatic one and it was made by Python developers very early on. Arguably it's the single decision that created 'Python 3' instead of an ongoing evolution of Python 2.)

PS: This probably counts as me partially or completely changing my mind about things I've said in the past. So be it; time changes us all, and I certainly have different and more positive views on Python 3 now.

Python3PragmaticFilenames written at 00:58:48

