A wish: setting Python 3 to do no implicit Unicode conversions

November 12, 2014

In light of the lurking Unicode conversion issues in my DWiki port to Python 3, one of the things I've realized I would like in Python 3 is some way to turn off all of the implicit conversions to and from Unicode that Python 3 currently does when it talks to the outside world.

The goal here is the obvious one: since any implicit conversion is a place where I need to consider how to handle errors, character encodings, and so on, making them either raise errors or produce bytestrings would allow me to find them all (and force me to handle things explicitly). Right now many implicit conversions can sail quietly past because they only have to deal with valid input or simple output, only to blow up in my face later.
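
As a small illustration of how such a conversion can sail past and then blow up, here is a minimal sketch. It is not DWiki code: read_text() and read_bytes() are hypothetical helpers, and io.TextIOWrapper with an explicit UTF-8 encoding stands in for whatever default a text-mode open() would pick up from the environment.

```python
import io


def read_text(raw):
    """Read bytes through a text wrapper, the way a text-mode open() would.

    The bytes -> str conversion happens inside the wrapper, so the caller
    never sees a decode call. UTF-8 here is an assumed stand-in for the
    locale default.
    """
    stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")
    return stream.read()


def read_bytes(raw):
    """Read the same data with no conversion at all; the result stays bytes."""
    return io.BytesIO(raw).read()
```

With valid input such as b"plain ASCII", read_text() quietly returns a str. With b"caf\xe9" (Latin-1 bytes, not valid UTF-8) it raises UnicodeDecodeError at read time, far from any code that looks like it deals with encodings. read_bytes() never converts, so it hands back the bytes untouched in both cases.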

(Yes, in a greenfield project you would be paying close attention to all places where you deal with the outside world. Except of course for the ones that you overlook because you don't think about them and they just work. DWiki is not in any way a greenfield project and in Python 2 it arrogantly doesn't use Unicode at all.)

It's possible that you can fake this by setting your (Unix) character encoding to either an existing encoding that is going to blow up on UTF-8 input and output (including plain ASCII) or to a new Python encoding that always errors out. However, this gets me down into the swamps of default Python encodings and how to change them, which I'm not sure I want to venture into. I'd like either an officially supported feature or an easy hack. I suspect that I'm dreaming on the former.
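
For what a "new Python encoding that always errors out" could look like, here is a minimal sketch using the standard codecs module. The codec name "refuse" and the helper conversion_refused() are made up for this example. The sketch only covers the codec itself; getting Python to use it as the default encoding is the separate (swampy) problem, and stream use through io.TextIOWrapper would likely need incremental encoder and decoder classes as well, which are omitted here.

```python
import codecs

CODEC_NAME = "refuse"  # hypothetical name, chosen for this sketch


def _refuse_encode(text, errors="strict"):
    # Raise no matter what the input is, even for plain ASCII.
    raise UnicodeError("str -> bytes conversion attempted")


def _refuse_decode(data, errors="strict"):
    # Same for the other direction.
    raise UnicodeError("bytes -> str conversion attempted")


def _search(name):
    # Codec search functions receive the normalized (lowercased) name
    # and return a CodecInfo, or None if they do not know the name.
    if name == CODEC_NAME:
        return codecs.CodecInfo(
            name=CODEC_NAME,
            encode=_refuse_encode,
            decode=_refuse_decode,
        )
    return None


codecs.register(_search)


def conversion_refused(operation):
    """Return True if calling operation() raises UnicodeError."""
    try:
        operation()
    except UnicodeError:
        return True
    return False
```

After registration, both "abc".encode("refuse") and b"abc".decode("refuse") raise UnicodeError, so any code path that ends up using this encoding announces itself immediately instead of working by accident.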

(I suspect that there are currently places in Python 3 that both always perform a conversion and don't provide an API to set the character encoding for the conversion. Such places are an obvious problem for an official 'conversion always produces errors' setting.)

Comments on this page:

By Ewen McNeill at 2014-11-12 04:48:34:

AFAICT from some quick research, you can write your own encoding/decoding by writing a codec, and then register it under a name with codecs.register().

It appears that you can use the sys.setdefaultencoding() kludge in Python 2 to set the default encoding, but the same trick doesn't seem to work in Python 3 (so it's probably accidental that it worked in Python 2 -- induced by the reload(), which I don't see in Python 3).

Following the breadcrumbs, it appears one is supposed to use a site module, or perhaps more accurately a "sitecustomize" module, which may get run before setdefaultencoding() is removed. But I've not tried that. If it does get to call sys.setdefaultencoding() (ie, before it's removed from sys's namespace) then that might be the most direct way to do it for testing in Python 3. (Or save a reference to that function in some other namespace?)
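
A quick way to check what a given interpreter actually offers is sketched below. The function name default_encoding_report() is made up for this example; the calls it makes are standard library ones. On CPython 3 the default encoding is reported as utf-8 and sys has no setdefaultencoding attribute, so there is nothing for a sitecustomize module to call.

```python
import locale
import sys


def default_encoding_report():
    """Collect the encodings this interpreter uses by default."""
    return {
        # What str.encode() and bytes.decode() use when given no argument.
        "default": sys.getdefaultencoding(),
        # Whether the Python 2 era setter is reachable at all.
        "has_setter": hasattr(sys, "setdefaultencoding"),
        # Used for file names and other OS-level strings.
        "filesystem": sys.getfilesystemencoding(),
        # What a text-mode open() falls back to when no encoding is given.
        "locale": locale.getpreferredencoding(False),
    }
```

The "locale" and "filesystem" entries depend on the environment, which is why they are reported here instead of assumed.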


PS: Picking a character set with a 2-byte or 4-byte encoding (maybe one of the Asian ones?) that is unlikely to ever work for an entire string in your existing code/filesystem might be sufficient for most of your testing. That assumes you're planning on explicitly specifying the real encoding everywhere in your code; the default default is utf-8 in Python 3 (and "ascii" in Python 2 AFAICT). But that may or may not affect the default encoding of the Python source read in too... so you may have to explicitly mark those files as utf8.
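
How well a multi-byte encoding works as a tripwire can be checked directly. The sketch below uses UTF-16 and UTF-32 as stand-ins (an assumption for this example, not the Asian encodings suggested above), and decodes_cleanly() is a hypothetical helper.

```python
def decodes_cleanly(data, encoding):
    """Return True if data decodes under encoding without an error."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True
```

Under these assumptions, UTF-32 rejects ordinary ASCII bytes such as b"hello" or b"hell", because four ASCII bytes never form a valid code point. UTF-16 is weaker: b"hello" fails (odd length), but b"hell" decodes without complaint into two unrelated characters. Note also that this only catches decoding; "hello".encode("utf-32") succeeds and just produces output nobody wanted, so the encoding direction needs some other check.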

Written on 12 November 2014.

Last modified: Wed Nov 12 01:31:01 2014