My arrogance about Unicode and character encodings

July 16, 2012

Yesterday I described how I could get away with ignoring encoding issues and thus why being forced to deal with Unicode was and is irritating. However, there is a gotcha in my approach, one that hides behind a bit of arrogance. Let me repeat the core bit of how my programs typically work:

What they process is a mixture of ASCII (for keywords, directives, and so on, all of the things the program had to interpret) and uninterpreted bytestrings, which are simply regurgitated to the user as-is in appropriate situations.
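
(As a concrete sketch of this style of processing, here's a minimal Python illustration; the '#include' directive format is made up for the example, not taken from any real program of mine. The code interprets the ASCII it cares about and passes every other byte through untouched.)

    # Minimal sketch: interpret ASCII directives, regurgitate all other
    # bytes uninterpreted. The '#include' directive format is made up.
    def process(data: bytes) -> bytes:
        out = []
        for line in data.split(b'\n'):
            if line.startswith(b'#include '):
                # An ASCII directive the program must interpret.
                out.append(b'<contents of ' + line[len(b'#include '):] + b'>')
            else:
                # An uninterpreted bytestring, passed through as-is.
                out.append(line)
        return b'\n'.join(out)

    print(process('#include config\nplain text\nnaïve café'.encode('utf-8')))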

This simple, reasonable description contains an assumption: this approach assumes that any encoding will be a superset of ASCII, because it assumes that code can extract plain text ASCII from a file without knowing the file's encoding. This works if and only if the file's actual encoding is implemented as ASCII plus other stuff hiding around the edges, which is true for many encodings including UTF-8 but not for all of them.
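
(A quick demonstration of where this assumption holds and where it breaks: searching for an ASCII keyword at the byte level works on UTF-8 text but fails on UTF-16, because UTF-16 interleaves NUL bytes with the ASCII byte values.)

    text = 'keyword: naïve'

    # UTF-8 is a superset of ASCII, so the keyword is findable as raw bytes.
    print(b'keyword' in text.encode('utf-8'))     # True

    # UTF-16 is not: 'k' becomes b'k\x00', so the ASCII bytes never
    # appear as a contiguous run.
    print(b'keyword' in text.encode('utf-16'))    # False
    print(text.encode('utf-16-le')[:8])           # b'k\x00e\x00y\x00w\x00'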

This is the arrogance of my blithe approach to ignoring character encoding issues. It assumes either that all character sets are a superset of ASCII or that any exceptions are sufficiently uncommon that I don't have to care about them. Of course, by assuming that my programs will never be used by people with such character sets I've ensured that they never will be.

The conclusion that I draw from this is that I can't ignore character encoding unless I'm willing to be somewhat arrogant. The pain of dealing with decoding and encoding issues is simply the price of not being arrogant.

(On the other hand it's still very tempting to be arrogant this way, for reasons that boil down to 'I can get away with it because the environments where it matters are probably quite rare, and it's much easier'.)


Comments on this page:

From 89.243.102.89 at 2012-07-16 16:56:22:

Looks like that applies to all the CJK legacy encodings.

I thought I'd seen someone stating that non-ASCII locales were explicitly not supported... I assume it applied to Linux only. But it would tend to break any string exchanges with a kernel API. dmesg, mount options, sysfs attributes... the shebang, and PATH.

POSIX almost outright warns you not to do it. The results of switching between locales that code characters in the "portable character set" differently are "unspecified". And the portable character set is something like 85% of ASCII. (Hopefully it's just the control characters they elided; I haven't checked for sure.)

I don't think you should feel guilty about it on UNIX. Windows is different - I don't think your original argument applies there, because Windows is much more serious about filenames being Unicode.

- Alan
By cks at 2012-07-17 11:09:53:

Even for filenames, I think that a lot depends on the specifics of the encoding. In Unix you can get away with an encoding that avoids ever generating either null bytes or a '/' (and ideally avoids generating control characters), and I believe that at least some CJK legacy encodings are structured like this.
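
(A quick check with Python's shift_jis codec, assuming it faithfully reflects the encoding's byte structure: no non-ASCII character that Shift JIS can encode produces a null byte or a '/' byte.)

    # Check every BMP codepoint that Shift JIS can encode: none of the
    # non-ASCII ones should produce a NUL or '/' byte.
    bad = set()
    for cp in range(0x80, 0x10000):
        try:
            enc = chr(cp).encode('shift_jis')
        except UnicodeEncodeError:
            continue
        if b'\x00' in enc or b'/' in enc:
            bad.add(cp)
    print(bad)    # set() -- no offenders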

(Technically the thing that destroys my arrogant approach is an encoding that either generates ASCII byte values as part of encoding non-ASCII characters or encodes the basic Latin characters in some other way than as the usual ASCII bytes. You can also get into corner cases like Shift JIS, where you're missing a few of the punctuation characters.)
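
(The classic demonstration of the first failure mode is Shift JIS's trail bytes: the second byte of a two-byte character can be an ASCII value. For example, U+8868 encodes as the bytes 0x95 0x5C, and 0x5C is ASCII backslash.)

    # Shift JIS can put ASCII byte values inside multi-byte characters.
    enc = '表'.encode('shift_jis')   # U+8868
    print(enc)              # b'\x95\\'
    print(b'\\' in enc)     # True -- a spurious ASCII backslash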
