Unicode's two new problems

July 24, 2012

If you want to convert code from dealing with more or less uninterpreted strings of bytes to dealing properly with character encodings, i.e. from using raw bytes to using Unicode, you will have two new fundamental problems. (You will also have a number of practical problems like sorting, but these can be addressed with suitable Unicode libraries and perhaps a certain amount of handwaving.)

These two problems are what to do with invalid input when you decode from byte strings into Unicode, and what to do with Unicode code points that can't be represented in the output character encoding. Both are new failure points introduced by adding Unicode, and neither can really be handled for you by a library, because what to do generally depends on the specific situation in your code.
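As a minimal sketch of the two failure points in Python (the helper names `decode_input` and `encode_output` are made up for illustration, and the error policies shown are just the ones built into Python's codecs, not a recommendation), something like this shows both where things fail and why the choice of what to do is yours:

```python
def decode_input(raw: bytes, encoding: str = "utf-8", on_error: str = "strict") -> str:
    """Decode input bytes. `on_error` names one of Python's codec error handlers."""
    return raw.decode(encoding, errors=on_error)


def encode_output(text: str, encoding: str, on_error: str = "strict") -> bytes:
    """Encode text for output. `on_error` names one of Python's codec error handlers."""
    return text.encode(encoding, errors=on_error)


if __name__ == "__main__":
    bad = b"caf\xe9"  # Latin-1 bytes; not valid UTF-8

    # Problem 1: invalid input. The default policy is to raise.
    try:
        decode_input(bad)
    except UnicodeDecodeError as e:
        print("decode failed:", e.reason)

    # Other policies: substitute U+FFFD, or smuggle the raw byte through.
    print(ascii(decode_input(bad, on_error="replace")))
    print(ascii(decode_input(bad, on_error="surrogateescape")))

    # Problem 2: code points the output encoding can't represent.
    text = "snowman \u2603"
    try:
        encode_output(text, "ascii")
    except UnicodeEncodeError as e:
        print("encode failed:", e.reason)

    # Other policies: lossy substitution, or some kind of escaping.
    print(encode_output(text, "ascii", on_error="replace"))
    print(encode_output(text, "ascii", on_error="xmlcharrefreplace"))
```

Which of these policies is right (or whether you need something else entirely, like rejecting the whole file) depends on what the program is doing with the data, which is the part a library can't decide for you.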

Having said that, there are various locales where one or the other problem does not apply. If the output locale is UTF-8 you will always be able to encode any Unicode code point (the same is true for less common encodings like UTF-16). Many input locales have no such thing as invalid input; all of their bytes and byte sequences map to Unicode code points (this is true of pretty much all encodings that just use single bytes, for example). UTF-8 is actually unusual in having plenty of invalid input sequences (so you win on output but you lose on input).
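A quick way to see this asymmetry in Python, as a rough sketch (the helper `decodes_cleanly` is a made-up name, and Latin-1 here simply stands in for "a single-byte encoding"):

```python
def decodes_cleanly(raw: bytes, encoding: str) -> bool:
    """Return True if `raw` is valid input in `encoding`."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Every possible byte value is valid Latin-1 input.
latin1_ok = all(decodes_cleanly(bytes([b]), "latin-1") for b in range(256))

# As standalone input, every byte from 0x80 up is invalid UTF-8.
utf8_bad = [b for b in range(256) if not decodes_cleanly(bytes([b]), "utf-8")]

# On output the situation reverses: UTF-8 can encode even the highest
# code point, while Latin-1 fails on anything above U+00FF.
highest = chr(0x10FFFF).encode("utf-8")
try:
    "\u2603".encode("latin-1")
    latin1_can_encode_snowman = True
except UnicodeEncodeError:
    latin1_can_encode_snowman = False

if __name__ == "__main__":
    print(latin1_ok, len(utf8_bad), highest, latin1_can_encode_snowman)
```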

This causes potentially pernicious problems if you develop in a well-done UTF-8 locale. Your tools will normally not generate invalid UTF-8 input and of course there's nothing you can't output; the result is that none of your error paths for input and output will get exercised. In fact you can get away without any error handling for decoding and encoding errors (this is easier in some languages than others).

If you care about handling both problems, you will need to test in a non-UTF-8 locale in order to provoke output encoding errors, and in a UTF-8 locale with deliberately broken input in order to create input decoding errors. Even if you only support using your code in a UTF-8 locale, you should test with invalid input, because you will almost certainly see it sooner or later.
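One way to exercise both error paths without actually changing your locale is to fake the streams, sketched here in Python. This assumes your code reads and writes through text file objects; `copy_lines` is a hypothetical stand-in for the code under test, and the two stream helpers are names invented for this example.

```python
import io


def copy_lines(src, dst) -> None:
    """Hypothetical code under test: copy text lines from src to dst."""
    for line in src:
        dst.write(line)


def broken_utf8_input(raw: bytes) -> io.TextIOWrapper:
    """A text stream that decodes `raw` as strict UTF-8."""
    return io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", errors="strict")


def ascii_output() -> io.TextIOWrapper:
    """A text stream that behaves like stdout in an ASCII-only locale."""
    return io.TextIOWrapper(io.BytesIO(), encoding="ascii", errors="strict")


def raises(exc_type, func, *args) -> bool:
    """Return True if calling func(*args) raises exc_type."""
    try:
        func(*args)
    except exc_type:
        return True
    return False


if __name__ == "__main__":
    # Input decoding error: deliberately broken UTF-8 input.
    print(raises(UnicodeDecodeError, copy_lines,
                 broken_utf8_input(b"ok\n\xff\xfe\n"), io.StringIO()))

    # Output encoding error: valid text, but a non-UTF-8 output encoding.
    print(raises(UnicodeEncodeError, copy_lines,
                 io.StringIO("na\u00efve\n"), ascii_output()))
```

With this `copy_lines` both checks report that an exception escaped, which is exactly the unhandled case; real tests would assert on whatever behavior you decided you wanted instead.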

(I suspect that there is lots of code in lots of languages that doesn't make any attempt to handle either problem, precisely because everything normally runs fine in a UTF-8 environment even if you don't.)

(None of this is at all new. I just feel like writing it down myself in one place where I can find it.)


Comments on this page:

From 109.78.106.124 at 2012-07-24 11:30:35:

Reminds me of "ungarbling" I've had to do many times

http://www.pixelbeat.org/docs/unicode_utils/

Last modified: Tue Jul 24 00:25:25 2012