Unicode is not simple

March 1, 2006

Unicode is very big these days, and there are a lot of people who will tell you that Unicode is simple once you take the effort to understand it and that you're a parochial spud if your new program doesn't support it. Unfortunately, only simple Unicode is easy.

The simple vision of Unicode is that once you have your data in Unicode, you're just dealing with characters and your program can just do stuff with them as normal. Okay, case and collation are hard, but there are OS and library services for that, and rendering has some interesting challenges, but again the OS people have done that for you.

The problem is that this simple vision isn't true. It would be true if there were a one-to-one mapping between Unicode codepoints and glyphs, but at least three things I know about get in the way of that:

  1. zero-width formatting characters, which mean some codepoints aren't glyphs.
  2. combining characters, which mean that there are multiple ways to make the same glyph.
  3. Han unification, which means some codepoints have to display significantly different glyphs depending on what language the text is in.

(In Unicode terminology, a codepoint is a single numbered slot in Unicode, like U+0061; a character is the abstract thing that this codepoint represents, like LATIN SMALL LETTER A; and a glyph is the (abstract) visual representation of a character, like 'a'.)

Zero width formatting characters are things like zero width spaces or text direction markers. They make it harder to divide up or truncate words (you'll get very funny results in some cases) and easier to create strings that are represented differently but look the same to users.
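To make the 'look the same but aren't' problem concrete, here is a minimal Python sketch. The strings and the `strip_format_chars` helper are made up for illustration, and dropping every codepoint in general category Cf is an assumption that is too blunt for real use: zero width joiners and non-joiners carry meaning in some scripts.

```python
import unicodedata

plain = "password"
sneaky = "pass\u200bword"   # a ZERO WIDTH SPACE hidden in the middle

def strip_format_chars(text):
    """Drop every codepoint whose general category is Cf (format)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

print(plain == sneaky)                      # False, although both render alike
print(len(plain), len(sneaky))              # 8 9
print(strip_format_chars(sneaky) == plain)  # True
```

Whether you strip such characters, reject input that contains them, or keep them depends on what the string is for; the sketch only shows that a naive equality test is fooled.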

Some glyphs are represented not by a single Unicode character but by a base character plus combining characters such as accents. Many common accented glyphs can be represented more than one way; there is a 'precomposed character' for them, plus one or more composite forms. For example, å is both U+00E5 and U+0061 plus U+030A.
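The two spellings of å can be inspected with Python's standard `unicodedata` module. This is only a sketch; the variable names are mine.

```python
import unicodedata

precomposed = "\u00e5"   # å as a single precomposed codepoint
combining = "a\u030a"    # 'a' followed by a combining ring

for label, s in (("precomposed", precomposed), ("combining", combining)):
    # List each codepoint in the string with its official character name.
    print(label, [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in s])

# The strings should render identically, but they are different
# sequences of codepoints, so a plain comparison says they differ.
print(precomposed == combining)          # False
print(len(precomposed), len(combining))  # 1 2
```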

Among other things, this means that a correct program must normalize Unicode strings before comparing them, using its choice of four different normalization forms, NFC, NFD, NFKC, and NFKD (see, for example, here; read the comments, they're informative). Also, your code can't just blindly lop codepoints off words to do stuff; if you do, you can turn Årne Svensen into A. Svensen, and he may not be too pleased with that.

(See also the Unicode normalization FAQ or Markus Kuhn's FAQ.)
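As a rough sketch of both points in Python: `same_text` and `first_letter` are hypothetical helpers, the choice of NFC for comparison is an assumption, and `first_letter` only handles a base character followed by combining marks with a nonzero combining class. Proper splitting needs the grapheme cluster rules, which the standard library does not implement.

```python
import unicodedata

def same_text(a, b, form="NFC"):
    """Compare two strings after putting both in one normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

def first_letter(name):
    """Return the first base character plus the combining marks attached to it."""
    decomposed = unicodedata.normalize("NFD", name)
    end = 1
    # combining() returns 0 for characters that are not combining marks.
    while end < len(decomposed) and unicodedata.combining(decomposed[end]):
        end += 1
    return unicodedata.normalize("NFC", decomposed[:end])

name_combining = "A\u030arne Svensen"   # Å spelled as 'A' plus a combining ring
name_precomposed = "\u00c5rne Svensen"  # Å as one precomposed codepoint

print(name_combining == name_precomposed)            # False
print(same_text(name_combining, name_precomposed))   # True
print(name_combining[0])                             # A (the ring is lost)
print(first_letter(name_combining) == "\u00c5")      # True
```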

The really troublesome one is Han unification. As part of Han unification, the same codepoint was assigned to the same logical character in the CJK languages, even if different languages used somewhat different glyphs for the character. For example, U+8349 is the 'grass' character for all the CJK languages, but Traditional Chinese uses a different glyph for it. Thus, to properly display something that includes U+8349 to users, you must know what language that section of text is written in. This isn't just a theoretical issue; this LiveJournal entry shows the sort of things that do happen in the real world because the language of Unicode text isn't marked.

(Pop quiz: which version of 草 is shown here for you?)
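Since the codepoint alone can't tell you which glyph to use, the language has to travel with the text somehow. One way to sketch that in Python is below; the `TaggedText` type and the use of BCP 47 style tags such as "ja" and "zh-Hant" are my assumptions, not something Unicode itself gives you.

```python
from dataclasses import dataclass
import unicodedata

@dataclass(frozen=True)
class TaggedText:
    """A run of text plus the language it is written in (hypothetical type)."""
    text: str
    lang: str   # a language tag such as "ja" or "zh-Hant"

grass_ja = TaggedText("\u8349", "ja")
grass_tw = TaggedText("\u8349", "zh-Hant")

# The codepoints are identical, so the string alone cannot pick a glyph.
print(grass_ja.text == grass_tw.text)    # True
print(unicodedata.name(grass_ja.text))   # CJK UNIFIED IDEOGRAPH-8349

# Only the attached tag distinguishes the two runs.
print(grass_ja == grass_tw)              # False
```

Actually choosing the right glyph is then up to whatever renders the text, for example by picking a language-appropriate font.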

The really dangerous thing about the 'Unicode is simple' meme and simple Unicode handling is that it usually works, especially in Europe. Most of the time you will get Unicode that uses precomposed characters instead of combining characters, or at least has the combining characters in a normalized form. Most of the time Peter Påderson's system will encode his name the same way. Most of the time you will be dealing with monolingual text in the user's own locale, where Han unification issues won't matter. Most of the time you won't get input with deliberately introduced zero width formatting characters.

Most of the time.

Most of the time is no way to tell people to program.

Sidebar: More on Han unification

Han unification is (and was) a politically charged thing, especially as Taiwan and the PRC use different character sets (Traditional versus Simplified Chinese). For more information and references on the whole issue:


Comments on this page:

By DanielMartin at 2006-03-04 22:07:22:

Especially note that the various totally-denormalized encoding possibilities of UTF-8 (such as overlong byte sequences) have long been used to trigger path-traversal exploits on Windows machines.
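As an illustration of the overlong case, the bytes 0xC0 0xAF are a two-byte spelling of "/" that a lenient decoder might accept after a path check has already run. A strict decoder rejects them, as Python's does. This is a minimal sketch; the `decode_strict` helper and the sample path are made up.

```python
overlong_slash = b"..\xc0\xaf.."   # 0xC0 0xAF is an overlong encoding of "/"

def decode_strict(raw):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None

print(decode_strict(b"../.."))        # ../..
print(decode_strict(overlong_slash))  # None
```

The general point is to decode once, strictly, and to run security checks on the decoded text rather than on the raw bytes.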

These days, you get spammers encoding their stuff in non-normal UTF-8, using Punycode-encoded domain names to spoof official names, and doing things like writing "administrators" as "a<RtoL>rotartsinimd<POPdir>s", where "RtoL" is the Unicode character for "begin right-to-left text" and "POPdir" is the Unicode character for "pop direction instruction". In other words, Unicode opens up huge new areas that are seriously hard to secure and nail down. Once you update your anti-spam engine for this abuse of Unicode by spammers, they discover another uncharted backwater of the Unicode standard that lets them mangle their message, and another, and another, etc. This leads to more processing power needed all around to decode Unicode into something normal, and the upwards march of processing power and abstraction layers goes on.
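The "administrators" trick can be reproduced and detected with a few lines of Python. In this sketch I assume "RtoL" means U+202E (RIGHT-TO-LEFT OVERRIDE) and "POPdir" means U+202C (POP DIRECTIONAL FORMATTING). The `BIDI_CONTROLS` set is my own list of the embedding, override, and isolate controls, and it deliberately leaves out the plain direction marks.

```python
# Explicit directional embedding, override, and isolate controls.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}

# Meant to display like "administrators" in a bidi-aware renderer,
# but the stored codepoints are something else entirely.
spoofed = "a\u202erotartsinimd\u202cs"

def has_bidi_controls(text):
    """True if the text contains any of the controls listed above."""
    return any(ch in BIDI_CONTROLS for ch in text)

def strip_bidi_controls(text):
    """Remove the controls, exposing the order the codepoints are stored in."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)

print(has_bidi_controls(spoofed))           # True
print(strip_bidi_controls(spoofed))         # arotartsinimds
print(has_bidi_controls("administrators"))  # False
```

Stripping is only safe where no legitimate right-to-left text is expected; for mixed-direction text, flagging the string for review is probably the more cautious choice.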

Unicode is an amalgamated standard. That is, it's a bunch of rather different bits glommed and stuck together, so making assertions about what text of a certain form can and cannot be becomes incredibly difficult (and possibly changes as the standard is updated). Partially, this is because the problem space it's trying to solve - represent all the world's languages - is unbelievably ill-defined and large. The different languages of the world just don't have that much in common with each other, so of course you end up with things mixed together. Also, the pre-existing standards Unicode is expected to interoperate with don't always fit the Unicode data model of distinguishing carefully between codepoint, glyph, and combining mark.

As a result, you end up with a character set that is almost a programming language, and we all know the problem about trying to nail down what a given chunk of code is going to do before you execute (i.e. display) it.

Written on 01 March 2006.

Last modified: Wed Mar 1 17:32:12 2006