tech/UnicodeIsNotSimple written at 17:32:12; Add Comment
Unicode is not simple
Unicode is very big these days, and there are a lot of people who will tell you that Unicode is simple once you take the effort to understand it and that you're a parochial spud if your new program doesn't support it. Unfortunately, only simple Unicode is easy.
The simple vision of Unicode is that once you have your data in Unicode, you're just dealing with characters and your program can just do stuff with them as normal. Okay, case and collation is hard, but there are OS and library services for that, and rendering has some interesting challenges, but again the OS people have done that for you.
The problem is that this simple vision isn't true. It would be true if there was a one to one mapping between Unicode codepoints and glyphs, but at least three things I know about get in the way of that:
(In Unicode terminology, a codepoint is a single bit of Unicode, like U+0061; a character is the abstract thing that this codepoint represents, like LATIN SMALL LETTER A; and a glyph is the (abstract) visual representation of a character, like 'a'.)
Zero width formatting characters are things like zero width spaces or text direction markers. They make it harder to divide up or truncate words (you'll get very funny results in some cases) and easier to create strings that are represented differently but look the same to users.
Some glyphs are represented not by a single Unicode character but by a base character plus combining characters such as accents. Many common accented glyphs can be represented more than one way; there is a 'precomposed character' for them, plus one or more composite forms. For example, å is both U+00E5 and U+0061 plus U+030A.
Among other things, this means that a correct program must normalize Unicode strings before comparing them, using its choice of four different normalization forms (see, for example, here; read the comments, they're informative). Also, your code can't just blindly lop codepoints off words to do stuff; if you do, you can turn Årne Svensen into A. Svensen, and he may not be too pleased with that.
The really troublesome one is Han unification. As part of Han unification, the same codepoint was assigned to the same logical character in the CJK languages, even if different languages used somewhat different glyphs for the character. For example, U+8349 is the 'grass' character for all the CJK languages, but Traditional Chinese uses a different glyph for it. Thus, to properly display something that includes U+8349 to users, you must know what language that section of text is written in. This isn't just a theoretical issue; this LiveJournal entry shows the sort of things that do happen in the real world because the language of Unicode text isn't marked.
(Pop quiz: which version of 草 is shown here for you?)
The really dangerous thing about the 'Unicode is simple' meme and simple Unicode handling is that it usually works, especially in Europe. Most of the time you will get Unicode that uses precomposed characters instead of combining characters, or at least has the combining characters in a normalized form. Most of the time Peter Påderson's system will encode his name the same way. Most of the time you will be dealing with monolingual text in the user's own locale, where Han unification issues won't matter. Most of the time you won't get input with deliberately introduced zero width formatting characters.
Most of the time.
Most of the time is no way to tell people to program.
Sidebar: More on Han unification
Han unification is (and was) a politically charged thing, especially as Taiwan and the PRC use different character sets (Traditional versus simplified Chinese). For more information and references on the whole issue:
* * *
Atom feeds are available; see the bottom of most pages.