Unicode code points and abstract characters

In yesterday's entry I mentioned in passing that Unicode code points were both more and less than abstract characters. Since this is the kind of statement that might raise people's eyebrows, I figure that I should explain (and justify) it.

Unicode code points are more than just abstract characters because of the presence of combining characters. Combining characters turn Unicode code points into a system for assembling an abstract character from components; you have the base character and then various accent marks and suchlike added on to it.
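As a small illustration of that assembly, here is a sketch using Python's standard `unicodedata` module. The variable names and the `describe` helper are mine, made up for this example; the point is only that one abstract character ("é") can be a single code point or a base character plus a combining mark, and that normalization converts between the two spellings.

```python
import unicodedata

# Two ways to spell the abstract character "e with acute accent".
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
assembled = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT, two code points


def describe(text):
    """Return (code point, name, combining class) for each code point in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in text
    ]


# As sequences of code points the two strings are different...
same_code_points = precomposed == assembled

# ...but normalization maps one spelling onto the other.
same_after_nfc = unicodedata.normalize("NFC", assembled) == precomposed
same_after_nfd = unicodedata.normalize("NFD", precomposed) == assembled
```

A non-zero combining class (the third item in each tuple from `describe`) is how the Unicode database marks a code point as something that attaches to a preceding base character.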

(I don't consider Unicode direction marks and other zero-width formatting characters to be on the same level as combining characters. My impression is that you can get by without using the formatting characters, while combining characters are a fundamental part of how you use Unicode.)
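For what it's worth, the Unicode character database does put the two kinds of code points in different general categories. The following is a minimal sketch, again with Python's `unicodedata`; the helper names are hypothetical. It only shows how the database classifies these code points and doesn't settle whether you can really do without the formatting ones.

```python
import unicodedata

LEFT_TO_RIGHT_MARK = "\u200e"   # a zero-width direction mark
COMBINING_ACUTE = "\u0301"      # a combining accent


def is_format_char(ch):
    """True if the code point is in general category Cf (format)."""
    return unicodedata.category(ch) == "Cf"


def is_combining_mark(ch):
    """True if the code point is in one of the mark categories (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")
```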

Unicode code points are less than abstract characters because of Han unification. I say this because the practical upshot of Han unification is that a Unicode code point by itself is not enough information to display the right glyph to someone in all circumstances; you also need to know their locale. This makes the real abstract character the combination of the Unicode code point and the locale information.
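To make the "code point plus locale" idea concrete, here is a rough sketch in Python. The `LocalizedChar` type and the locale tags are assumptions of mine for illustration, not anything Unicode itself defines. I've used U+76F4, which is often cited as a unified ideograph whose preferred glyph differs between Japanese and Chinese typography.

```python
from typing import NamedTuple


class LocalizedChar(NamedTuple):
    """Hypothetical 'real abstract character': a code point plus a locale tag."""
    code_point: str
    locale: str


UNIFIED = "\u76f4"  # CJK UNIFIED IDEOGRAPH-76F4

japanese = LocalizedChar(UNIFIED, "ja")
chinese = LocalizedChar(UNIFIED, "zh-Hans")

# The bare code points cannot tell the two apart...
same_code_point = japanese.code_point == chinese.code_point

# ...but the (code point, locale) pairs can, which is the extra
# information a renderer needs in order to pick the right glyph.
same_abstract_char = japanese == chinese
```

In practice the locale usually travels out of band (a document language tag, a font choice, a user setting) rather than alongside each character like this.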

(Some of this is a matter of people's preferences, but there are apparently at least some cases where people will not recognize or understand the wrong glyph. See the discussion and the links in my old entry on how Unicode is not simple. The whole issue is complicated and contentious.)

tech/UnicodeVersusCharacters written at 01:18:38
