Unicode code points and abstract characters

July 19, 2012

In yesterday's entry I mentioned in passing that Unicode code points were both more and less than abstract characters. Since this is the kind of statement that might raise people's eyebrows, I figure that I should explain (and justify) it.

Unicode code points are more than just abstract characters because of the presence of combining characters. Combining characters turn Unicode code points into a system for assembling an abstract character from components; you have the base character and then various accent marks and suchlike added on to it.

(I don't consider Unicode direction marks and other zero-width formatting characters to be on the same level as combining characters. My impression is that you can get by without ever using the formatting characters, while combining characters are a fundamental part of how you use Unicode.)
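
To make the combining character point concrete, here is a minimal sketch using Python 3's standard unicodedata module. The choice of 'é' as the example character is mine, purely for illustration; it shows one abstract character being written either as a single code point or as a base character plus a combining accent.

```python
import unicodedata

# The same abstract character, "e with acute accent", written two ways.
precomposed = "\u00e9"   # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points: "e" + COMBINING ACUTE ACCENT

# Python 3 strings are sequences of code points, so the lengths differ
# and a plain comparison does not consider the two strings equal.
print(len(precomposed), len(decomposed))
print(precomposed == decomposed)

# A non-zero combining class marks U+0301 as a combining character.
print(unicodedata.combining("\u0301"))

# Normalization converts between the assembled and the component forms.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

The practical consequence is that comparing text "by abstract character" generally means normalizing both sides first, because the code point sequences themselves can differ.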

Unicode code points are less than abstract characters because of Han unification. I say this because the practical upshot of Han unification is that a Unicode code point by itself is not enough information to display the right glyph to someone in all circumstances; you also need to know their locale. This makes the real abstract character the combination of the Unicode code point and the locale information.

(Some of this is a matter of people's preferences, but there are apparently at least some cases where people will not recognize or understand the wrong glyph. See the discussion and the links in my old entry on how Unicode is not simple. The whole issue is complicated and contentious.)
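
As a rough illustration of the Han unification point, the following Python 3 sketch pairs a code point with a language tag. The LocalizedChar type and the tags "ja" and "zh-Hans" are hypothetical names I made up for this example, not part of any library, and U+76F4 is picked only because it is often cited as a character drawn differently in Chinese and Japanese typography.

```python
import unicodedata
from typing import NamedTuple


class LocalizedChar(NamedTuple):
    """Hypothetical pairing of a code point with the reader's language."""
    char: str
    lang: str  # a language tag such as "ja" or "zh-Hans" (illustrative)


han = "\u76f4"  # a single unified ideograph

# The code point's name says nothing about which language it belongs to.
print(unicodedata.name(han))

japanese = LocalizedChar(han, "ja")
chinese = LocalizedChar(han, "zh-Hans")

# The code points are identical, so the string alone cannot tell a
# renderer which regional glyph to draw.
print(japanese.char == chinese.char)

# Only the (code point, language) pair distinguishes the two cases.
print(japanese == chinese)
```

This is only a sketch of the idea that the locale has to travel alongside the text; in practice that information usually lives outside the string, for example in document markup or in font selection.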

Written on 19 July 2012.

Last modified: Thu Jul 19 01:18:38 2012