HTML character sets

November 6, 2006

I started with a simple question that was vaguely orbiting in the back of my mind: are HTML numeric character entities (what the HTML spec calls numeric character references, things like &#8220;) always in Unicode, regardless of the character set of the HTML web page? The answer is yes, but it also turns out that I was asking a misleading question.

Despite being called 'charset' in HTTP headers and META declarations, what you are declaring is actually the web page's character encoding, not its character set. HTML is specified as always being in the Unicode character set; what can vary is how the document's characters are encoded as bytes. Since numeric character entities name characters by number instead of by encoded bytes, they are always in Unicode.

(All of this can be found in the W3C HTML 4.01 specification, in the section on HTML document representation, which is even relatively clear.)
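
Here is a small sketch of the encoding versus character set distinction, using Python's standard html.unescape() as a stand-in for what a browser does with numeric character entities. The sample text and the list of encodings are my own choices for illustration, not anything from the spec. The same fragment is turned into bytes in three different encodings, decoded back, and then has its entities resolved; the bytes differ but the resulting characters should not.

```python
import html

# A fragment with a literal e-acute plus two numeric character entities
# (decimal 233 is e-acute, hex 201C is a left 'smart quote').
text = "caf\u00e9 &#233; &#x201C;"


def resolve(fragment, encoding):
    """Encode the fragment as page bytes, then decode and resolve entities."""
    raw = fragment.encode(encoding)    # what would go over the wire
    decoded = raw.decode(encoding)     # 'logically converting' to Unicode
    return raw, html.unescape(decoded)


# Illustrative choice of encodings; any encoding that can represent
# the literal characters in the fragment would do.
encodings = ("iso-8859-1", "utf-8", "utf-16")
results = {enc: resolve(text, enc) for enc in encodings}

for enc, (raw, resolved) in results.items():
    print(enc, len(raw), repr(resolved))
```

The entity numbers are interpreted as Unicode code points no matter which encoding carried the page, which is why all three runs end up with the same string.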

Browsers display HTML by at least logically converting the incoming web page into Unicode and then figuring out how to render all of the characters. If you do not have Unicode fonts for everything, Firefox will hunt around through the fonts that you do have in various character set encodings, using its charset to Unicode maps in reverse to find one that has the necessary Unicode character. Interesting things happen if the fonts you have do not have all the characters that Firefox expects them to have.

(This is probably not an issue to people who are using stock Firefox builds with relatively stock fonts on modern Unix systems. I use a custom-compiled Firefox with a wacky set of old-school bitmap fonts as my default fonts, so periodically various 'smart quote' characters drop out on me and I get to go on another hunting expedition.)
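
To make the 'maps in reverse' idea concrete, here is a rough sketch of the lookup in Python. This is not Firefox's actual code; the function name and the list of font encodings are hypothetical, and Python's codecs stand in for the browser's charset to Unicode maps. For each font encoding, it asks whether the Unicode character has a slot in that encoding at all, and returns the first one that does.

```python
def find_font_encoding(char, font_encodings):
    """Return (encoding, encoded bytes) for the first font encoding that
    can represent char, or None if none of them can.

    font_encodings is a hypothetical list of the encodings of the
    installed fonts, in preference order.
    """
    for encoding in font_encodings:
        try:
            # Running the charset-to-Unicode map backwards: does this
            # Unicode character have a position in this encoding?
            return encoding, char.encode(encoding)
        except UnicodeEncodeError:
            continue
    return None


# An assumed set of old-school font encodings, for illustration only.
installed = ["iso8859-1", "koi8_r", "cp1252"]

print(find_font_encoding("\u00e9", installed))   # e-acute
print(find_font_encoding("\u201c", installed))   # left 'smart quote'
print(find_font_encoding("\u4e2d", installed))   # a CJK character
```

The catch is that this only tells you what a font in that encoding ought to contain. If the actual font file is missing the glyph for that slot, the lookup still succeeds and the character drops out at render time.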

Written on 06 November 2006.
