Chris's Wiki :: blog/tech/UnicodeIsNotSimple Commentshttps://utcc.utoronto.ca/~cks/space/blog/tech/UnicodeIsNotSimple?atomcommentsDWiki2006-03-05T03:07:22ZRecent comments in Chris's Wiki :: blog/tech/UnicodeIsNotSimple.By DanielMartin on /blog/tech/UnicodeIsNotSimpletag:CSpace:blog/tech/UnicodeIsNotSimple:05a512a7c259fb2530b1d28c3408c1bb04795596DanielMartin<div class="wikitext"><p>Especially note that the denormalized (non-shortest-form, or "overlong") encodings that a lenient UTF-8 decoder will accept have long been used to trigger path-traversal exploits on Windows machines.</p>
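<p>A minimal sketch of how such an exploit slips past a filter, assuming a hypothetical lenient decoder (<code>naive_utf8_decode_2byte</code> is made up for illustration and is not any real server's code). The bytes <code>C0 AF</code> are an overlong two-byte spelling of "/", so a byte-level check for "/" sees nothing, while the lenient decoder still produces a path separator. Python's built-in strict decoder rejects the same input.</p>

```python
def naive_utf8_decode_2byte(data: bytes) -> str:
    """Hypothetical lenient decoder (ASCII and 2-byte sequences only).

    It skips the shortest-form check that a conforming decoder must make,
    which is exactly the mistake the exploits relied on.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII byte.
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data):
            # Two-byte sequence: 110xxxxx 10yyyyyy, accepted without
            # verifying that the value really needs two bytes.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("byte not handled by this sketch")
    return "".join(out)


def strict_decode_ok(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Overlong "/" (0xC0 0xAF) hidden in a path.
payload = b"..\xc0\xaf..\xc0\xafetc"

slash_seen_in_bytes = b"/" in payload          # the naive filter's view
lenient = naive_utf8_decode_2byte(payload)     # what the lenient decoder yields
strict_accepts = strict_decode_ok(payload)     # what a conforming decoder does
```

<p>The design point is the order of operations: a check made on raw bytes before decoding is only meaningful if the decoder refuses every non-shortest form.</p>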
<p>These days, you get spammers encoding their stuff in non-normal UTF-8, using <a href="http://en.wikipedia.org/wiki/Punycode#Spoofing_concerns">Punycode</a>-encoded domain names to spoof official names, and doing things like writing "administrators" as "a&lt;RtoL&gt;rotartsinimd&lt;POPdir&gt;s", where "RtoL" is the Unicode character for "begin right-to-left text" (presumably U+202E RIGHT-TO-LEFT OVERRIDE) and "POPdir" is the Unicode character for "pop direction instruction" (U+202C POP DIRECTIONAL FORMATTING). In other words, Unicode opens up huge new areas that are seriously hard to secure and nail down. Once you update your anti-spam engine for one abuse of Unicode by spammers, they discover another uncharted backwater of the Unicode standard that lets them mangle their message, and another, and another. All of this means more processing power is needed everywhere just to decode Unicode into something normal, and the upward march of processing power and abstraction layers goes on.</p>
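<p>The direction-override trick can be sketched as follows. This assumes "RtoL" means U+202E and "POPdir" means U+202C; the helper names and the <code>BIDI_CONTROLS</code> set are illustrative only, not taken from any particular anti-spam engine. Note that merely stripping the control characters does not recover the displayed word, so flagging their presence is about the best a simple filter can do.</p>

```python
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE (assumed to be the comment's "RtoL")
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING (the comment's "POPdir")

# Stored order differs from display order: a bidi-aware renderer shows the
# overridden run reversed, so this reads as "administrators" on screen.
disguised = "a" + RLO + "rotartsinimd" + PDF + "s"

# Explicit directional formatting characters. The isolates U+2066..U+2069
# were added to Unicode after this comment was written.
BIDI_CONTROLS = set(
    "\u200e\u200f"                   # LRM, RLM
    "\u202a\u202b\u202c\u202d\u202e" # LRE, RLE, PDF, LRO, RLO
    "\u2066\u2067\u2068\u2069"       # LRI, RLI, FSI, PDI
)


def has_bidi_controls(text: str) -> bool:
    """Return True if the text contains any directional formatting character."""
    return any(ch in BIDI_CONTROLS for ch in text)


def strip_bidi_controls(text: str) -> str:
    """Remove directional formatting characters, keeping stored order."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


keyword_found = "administrators" in disguised   # naive substring filter
flagged = has_bidi_controls(disguised)          # presence check
stripped = strip_bidi_controls(disguised)       # still not the displayed word
```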
<p>Unicode is an amalgamated standard. That is, it's a bunch of rather different bits glommed and stuck together, so making assertions about what text of a certain form can and cannot be becomes incredibly difficult (and the answers may change as the standard is updated). Partially, this is because the problem it's trying to solve, representing all the world's languages, is unbelievably ill-defined and large. The different languages of the world just don't have that much in common with each other, so of course you end up with things mixed together. Also, the pre-existing standards that Unicode is expected to interoperate with don't always fit the Unicode data model of distinguishing carefully between codepoint, glyph, and combining mark.</p>
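<p>A small illustration of the codepoint versus combining-mark distinction, using Python's standard <code>unicodedata</code> module. The sample characters are my own choice: "é" can be stored as one precomposed codepoint or as "e" plus a combining accent, and the fullwidth "ａ" is the kind of compatibility character that exists for interoperating with older character sets. Code that compares strings without normalizing first will treat these as different text.</p>

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one codepoint
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Same glyph on screen, different codepoint sequences in memory.
raw_equal = precomposed == decomposed
lengths = (len(precomposed), len(decomposed))

# Canonical normalization (NFC composes, NFD decomposes) makes them comparable.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

# A nonzero combining class marks U+0301 as a combining mark.
accent_is_combining = unicodedata.combining("\u0301") != 0

# Compatibility normalization folds the fullwidth letter to plain ASCII.
fullwidth_a = "\uff41"   # FULLWIDTH LATIN SMALL LETTER A
folded = unicodedata.normalize("NFKC", fullwidth_a)
```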
<p>As a result, you end up with a character set that is almost a programming language, and we all know the problem with trying to nail down what a given chunk of code is going to do before you execute (i.e. display) it.</p>
</div>2006-03-05T03:07:22Z