Chris's Wiki :: blog/python/UnicodeArrogance Comments
https://utcc.utoronto.ca/~cks/space/blog/python/UnicodeArrogance?atomcomments
DWiki
2012-07-17T15:09:53Z
Recent comments in Chris's Wiki :: blog/python/UnicodeArrogance.
By Chris Siebenmann on /blog/python/UnicodeArrogance
tag:CSpace:blog/python/UnicodeArrogance:1431a926c6a1278a50d32da473a4704fd612ad06
Chris Siebenmann
<div class="wikitext"><p>Even for filenames, I think that a lot depends on the specifics of the
encoding. In Unix you can get away with an encoding that avoids ever
generating either null bytes or a '<code>/</code>' (and ideally avoids generating
control characters), and I believe that at least some CJK legacy
encodings are structured like this.</p>
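<p>As a rough way to test this for a particular encoding, the sketch below
reports which characters of a string encode to a null byte or a
'<code>/</code>' byte without being either of those characters themselves.
The function name, the sample filename and the list of codecs are all made up
for illustration, and checking one string is of course not a proof about the
whole encoding.</p>

```python
def stray_unix_bytes(text, encoding):
    """Return the characters of `text` whose encoded form contains a NUL
    or '/' byte even though the character itself is neither of those."""
    bad = []
    for ch in text:
        if ch in ("\0", "/"):
            # These are illegal in a Unix filename component regardless
            # of the encoding, so they are not the encoding's fault.
            continue
        encoded = ch.encode(encoding)
        if b"\0" in encoded or b"/" in encoded:
            bad.append(ch)
    return bad


# Hypothetical sample filename mixing ASCII with one kanji.
sample = "report-表.txt"
for enc in ("utf-8", "shift_jis", "euc_jp", "utf-16-le"):
    print(enc, stray_unix_bytes(sample, enc))
```

<p>An empty list means the encoded name could be handed to the kernel as is.
UTF-16 is the obvious failure, since every ASCII character picks up a null
byte.</p>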
<p>(Technically the thing that destroys my arrogant approach is an
encoding that either produces significant runs of bytes that look like
ASCII characters when encoding non-ASCII text or encodes the basic Latin
characters in some other way than as the usual ASCII bytes. You can get
into corner cases like <a href="http://en.wikipedia.org/wiki/Shift_JIS">Shift JIS</a>, where you're missing a few of
the punctuation characters.)</p>
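<p>The second condition, whether an encoding represents the basic Latin
characters as their usual ASCII bytes, is easy to probe. This is only a
sketch: the function name is invented, and <code>cp037</code> (an EBCDIC code
page that ships with Python) is my choice of counterexample, not something
from the discussion above.</p>

```python
import string


def ascii_mismatches(encoding):
    """Return the printable ASCII characters that `encoding` does not
    encode as the single byte plain ASCII uses for them."""
    mismatched = []
    for ch in string.printable:
        try:
            same = ch.encode(encoding) == ch.encode("ascii")
        except UnicodeEncodeError:
            # The encoding cannot represent this character at all.
            same = False
        if not same:
            mismatched.append(ch)
    return mismatched


for enc in ("utf-8", "latin-1", "cp037", "utf-16-le"):
    print(enc, len(ascii_mismatches(enc)))
```

<p>An encoding with an empty result is ASCII-transparent in this narrow
sense. That says nothing about the first condition, where bytes in the ASCII
range turn up inside multi-byte characters.</p>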
</div>
2012-07-17T15:09:53Z
From 89.243.102.89 on /blog/python/UnicodeArrogance
tag:CSpace:blog/python/UnicodeArrogance:a335b754a2b296fa8fd2a4922e48a2aaa63f51f7
From 89.243.102.89
<div class="wikitext"><p>Looks like that applies to all the CJK legacy encodings.</p>
<p>I thought I'd seen someone stating that non-ASCII locales were explicitly not supported... I assume it applied to Linux only. But such a locale would tend to break any string exchange with a kernel API: dmesg, mount options, sysfs attributes... the shebang line, and PATH.</p>
<p>POSIX almost outright warns you not to do it. The result of switching between locales that encode characters in the "portable character set" differently is "unspecified". And the portable character set is something like 85% of ASCII. (Hopefully it's just the control characters they left out. I haven't checked for sure.)</p>
<p>I don't think you should feel guilty about it on Unix. Windows is different: I don't think your original argument applies there, because Windows is much more serious about filenames being Unicode.</p>
<pre>
- Alan
</pre>
</div>2012-07-16T20:56:22Z