Why case independent filenames are a bad idea

January 23, 2006

People keep pushing the idea of case independent filenames (and mocking systems without them), but the whole idea has three big problems: case conversion is a lot more complicated than people think, locking a character set encoding into your OS damages its ability to evolve, and case folding is language specific.

The handy example of the last is Turkish (and Azeri). In Turkish, the capitalization of 'i' is not 'I' but the dotted I (Unicode U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE); the lowercase version of 'I' is not 'i' but U+0131 (LATIN SMALL LETTER DOTLESS I). If you don't ignore Turkish, your system has some very interesting decisions to make: what happens when a Turkish user creates files called 'ISTAN' and 'istan', and a German user tries to open 'Istan' and 'iSTAN'?

(Judging from the Unicode SpecialCasing file, Lithuanian may be another example.)
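
To make the Turkish problem concrete, here is a small Python sketch. Python's built-in str.lower() and str.upper() use the locale-independent Unicode mappings, so the Turkish rules have to be layered on top; turkish_lower, turkish_upper and the two same_file_* functions are hypothetical helpers I made up for illustration, not anything a real filesystem or library provides.

```python
# Sketch only: the helpers below are hypothetical, not a library API.
# Python's own str.lower()/str.upper() ignore the user's language.

DOTTED_CAPITAL_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I

# Handle the two Turkish-specific pairs first, then let the default
# Unicode mapping take care of every other character.
_TR_LOWER = str.maketrans({"I": DOTLESS_SMALL_I, DOTTED_CAPITAL_I: "i"})
_TR_UPPER = str.maketrans({"i": DOTTED_CAPITAL_I, DOTLESS_SMALL_I: "I"})


def turkish_lower(name: str) -> str:
    """Lowercase a name using the Turkish/Azeri rules for i and I."""
    return name.translate(_TR_LOWER).lower()


def turkish_upper(name: str) -> str:
    """Uppercase a name using the Turkish/Azeri rules for i and I."""
    return name.translate(_TR_UPPER).upper()


def same_file_default(a: str, b: str) -> bool:
    """Compare two names the way a language-blind system might."""
    return a.lower() == b.lower()


def same_file_turkish(a: str, b: str) -> bool:
    """Compare two names the way a Turkish user would expect."""
    return turkish_lower(a) == turkish_lower(b)


if __name__ == "__main__":
    # Language-blind rules say these are one file; Turkish rules say two.
    print(same_file_default("ISTAN", "istan"))
    print(same_file_turkish("ISTAN", "istan"))
```

Whichever comparison the system picks, one group of users gets a surprising answer, and the system has no way to know from the filename alone which rule its creator had in mind.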

English case folding is simple, but for other languages and character sets it gets tricky. Issues include:

  • single letters can be equivalent to multiple letters.
  • some case folding is context dependent.
  • in Unicode, straight case folding apparently may not preserve proper normalization.
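
The first two issues in the list can be seen from Python's built-in string methods, which follow the Unicode default case mappings. This is only a sketch of two well-known examples (German sharp s and Greek sigma); the variable names are mine, and it says nothing about how any particular filesystem behaves.

```python
# One letter can map to several: German sharp s has no single-character
# uppercase form in the default Unicode mapping.
SHARP_S = "\u00df"  # LATIN SMALL LETTER SHARP S

street = "stra" + SHARP_S + "e"
shouted = street.upper()        # the sharp s becomes two letters
round_trip = shouted.lower()    # and does not come back

print(len(street), len(shouted))
print(round_trip == street)

# Context dependence: Greek capital sigma lowercases to a different
# letter at the end of a word than it does elsewhere.
CAPITAL_SIGMA = "\u03a3"
SMALL_SIGMA = "\u03c3"
FINAL_SIGMA = "\u03c2"

word = "\u039f\u0394\u039f" + CAPITAL_SIGMA  # four Greek capitals
print(word.lower()[-1] == FINAL_SIGMA)       # word-final position
print(CAPITAL_SIGMA.lower() == SMALL_SIGMA)  # standing alone
```

The round trip is the part that bites for filenames: uppercasing and then lowercasing a name is not guaranteed to give you the name you started with, so "compare the uppercased forms" and "compare the lowercased forms" can disagree.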

(Since all of this can be beaten to death with enough code, it's only a pragmatic issue. For more fun, apparently Unicode is still revising the case folding rules for some characters.)

Because only characters have 'case', not bytes, the OS has to decide what character set encoding its filenames are in; combined with case independence, this means filenames can't reliably be arbitrary byte streams. If in the future people want to put filenames in your system that don't fit in your character set encoding, you're in trouble (a mistake that's been made as recently as the people who picked UCS-2). Unicode is generally held to be the last word on this particular issue, but that still leaves you needing an encoding; UTF-8 annoys the Asian languages, while UCS-4 annoys everyone evenly.
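
A short Python sketch of why bytes alone don't have case. The byte string RAW_NAME and the helper try_fold are made up for illustration; the codec names are the standard Python ones. The same four bytes are invalid under one encoding, already lowercase under another, and uppercase under a third.

```python
# Sketch only: RAW_NAME and try_fold are hypothetical, for illustration.
from typing import Optional

RAW_NAME = b"caf\xe9"  # four bytes that some program used as a filename


def try_fold(raw: bytes, encoding: str) -> Optional[str]:
    """Decode raw under the given encoding and case fold the result.

    Returns None if the bytes are not valid in that encoding, in which
    case there is no sensible notion of 'case' for them at all.
    """
    try:
        return raw.decode(encoding).casefold()
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    for enc in ("utf-8", "latin-1", "koi8_r"):
        print(enc, repr(try_fold(RAW_NAME, enc)))

# A fixed 16-bit encoding also runs out of room: this cased letter
# lies outside the first 65,536 code points.
DESERET_CAPITAL = "\U00010400"  # DESERET CAPITAL LETTER LONG I
utf16_units = len(DESERET_CAPITAL.encode("utf-16-le")) // 2
```

Under Latin-1 the last byte is a lowercase letter and folding leaves the name alone; under KOI8-R it is an uppercase letter and folding changes it. A case independent filesystem has to pick one interpretation and reject or mangle everything else.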

(While Unicode has interesting traps for the unwary, things like normalization forms are mostly irrelevant for the narrow issue of case independent filenames.)

Assuming that you brush the whole issue of Turkish under the carpet, supporting 'case independent' filenames still requires a great deal of code (and associated character data), involves a significant amount of fun at runtime with interesting pathological cases, and gives your operating system heartburn if it turns out that Unicode and your chosen Unicode encoding are not actually the last word in character sets.

(The best reference for this I've found is Globalization Gotchas, in the 'Text Transformations' section. A long discussion of Unicode case mappings is collected in one place in Unicode Standard Annex #21, or rolled into the core standard as described here.)

(PS: writing this entry has vividly shown me that I can't spell 'independent' without prodding from spell.)

Written on 23 January 2006.
Last modified: Mon Jan 23 02:42:03 2006