The problems with enforced UTF-8 only filenames
Last week, one of the Linux kernel mailing list's perennial issues resurfaced: the great filename character set debate, where people argue about what (if anything) the Linux kernel should do about what character set (or character sets) filenames are in, how the kernel can ensure that everyone sees the same filename, and so on.
As usual, someone suggested that the Linux kernel should require that Linux filenames are all valid UTF-8 strings. This is a superficially attractive idea; applications would always know what to do with filenames, and they'd never have to deal with filenames that are malformed UTF-8 (not all byte sequences are valid UTF-8).
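To make "not all byte sequences are valid UTF-8" concrete, here is a minimal sketch of the check such a rule implies. The function name `is_valid_utf8` and the sample filenames are made up for illustration; the kernel proposals under discussion did not specify any particular code.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw filename bytes decode as strict UTF-8."""
    try:
        name.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The same visible name, as written by two differently configured programs.
utf8_name = "caf\u00e9.txt".encode("utf-8")      # b'caf\xc3\xa9.txt'
latin1_name = "caf\u00e9.txt".encode("latin-1")  # b'caf\xe9.txt'

for raw in (utf8_name, latin1_name):
    print(raw, "valid UTF-8" if is_valid_utf8(raw) else "NOT valid UTF-8")
```

The Latin-1 form is a perfectly ordinary filename today, but it would be rejected under an enforced UTF-8 rule.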
And it's completely unworkable.
The big problem is, what do you do with filenames that aren't already in UTF-8? The world is not such a nice place that all of the filenames on Linux systems are already in UTF-8; if it were, we wouldn't have this problem.
If the kernel takes this seriously, it has to refuse to accept or give out filenames that aren't valid UTF-8. This means that all of those files already on your system instantly become inaccessible; they don't appear in directory listings, they can't be opened, and so on.
The first usual proposal is that each mounted filesystem should have a 'native character set' option, to say what character set its filenames were actually in; the kernel then maps back and forth between UTF-8 and the native character set. The problem is that real filesystems today have filenames in more than one character set.
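A rough sketch of why a single per-filesystem setting falls over when names are mixed. The `to_utf8` helper stands in for the hypothetical kernel translation layer, and the choice of Latin-1 as the 'native character set' is just an assumption for the example.

```python
def to_utf8(raw: bytes, native_charset: str) -> bytes:
    """Hypothetical per-mount translation: native charset -> UTF-8."""
    return raw.decode(native_charset).encode("utf-8")


# One directory, two files, named by programs using different character sets.
old_name = "r\u00e9sum\u00e9.txt".encode("latin-1")
new_name = "r\u00e9sum\u00e9.txt".encode("utf-8")

# The filesystem is mounted with native charset 'latin-1'.
mapped_old = to_utf8(old_name, "latin-1")
mapped_new = to_utf8(new_name, "latin-1")

print(mapped_old.decode("utf-8"))  # the name the user expects
print(mapped_new.decode("utf-8"))  # mangled: the UTF-8 bytes were re-encoded
```

Whichever charset you pick for the mount, the names that were in the other one come out as garbage.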
The second usual proposal is that one should rename all of the filenames to be in UTF-8. This is subject to a whole lot of problems, including trying to figure out what character set each filename is in. (I have filenames that I'm not even sure what character set they're in.)
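The guessing problem is easy to demonstrate. In this sketch the candidate list and the sample bytes are arbitrary choices of mine; the point is only that one byte sequence decodes without error under several character sets, each giving a different name.

```python
CANDIDATES = ["utf-8", "latin-1", "koi8-r", "cp1251"]


def plausible_decodings(raw: bytes) -> dict:
    """Map each candidate charset that decodes `raw` cleanly to its result."""
    results = {}
    for charset in CANDIDATES:
        try:
            results[charset] = raw.decode(charset)
        except UnicodeDecodeError:
            pass  # this charset is ruled out for these bytes
    return results


mystery = b"caf\xe9.txt"
results = plausible_decodings(mystery)
for charset, text in results.items():
    print(f"{charset:8} -> {text}")
```

Only UTF-8 is ruled out here; nothing in the bytes themselves says which of the remaining answers is the right one, so an automated rename has to guess.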
Worse, filenames already on the system aren't the only problem. People import files onto the system all the time, and there's no guarantee at all that where they're getting those files and their filenames from is also using UTF-8 (use of UTF-8 for filenames is by no means universal). When the incoming names aren't valid UTF-8, the extraction may fail with an 'invalid filename' error. To fix this, the user needs either to persuade the other end to rename the files or to get a new version of the program they're using, one that can rename the files (somehow).
The odds of all of the sources of remote file blobs cooperating this way are low. The odds of getting new program versions are somewhat better (unless this is commercial software, such as a commercial backup solution), but there are a lot of such programs, many more than most people realize. That means this is going to be a lot of work and it is not going to happen very fast, and people will probably be tripping over this for quite a while.
(And if you want more fun, consider read-only media being imported from other systems, which is subject to all of these problems at once. It may have multiple character sets, you can't rename the files, and there's no local program to patch.)
So, to summarize: enforcing UTF-8 only filenames means a huge amount of ongoing pain and breakage in exchange for a dubious gain. Workable? Maybe in theory. Not in practice, unless you and your users enjoy frustration.
What character set filenames are in is a policy decision, not a technical decision. As a policy decision, the kernel should stay out of the whole area; things work much better when the kernel treats filenames as strings of bytes and nothing more. (There are also benefits if it's ever decided that perhaps UTF-8 is not the ideal character set encoding after all.)
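As an illustration of user-level code living with bytes-only filenames, here is a small sketch of carrying an arbitrary filename through text and back without losing anything. It uses Python's `surrogateescape` error handler, which as far as I know is the same mechanism `os.fsdecode()` and `os.fsencode()` rely on under a UTF-8 locale on Unix; the helper names `to_text` and `to_bytes` are mine.

```python
def to_text(raw: bytes) -> str:
    """Decode as UTF-8, smuggling undecodable bytes through as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """Reverse of to_text(): restore the exact original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


raw = b"caf\xe9.txt"           # not valid UTF-8
as_text = to_text(raw)         # usable as a str inside the program
round_tripped = to_bytes(as_text)

print(round_tripped == raw)    # the kernel gets back exactly what it gave out
```

The program decides how to present the name; the bytes handed back to the kernel are untouched, which is all the kernel needs.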
(Sometimes people propose the kernel enforcing UTF-8 only as a way to get people to move to UTF-8. This strikes me as even more stupid; beating people over the head with a club is not the way to make them cooperate or like you.)
Written on 21 June 2005.