The problems with enforced UTF-8 only filenames

June 21, 2005

Last week, one of the Linux kernel mailing list's perennial issues resurfaced: the great filename character set debate, in which people argue over what (if anything) the Linux kernel should do about what character set (or character sets) filenames are in, how the kernel can ensure that everyone sees the same filename, and so on.

As usual, someone suggested that the Linux kernel should require that Linux filenames are all valid UTF-8 strings. This is a superficially attractive idea; applications would always know what to do with filenames, and they'd never have to deal with filenames that are malformed UTF-8 (not all byte sequences are valid UTF-8).
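The point that not all byte sequences are valid UTF-8 is easy to see concretely. What follows is only a minimal sketch in Python; `is_valid_utf8` and the sample names are made up for illustration, and this is not how any kernel proposal would actually do the check.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw filename bytes decode as strict UTF-8."""
    try:
        name.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


# Sample byte strings, invented for this illustration.
samples = [
    b"report.txt",       # plain ASCII is always valid UTF-8
    b"caf\xc3\xa9.txt",  # "cafe" with an accented e, encoded as UTF-8
    b"caf\xe9.txt",      # the same name encoded as ISO-8859-1
    b"\xc3.txt",         # a lead byte with no continuation byte after it
]

for raw in samples:
    print(raw, is_valid_utf8(raw))
```

The last two names are perfectly legal filenames on Linux today; under a UTF-8 only rule they are the ones that would be rejected.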

And it's completely unworkable.

The big problem is, what do you do with filenames that aren't already in UTF-8? The world is not such a nice place that all of the filenames on Linux systems are already in UTF-8; if it was, we wouldn't have this problem.

If the kernel takes this seriously, it has to refuse to accept or give out filenames that aren't valid UTF-8. This means that all of those files already on your system instantly become inaccessible; they don't appear in directory listings, they can't be opened, and so on.

The first usual proposal is that each mounted filesystem should have a 'native character set' option, to say what character set its filenames are actually in; the kernel then maps back and forth between UTF-8 and the native character set. The problem is that real filesystems today have filenames in more than one character set.
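To illustrate why a single per-filesystem setting falls down, here is a rough Python sketch. It assumes a hypothetical translation step, `to_kernel_utf8`, standing in for whatever mapping the kernel would do; the function name and the sample filenames are invented for the example.

```python
def to_kernel_utf8(on_disk: bytes, mount_charset: str) -> bytes:
    """Translate an on-disk name to UTF-8, trusting the mount's declared charset."""
    return on_disk.decode(mount_charset).encode("utf-8")


# Two on-disk spellings of the same name ("cafe" with an accented e),
# created at different times on one filesystem.
old_name = "caf\u00e9".encode("iso-8859-1")  # b'caf\xe9'
new_name = "caf\u00e9".encode("utf-8")       # b'caf\xc3\xa9'

for raw in (old_name, new_name):
    translated = to_kernel_utf8(raw, "iso-8859-1")
    # ascii() keeps the output printable whatever the terminal's encoding is
    print(raw, "->", ascii(translated.decode("utf-8")))
```

With the filesystem declared as ISO-8859-1, the old name comes out right but the name that was already UTF-8 is encoded a second time and turns into mojibake. Declaring the filesystem as UTF-8 instead just moves the failure to the old name, which then cannot be decoded at all.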

The second usual proposal is that one should rename all of the filenames to be in UTF-8. This is subject to a whole lot of problems, including trying to figure out what character set each filename is in. (I have filenames whose character set I'm not even sure of.)
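The guessing problem can be sketched in a few lines of Python. This is an illustration only: the candidate list, the `plausible_charsets` helper, and the mystery filename are all assumptions made up for the example, not a real detection method.

```python
# Candidate charsets to try; this particular list is an arbitrary assumption.
CANDIDATES = ["utf-8", "iso-8859-1", "iso-8859-15", "cp1251"]


def plausible_charsets(name: bytes) -> dict:
    """Map each candidate charset that decodes `name` without error to its text."""
    results = {}
    for charset in CANDIDATES:
        try:
            results[charset] = name.decode(charset)
        except UnicodeDecodeError:
            continue  # this charset is ruled out for this name
    return results


mystery = b"\xc4\xe5\xed\xfc.txt"  # invented sample filename
for charset, text in plausible_charsets(mystery).items():
    # ascii() keeps the output printable whatever the terminal's encoding is
    print(f"{charset:12} {ascii(text)}")
```

UTF-8 can be ruled out here, but every single-byte charset in the list decodes the name without complaint, each to different text. Only a human who knows what the file was supposed to be called can say which answer is right.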

Worse, filenames already on the system aren't the only problem. People import files onto the system all the time, and there's no guarantee at all that wherever those files and their filenames come from is also using UTF-8 (use of UTF-8 for filenames is by no means universal). When it isn't, the import (unpacking an archive, restoring a backup, and so on) may fail with an 'invalid filename' error. To fix this, the user needs to either persuade the other end to rename the files or get a new version of the program they're using, one that can rename the files (somehow).

The odds of all of the sources of remote file blobs cooperating this way are low. The odds of getting new program versions are somewhat better (unless this is commercial software, such as a commercial backup solution), but there are a lot of such programs, many more than most people realize. That means this is going to be a lot of work and it is not going to happen very fast, and people will probably be tripping over this for quite a while.

(And if you want more fun, consider read-only media being imported from other systems, which is subject to all of the problems. It may have multiple character sets, you can't rename the files, and there's no local program to patch.)

So, to summarize: enforced UTF-8 only filenames are a huge amount of ongoing pain and breakage in exchange for a dubious gain. Workable? Maybe in theory. Not in practice, unless you and your users enjoy frustration.

What character set filenames are in is a policy decision, not a technical decision. As a policy decision, the kernel should stay out of the whole area; things work much better when the kernel treats filenames as strings of bytes and nothing more. (There are also benefits if it's ever decided that perhaps UTF-8 is not the ideal character set encoding after all.)
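For a sense of what 'strings of bytes and nothing more' looks like from user space, here is a small Python sketch. It uses Python's `surrogateescape` error handler, which is the mechanism Python itself uses for filenames on Unix (PEP 383); the helper names `to_display` and `to_raw` and the sample filenames are made up for this example.

```python
def to_display(raw: bytes) -> str:
    """Decode for display, carrying undecodable bytes along as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_raw(text: str) -> bytes:
    """Recover the exact original bytes from a string made by to_display."""
    return text.encode("utf-8", errors="surrogateescape")


names = [b"caf\xc3\xa9.txt", b"caf\xe9.txt"]  # valid UTF-8, then ISO-8859-1
for raw in names:
    shown = to_display(raw)
    assert to_raw(shown) == raw  # the round trip is lossless in both cases
    print(raw, "->", ascii(shown))
```

The kernel never has to know or care which name is 'right'. A program that wants to show names as text makes its own policy decision about decoding, and it can still hand the exact original bytes back when it opens the file.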

(Sometimes people propose the kernel enforcing UTF-8 only as a way to get people to move to UTF-8. This strikes me as even more stupid; beating people over the head with a club is not the way to make them cooperate or like you.)


Comments on this page:

The big problem is, what do you do with filenames that aren't already in UTF-8? The world is not such a nice place that all of the filenames on Linux systems are already in UTF-8; if it was, we wouldn't have this problem.

That should be if it were. Anyway, this is a natural problem, where foresight be missing. Now, being fair, treating filenames as octets is a bounded domain, although not nearly so clear a domain as desired by the system's creators.

What character set filenames are in is a policy decision, not a technical decision. As a policy decision, the kernel should stay out of the whole area; things work much better when the kernel treats filenames as strings of bytes and nothing more. (There are also benefits if it's ever decided that perhaps UTF-8 is not the ideal character set encoding after all.)

The kernel stays out of the way, except when it comes to / and NUL, the former for silly directory reasons and the latter because the C language can't bear not being catered to. Part of why UTF-8 will never be an ideal is because it similarly takes pains to avoid those two ASCII values in other contexts, for such poor reasons.

(Sometimes people propose the kernel enforcing UTF-8 only as a way to get people to move to UTF-8. This strikes me as even more stupid; beating people over the head with a club is not the way to make them cooperate or like you.)

This is particularly amusing due to the later zealotry with which UTF-8 has been forced on others, ironically making an even greater issue out of this octet filename nonsense.

By Miksa at 2021-12-17 06:06:58:

I switched my own fileserver from ISO-8859-15 to UTF-8 over a decade ago. I handled that with

find . -maxdepth 1 -regextype posix-egrep -not -regex '[[:alnum:][:punct:][:space:]]*'

and

to_name=$(printf '%s' "$from_name" | iconv --from-code="$from_conv" --to-code="$to_conv")

I feel that after the conversion I've had fewer problems. Every now and then I used to come across filenames that ISO-8859-15 couldn't handle, which showed up as, for example, ls listings where filenames had newlines in weird spots.

Written on 21 June 2005.

Last modified: Tue Jun 21 04:00:51 2005