When can you assume UTF-8 filenames?

January 31, 2009

Here is an interesting question: when is it safe to assume that all of the filenames on your Linux machine are encoded in UTF-8?

The simple answer: it's only safe to make this assumption when you (and your system) have lived entirely within a hermetically sealed UTF-8 only bubble, never coming into contact with filenames from outside that bubble. Now, this is a pretty big bubble and it is slowly expanding, given that basically all new Linux machines default to UTF-8, but it is still a bubble, and there are still lots of things outside it.

(Thus, if you are writing general software the actual answer is 'never'.)
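
As a rough illustration of what 'never assume' can look like in practice, here is a minimal sketch in Python 3 (my choice of language and version, not something this entry prescribes). It keeps filenames as raw bytes and only turns them into text for display, in a way that cannot fail on non-UTF-8 names. The helper names display_name and list_directory are hypothetical.

```python
import os


def display_name(raw: bytes) -> str:
    """Return a printable form of a raw filename without ever raising.

    Bytes that are not valid UTF-8 are shown as \\xNN escapes instead
    of causing a UnicodeDecodeError.
    """
    return raw.decode("utf-8", errors="backslashreplace")


def list_directory(path: bytes = b"."):
    """Return (raw_name, printable_name) pairs for every entry in path.

    Passing a bytes path makes os.listdir() return bytes names, so no
    decoding happens behind our back. Keep raw_name for open(), rename()
    and so on; use printable_name only for showing to people.
    """
    return [(raw, display_name(raw)) for raw in os.listdir(path)]


if __name__ == "__main__":
    for raw, shown in list_directory(b"."):
        print(shown)
```

The design choice here is that the bytes are the real filename and the text is only a rendering of it, so a filename from outside the bubble is merely ugly instead of fatal.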

Unfortunately this is an easy mistake to make. If you live within the bubble and are sufficiently far from its edges that they are out of sight, you can be ignorant of its existence (and many people probably are). And even if you aren't exactly ignorant of its existence, you can still be overly optimistic about the size of the bubble.

(It's also possible that I'm being overly pessimistic about the size of the bubble. But I don't think that UTF-8 only systems are anywhere near as universal as people would like them to be, and I do think that they are fragile; there are lots of ways for 'bad' filenames to seep into the bubble, including various programs that make no attempt to guess filename encodings and transcode filenames into valid UTF-8 when they unpack archives.)
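
If you want to check how sealed your own bubble really is, one approach is to walk a directory tree and report every name that does not decode as UTF-8. This is a sketch under assumptions: it uses Python 3, the function names are made up for illustration, and it only tests whether a name is valid UTF-8, not whether it was actually intended as UTF-8.

```python
import os
import sys


def is_valid_utf8(name: bytes) -> bool:
    """Report whether a raw filename decodes cleanly as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_non_utf8_names(root: bytes):
    """Yield full paths (as bytes) whose final component is not valid UTF-8.

    os.walk() is given a bytes root so that it hands back bytes names
    and never tries to decode them itself.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_valid_utf8(name):
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    start = os.fsencode(sys.argv[1]) if len(sys.argv) > 1 else b"."
    for path in find_non_utf8_names(start):
        # Show undecodable bytes as \xNN escapes so printing cannot fail.
        print(path.decode("utf-8", errors="backslashreplace"))
```

Run against a home directory that has accumulated old archives or files copied from other systems, this may well turn up more than you expect.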

Or in short: if everything you see is UTF-8, it is easy to assume that everything in general is UTF-8.

(See also why the kernel shouldn't try to enforce UTF-8 only filenames.)
