The quest for the mythical C.UTF-8 locale

November 8, 2006

Recently, Pete Zaitcev wrote in passing:

Now if only someone designed a UTF-8 locale which did not screw the ordering files in ls output...

What he said. I've come to realize that what I want is what I'll call the 'C.UTF-8' locale: all of the old-fashioned Unix non-locale behavior, but with non-ASCII characters encoded in UTF-8. I don't mind UTF-8 too much and it's clearly the future, but I don't want anything else that gets bundled up as a 'locale', and I especially don't want crazy sorting.
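To make the sorting complaint concrete, here is a rough Python sketch of the two orderings. The filenames are made up, and the case-folding key is only a crude stand-in for a dictionary-style locale ordering; it is not glibc's actual collation rules for any real locale. The 'C' ordering is simply code point order, which for UTF-8 text is the same as raw byte order.

```python
# Hypothetical directory listing; the names are invented for illustration.
names = ["zebra.txt", "Makefile", "a.c", "README", "Zed", "b.c", "été.txt"]

# Old-fashioned 'C' ordering: compare by code point, so all of the
# uppercase ASCII names come before all of the lowercase ones.
c_order = sorted(names)

# UTF-8 preserves code point order, so sorting the encoded bytes
# gives exactly the same result as sorting the strings.
byte_order = sorted(names, key=lambda s: s.encode("utf-8"))

# A crude approximation of a dictionary-style ordering: ignore case.
# (An assumption for illustration only, not real locale collation.)
folded_order = sorted(names, key=str.casefold)

print(c_order)
print(folded_order)
```

With the 'C' ordering, Makefile and README float to the top of the listing, which is exactly the traditional behavior that a case-insensitive ordering takes away.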

Having spelunked through the glibc info docs and done some experimentation, I've found several useful environment variables for achieving something like this:

  • LC_CTYPE sets the output and input character encoding, although some programs have their own overrides that may also need fiddling (eg, less has LESSCHARSET)
  • LC_COLLATE sets the collation order, which controls how ls (and the shell's wildcard expansion, as in 'echo *', and so on) orders files, among other annoyances.
  • LC_MESSAGES sets what language messages appear in. It does not appear to set an implicit default output character encoding, so you must set LC_CTYPE as well for anything that needs non-ASCII characters.

LANG sets global values for these, overridden by the more specific versions; LC_ALL sets all of them, overriding everything else.
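As a minimal sketch of that precedence, assuming the usual rule that an empty value counts the same as an unset one, the lookup for a single category might be modelled like this in Python. The function name effective_locale and the sample environment are hypothetical; this only mimics the lookup order and does not call into the C library.

```python
def effective_locale(category, env):
    """Return the locale name that would apply to one category.

    `category` is a name such as "LC_COLLATE"; `env` is a dict standing
    in for the process environment. Lookup order: LC_ALL first, then the
    specific category, then LANG, then the built-in "C" default.
    """
    for var in ("LC_ALL", category, "LANG"):
        value = env.get(var)
        if value:  # unset and empty are treated alike (an assumption)
            return value
    return "C"


# Sample environment: UTF-8 everywhere except for sorting.
env = {"LANG": "en_US.UTF-8", "LC_COLLATE": "C"}
for cat in ("LC_CTYPE", "LC_COLLATE", "LC_MESSAGES"):
    print(cat, "->", effective_locale(cat, env))

# Setting LC_ALL trumps everything, including the LC_COLLATE override.
env_all = dict(env, LC_ALL="C")
print("LC_CTYPE ->", effective_locale("LC_CTYPE", env_all))
```

The second case is why setting LC_ALL is a blunt instrument: once it's set, the more specific variables no longer have any effect.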

Linux glibc is smart enough to convert from message character encodings to output character encodings, even for relatively complicated things like Chinese error messages. On the other hand, it's kind of daunting to think about how much code gets invoked when ls prints an error message.
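The kind of conversion involved can be sketched in a few lines of Python. This is only an illustration of the idea (decode from the encoding the message is stored in, re-encode in the output encoding); the sample message, the choice of GB2312 as the stored encoding, and the function name convert_message are all assumptions, and glibc's real implementation is considerably more involved.

```python
def convert_message(raw, catalog_encoding, output_encoding):
    """Re-encode a stored message for the output character encoding.

    Characters that the output encoding can't represent are replaced
    rather than raising an error.
    """
    text = raw.decode(catalog_encoding)
    return text.encode(output_encoding, errors="replace")


# A made-up Chinese error message, pretending it is stored as GB2312.
message = "没有那个文件或目录"
stored = message.encode("gb2312")

out = convert_message(stored, "gb2312", "utf-8")
print(len(stored), "bytes stored,", len(out), "bytes as UTF-8")

# If the output encoding is plain ASCII, nothing survives intact.
print(convert_message(stored, "gb2312", "ascii"))
```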

(I observe in passing that it's very handy to have a graphical program that deals only in UTF-8 and some UTF-8 files when testing this sort of thing. That way you can be sure that things really are generating UTF-8 or are in a UTF-8 display mode.)
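If you don't have such a program handy, a cruder check is to capture the output and see whether it decodes as UTF-8 at all. A small Python sketch of that, where is_valid_utf8 is a hypothetical helper name:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Report whether the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# 'é' encoded as UTF-8 is two bytes and passes the check.
print(is_valid_utf8("é".encode("utf-8")))

# The same character in Latin-1 is a lone 0xE9 byte, which is not
# a valid UTF-8 sequence.
print(is_valid_utf8("é".encode("latin-1")))
```

Pure ASCII output passes this check trivially, so it only tells you something when the output actually contains non-ASCII characters.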

I currently run with no LANG et al set, because of past concerns (and because I only work in English, so I can get away with it). Having looked at all this, it's tempting to set LC_CTYPE and step into the modern UTF-8 world. In theory it would be transparent (xterm and vi and GNU Emacs and so on seem to correctly switch into UTF-8 mode without further poking), and it'd mean I'd stop seeing vaguely mangled manpages every time I ssh into a normal modern Linux system.


Comments on this page:

By DanielMartin at 2006-11-08 12:15:07:

I have yet to see in all your locale-related annoyances any evidence that the override order is anything other than what is explicitly documented (even if some of your commenters get the order wrong off the top of their heads):

 System default ("C") is overridden by
 LANG, which is overridden by
 the usage-specific LC_* setting, which is overridden by
 LC_ALL

Given that, why not set LANG to en_US.UTF-8 and set LC_COLLATE to C? Or, if you're feeling paranoid, set LANG to C and LC_CTYPE to en_US.UTF-8?

By cks at 2006-11-08 17:31:10:

My concerns now aren't with setting LC_CTYPE but with whether anything odd would start to happen in my environment when I went UTF-8, and with how well things like our Solaris 8 boxes would cope. The change is a fairly big lurch, and I usually alternate between being rather nervous about those and being screamingly gung-ho, more or less at random.

(Since all I want is LC_CTYPE, setting just it strikes me as the best approach.)

Written on 08 November 2006.

Last modified: Wed Nov 8 00:38:51 2006