== The quest for the mythical _C.UTF-8_ locale

Recently, Pete Zaitcev [[wrote in passing http://zaitcev.livejournal.com/100307.html]]:

> Now if only someone designed a UTF-8 locale which did not screw
> the ordering files in ls output...

What he said. I've come to realize that what I want is what I'll call the '_C.UTF-8_' locale: all of the old-fashioned Unix non-locale behavior, but with non-ASCII characters encoded in UTF-8. I don't mind UTF-8 too much and it's clearly the future, but I don't want anything else that gets bundled up as a 'locale', and I especially don't want [[crazy sorting ../sysadmin/LANGHate]].

Having spelunked through the glibc info docs and done some experimentation, I've found several useful environment variables for achieving something like this:

* ((LC_CTYPE)) sets the input and output character encoding, although some programs have their own overrides that may also need fiddling (eg, _less_ has _LESSCHARSET_).
* ((LC_COLLATE)) sets the collation order, which controls how _ls_ (and the shell's wildcard expansion and so on) order files, among [[other annoyances ../sysadmin/LANGHate]].
* ((LC_MESSAGES)) sets what language messages appear in. It does not appear to set an implicit default output character encoding, so you must set ((LC_CTYPE)) as well for anything that needs non-ASCII characters.

((LANG)) sets global default values for these, which are overridden by the more specific variables; ((LC_ALL)) sets all of them, overriding everything else.

Linux glibc is smart enough to convert from message character encodings to output character encodings, even for relatively complicated things like Chinese error messages. On the other hand, it's kind of daunting to think about how much code gets invoked when _ls_ prints an error message.

(I observe in passing that it's very handy to have a graphical program that deals only in UTF-8, plus some UTF-8 files, when testing this sort of thing. That way you can be sure that things really are generating UTF-8 or are in a UTF-8 display mode.)
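As a rough illustration of the precedence rules above (((LC_ALL)) beats the specific ((LC_*)) variable, which beats ((LANG)), with plain C behavior when nothing is set), here is a minimal Python sketch. The function names _effective_locale_ and _c_utf8_env_ are made up for this example, and _en_US.UTF-8_ is only an assumed locale name; substitute whatever UTF-8 locale your system actually has installed. This models only the environment variable lookup, not anything glibc does afterwards.

```python
# Categories discussed in this entry; glibc has more (LC_TIME, LC_NUMERIC, ...).
CATEGORIES = ("LC_CTYPE", "LC_COLLATE", "LC_MESSAGES")


def effective_locale(category, env):
    """Return the locale name that applies to `category` given `env`.

    Hypothetical helper: it models only the variable lookup order.
    """
    # An empty value counts as unset, so fall through to the next variable.
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"  # nothing set at all: the traditional Unix behavior


def c_utf8_env(env, ctype="en_US.UTF-8"):
    """Copy `env`, asking for UTF-8 characters but C collation.

    `ctype` is an assumption; it must name a locale installed on the system.
    """
    new_env = dict(env)
    new_env.pop("LC_ALL", None)  # LC_ALL would override the settings below
    new_env.pop("LANG", None)    # no global default; unset categories fall back to C
    new_env["LC_CTYPE"] = ctype
    new_env["LC_COLLATE"] = "C"
    return new_env


if __name__ == "__main__":
    import os

    env = c_utf8_env(os.environ)
    for cat in CATEGORIES:
        print(cat, "->", effective_locale(cat, env))
```

The dictionary returned by _c_utf8_env_ could be passed as the _env_ argument of Python's _subprocess.run_ to try a single command with these settings without touching your login environment.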
I currently run with no ((LANG)) et al set, because of [[past concerns ../sysadmin/LANGHate]] (and because I only work in English, so I can get away with it). Having looked at all this, it's tempting to set ((LC_CTYPE)) and step into the modern UTF-8 world. In *theory* it would be transparent (xterm and vi and GNU Emacs and so on seem to correctly switch into UTF-8 mode without further poking), and it'd mean I'd stop seeing vaguely mangled manpages every time I ssh into a normal modern Linux system.