2010-08-22
Another reason to hate $LANG and locales on Unix
Sometimes I'm slow; only recently did it occur to me how the $LANG
sort misfeature and GNU comm's misfeature combine in an orgy of
annoyance in a heterogeneous environment.
Suppose that you have systems that changed their default locale
between operating system versions. As part of routine processing, you
use comm to get the difference between something on the local system
and a global list. Well, oops. Even if you carefully use sort on both
versions, you are going to have problems.
As we saw earlier, the choice of locale may change the sort order.
While GNU comm is locale aware in just the same way as sort, it is not
aware of multiple locales; it assumes that all files are sorted in the
current locale (and these days it actively requires it). So your
global file, although sorted, may not be sorted in the current system's
locale, which will cause comm both to complain and to fail.
(You get the same effect if you generate different global files on different machines and then try to process them together.)
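
To see whether a particular file counts as sorted on the machine you are on, you can ask sort itself. This is only a sketch: the function name sorted_here is made up for illustration, and it relies on the POSIX sort -c option, which checks the order under whatever locale is currently in effect without producing any output.

```shell
#!/bin/sh
# sorted_here FILE (hypothetical helper name)
# Succeeds only if FILE is already in sorted order according to the
# locale that this shell is currently running under. The same file can
# pass on one machine and fail on another.
sorted_here() {
    # -c checks the order without sorting; the diagnostic that names
    # the first out-of-order line is discarded here.
    sort -c "$1" 2>/dev/null
}
```

A file that passes this check under one locale setting may fail it under another, which is the situation comm ends up complaining about.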
Effectively this means that there is no such thing as a globally visible
file that is properly sorted, because what 'properly sorted' means
differs from machine to machine. Instead you probably want to sort all
files on the local machine, which means making copies of the global
ones. Ideally you want to do this right before using them, because the
locale may differ between various environments even on a single machine;
it is simply safer to sort files in the script immediately before
feeding them to comm, so you know that sort and comm were both running
in the same locale.
(Offhand, there are at least four plausible environments where system
scripts might run with a different locale: from init.d scripts at boot
time, from crontab entries, from an interactive login, and from an
automated ssh command invocation that passes along the other machine's
locale.)
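
The 'sort right before comm' approach might look something like the following. This is a minimal sketch rather than a finished script; the function name only_local and the temporary file names are invented for illustration. It re-sorts private copies of both inputs and hands them straight to comm, so both commands inherit the same locale no matter how the script was started.

```shell
#!/bin/sh
# only_local LOCAL GLOBAL (hypothetical helper name)
# Print the lines that appear in LOCAL but not in GLOBAL.
# Neither input is trusted to be sorted already: both are re-sorted
# into temporary copies immediately before comm runs, in the same
# shell and therefore under the same locale settings.
only_local() {
    tmpdir=$(mktemp -d) || return 1
    sort "$1" > "$tmpdir/local.sorted" &&
    sort "$2" > "$tmpdir/global.sorted" &&
    # -23 suppresses columns 2 and 3, leaving lines unique to file 1.
    comm -23 "$tmpdir/local.sorted" "$tmpdir/global.sorted"
    status=$?
    rm -rf "$tmpdir"
    return $status
}
```

The cost is an extra sort of the global list on every run, which is usually cheap compared to debugging a comm failure that only shows up from cron.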