Another reason to hate $LANG and locales on Unix

August 22, 2010

Sometimes I'm slow; only recently did it occur to me how the $LANG sort misfeature and GNU comm's misfeature combine in an orgy of annoyance in a heterogeneous environment.

Suppose that you have systems that changed their default locale between operating system versions. As part of routine processing, you use comm to get the difference between something on the local system and a global list. Well, oops. Even if you carefully use sort on both versions, you are going to have problems.

As we saw earlier, the choice of locale may change the sort order. While GNU comm is locale aware in just the same way as sort, it is not aware of multiple locales; it assumes that all files are sorted in the current locale (and these days it actively requires it). So your global file, although sorted, may not be sorted in the current system's locale, which will cause comm both to complain and to fail.
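As a rough illustration of the sort order changing, here is the same input sorted twice. The four sample lines and the `sample` helper are made up for this sketch, and what the second command prints depends on the locale your shell is running in and which locales the system has installed.

```shell
# Four sample lines, deliberately unsorted (hypothetical data).
sample() { printf 'b\nA\na\nB\n'; }

# The C locale sorts by byte value, so all uppercase ASCII letters
# come before all lowercase ones: A B a b.
sample | LC_ALL=C sort

# The current locale, whatever it is. In many non-C locales upper and
# lower case are interleaved instead, so the order can differ from above.
sample | sort
```

If the two outputs differ on your machine, a file sorted one way is not sorted as far as a comm running the other way is concerned.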

(You get the same effect if you generate different global files on different machines and then try to process them together.)

Effectively this means that there is no such thing as a globally visible file that is properly sorted, because what counts as 'properly sorted' differs from machine to machine. Instead you probably want to sort all files on the local machine, which means making copies of the global ones. Ideally you want to do this right before using them, because the locale may differ between various environments even on a single machine; it is simply safer to sort files in the script immediately before feeding them to comm, so you know that sort and comm were both running in the same locale.
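A minimal sketch of what that might look like in a script follows. The function name `local_comm` and its argument order are invented for illustration, and it assumes a `mktemp` that supports `-d` (true of the GNU and BSD versions, but not something POSIX requires).

```shell
# local_comm GLOBAL_LIST LOCAL_LIST [comm options...]
# Hypothetical helper: sort private copies of both inputs and compare
# them immediately, all from one shell process, so sort and comm see
# the same locale settings.
local_comm() {
    global_list=$1
    local_list=$2
    shift 2
    tmpdir=$(mktemp -d) || return 1
    # Never trust the order the global file arrived in; re-sort it here.
    sort "$global_list" > "$tmpdir/global" &&
        sort "$local_list" > "$tmpdir/local" &&
        comm "$@" "$tmpdir/global" "$tmpdir/local"
    status=$?
    rm -rf "$tmpdir"
    return $status
}
```

Something like `local_comm /some/global/list /some/local/list -13` (both paths are placeholders) would then print the lines that appear only in the local list, whatever locale the script happens to be running under.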

(Offhand, there are at least four plausible environments where system scripts might run with a different locale: from init.d scripts at boot time, from crontab entries, from an interactive login, and from an automated ssh command invocation that passes along the other machine's locale.)

Comments on this page:

From at 2010-08-22 09:24:28:

Given that you're just sorting so you can run comm, why not sort and comm always with LANG=C?

By cks at 2010-08-22 21:23:22:

This is probably the solution we'll have to adopt (in some variation). But it annoys me to have to clutter up all of our scripts with something designed strictly to defeat idiocy instead of doing productive work.

(Especially since it's going to require a comment so we remember why such an odd thing is there.)
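One possible variation, comment included, is sketched below. It uses LC_ALL=C rather than LANG=C because LC_ALL takes precedence over both LANG and LC_COLLATE, so setting only LANG does nothing if either of the others is already set in the environment. The function name `c_comm` is made up for this sketch, and it assumes a `mktemp` that supports `-d`.

```shell
# c_comm FILE1 FILE2 [comm options...]
# Why the odd LC_ALL=C: sort order depends on the locale, and comm
# expects its inputs to be sorted in the locale it is running in.
# Forcing the C locale (plain byte order) on every command makes the
# result independent of the machine's default locale.
c_comm() {
    file1=$1
    file2=$2
    shift 2
    tmpdir=$(mktemp -d) || return 1
    LC_ALL=C sort "$file1" > "$tmpdir/1" &&
        LC_ALL=C sort "$file2" > "$tmpdir/2" &&
        LC_ALL=C comm "$@" "$tmpdir/1" "$tmpdir/2"
    status=$?
    rm -rf "$tmpdir"
    return $status
}
```

The price is that the output comes out in byte order, which is fine for feeding to other programs but may look odd to people.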

Written on 22 August 2010.


Last modified: Sun Aug 22 00:17:56 2010