Bash is letting locales destroy shell scripting (at least on Linux)

July 2, 2014

Here, let me present you something in illustrated form, on a system where /bin/sh is Bash:

$ cat Demo
#!/bin/sh
for i in "$@"; do
  case "$i" in
    *[A-Z]*) echo "$i has upper case";;
  esac
done
$ env - LANG=en_US.UTF-8 ./Demo a b C y z
b has upper case
C has upper case
y has upper case
z has upper case
$ env - LANG=en_US.UTF-8 /bin/dash ./Demo a b C y z
C has upper case
$ env - ./Demo a b C y z # no locale
C has upper case
$

I challenge you to make sense of either part of Bash's behavior in the en_US.UTF-8 locale.

(Contrary to my initial tweet, this behavior has apparently been in Bash for some time. It's also somewhat system dependent; Bash 4.2.25 on Ubuntu 12.04 behaves this way but 4.2.45 on FreeBSD doesn't.)

There are no two ways to describe this behavior: it is braindamaged. It is at best robot logic on Bash's part to allow [A-Z] to match lower case characters. It is also terribly destructive to Bash's utility for shell scripting. If I cannot even count on glob operations that are not even in a file context operating sanely, why am I using Bash to write shell scripts at all? On many systems, this means eschewing '#!/bin/sh' entirely because (as we're seeing here) /bin/sh can be Bash and Bash will behave this way even when invoked as sh.

(I have to assume that not matching a as upper case is a Bash bug but that the rest of the behavior is intended. It makes more sense than the other way around.)

What Bash has done here is to strew land mines in the way of my scripts working right in what is now a common environment. If I want to continue using shell scripts I have to start trying to defensively defeat Bash. What will do it? Today, probably setting LC_COLLATE=C or better yet LC_ALL=C. In all of my scripts. I might as well switch to Python or Perl even for small things; they are clearly less likely to cause me heartburn in the future by going crazy.
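As a minimal sketch of that defensive workaround (assuming LC_ALL=C is enough to restore ASCII ranges in your Bash; the has_upper helper name is made up for illustration), the top of every script would need something like this:

```sh
#!/bin/sh
# Force the C (POSIX) locale so that bracket ranges such as [A-Z]
# are interpreted in plain ASCII order, whatever the caller's LANG is.
LC_ALL=C
export LC_ALL

# has_upper is a hypothetical helper, not something from Bash or POSIX.
# It succeeds if its argument contains at least one ASCII upper case letter.
has_upper() {
  case "$1" in
    *[A-Z]*) return 0;;
    *) return 1;;
  esac
}

for i in "$@"; do
  if has_upper "$i"; then
    echo "$i has upper case"
  fi
done
```

The catch is that this also switches off every other locale-dependent behavior for everything the script runs, which may or may not be what you want.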

There's another problem with this behavior, which is that it is not what any other POSIX-compatible shell I could find does (on Ubuntu 14.04). Dash (the normal /bin/sh on many Linuxes), mksh, ksh, and even zsh don't match lower case letters here. This means that having Bash as /bin/sh creates a serious behavior difference, rather than just adding non-POSIX features that you may accidentally (or deliberately) use in '#!/bin/sh' scripts.

(Yes, yes, I've written about this before. But the examples back then were vaguely sensible things for locales to apply to. What is happening in the Demo script is very, very far over the line. What is next, GNU grep deciding that your '[A-Z]' should match case-independently in some locales? That's just as justified as what Bash is doing here.)

PS: This is actually making me rethink the idea of having /bin/sh be Bash on our Ubuntu machines, which is the case for historical reasons. The pain of rooting out bashisms from our scripts may be less than the pain of dealing with Bash braindamage.

Sidebar: the bug continues

If you change the [A-Z] to [a-z] and try Demo with all upper case letters, it will match A-Y but think Z doesn't match. This is symmetrical in what you could consider a weird way. A quick test suggests that all other letters besides 'a' (in the [A-Z] case) and 'Z' (in the [a-z] case) match 'correctly', if we assume that a case independent match is correct in the first place.

Because I was masochistic tonight this has been filed as GNU Bash bug 108609 (tested against bash git tip), although savannah.gnu.org may have eaten the actual text I put in (it sent the text to me in email but I can't read the text through the web). My bug is primarily to report the missing 'a' and 'Z' and only lightly touches on the whole craziness of [A-Z] matching any lower case characters at all, so I encourage other people to file their own bugs about that. I have opted for a low-stress approach myself since I don't expect my bug report to go anywhere.


Comments on this page:

By Sergey Vlasov at 2014-07-03 02:08:18:

What is next, GNU grep deciding that your '[A-Z]' should match case-independently in some locales?

Apparently it already does with some versions of GNU grep and glibc.

The POSIX spec says:

7. In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior: strictly conforming applications shall not rely on whether the range expression is valid, or on the set of collating elements matched.

If the specification for the POSIX locale is applied to other locales, you get the behavior you see in bash: in most locales Latin letters are ordered as aAbBcC...zZ, therefore [A-Z] matches all uppercase and lowercase Latin letters except “a” (it will also match many kinds of accented letters). For some languages you may get even more confusing results; e.g., “Ch” may be treated as a single letter which is ordered nowhere near “C”.

So you can either force LC_ALL=C (then [A-Z] will match just uppercase Latin letters), or use character classes (e.g., [[:upper:]] will match characters defined as uppercase in the current locale — Latin, Cyrillic, …).
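For illustration, here is a sketch of the Demo script from the post with the range swapped for the character class. The report_upper function wrapper is only there to make it easy to try out and is not part of the original script:

```sh
#!/bin/sh
# Same loop as Demo, but using the POSIX character class [[:upper:]]
# instead of the locale-sensitive range [A-Z].
report_upper() {
  for i in "$@"; do
    case "$i" in
      *[[:upper:]]*) echo "$i has upper case";;
    esac
  done
}

report_upper "$@"
```

Run with the arguments a b C y z, this should report only C. In a non-C locale it may additionally accept non-Latin upper case letters, which is the point of the class.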

Sergey:

So to be explicit – the fact that [A-Z] matches all of 'b'..'z' but not 'a', and the corresponding symmetry that [a-z] matches 'A'..'Y' but not 'Z' is conforming, i.e. bash is in fact correctly implementing the spec and Chris’ issue is strictly speaking NOTABUG?

There is nothing else to call this but batshit insane.

Chris:

Are [[:upper:]] and [[:lower:]] afflicted with crazy too?

That's horrible. I'm sure my site would have been burned by now if we didn't use LANG=C for historical reasons.

FWIW, on Fedora/RHEL there's a standard location to override locale settings: /etc/sysconfig/i18n and ~/.i18n

e.g. LC_COLLATE=POSIX

By cks at 2014-07-03 13:53:12:

I've now tested [[:upper:]] and it seems to work correctly. I assume [[:lower:]] will as well. Dash supports them so I assume they're generally POSIX. I can't say I'm enthused about the shift and it doesn't cover subsets of upper case (or lower case).
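One possible way to handle a subset of upper case without a range, sketched here as an assumption rather than something anyone in this thread proposed, is to list the wanted characters explicitly. The has_upper_hex_letter name is hypothetical:

```sh
#!/bin/sh
# An explicit list of characters involves no range, so the collation
# order of the current locale should not come into play.
# has_upper_hex_letter is a made-up helper: it succeeds if its argument
# contains one of the upper case hex letters A through F.
has_upper_hex_letter() {
  case "$1" in
    *[ABCDEF]*) return 0;;
    *) return 1;;
  esac
}

for i in "$@"; do
  if has_upper_hex_letter "$i"; then
    echo "$i has an upper case hex letter"
  fi
done
```

This gets tedious for anything much bigger than a handful of letters, but it avoids depending on what a range means.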

(At some point soon we won't have any machines where /bin/sh is not a POSIX shell, but for now we're stuck with Solaris machines.)

Search for "GNU Rational Range Interpretation".

Recently GNU awk, grep etc. have been changing so that a-z means just that. Bash supports that too by setting globasciiranges; Chet, the Bash maintainer, tells me that will be the default soon.
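A rough sketch of using that option, assuming a Bash recent enough to have it (4.3 or later, as far as I know). The fallback to LC_ALL=C and the report_upper_range name are additions for illustration only:

```sh
#!/bin/bash
# Ask Bash to treat bracket ranges like [A-Z] as if in the C locale.
# Shells that do not know the option (older Bash, or another shell
# standing in as sh) fail here, so fall back to forcing the C locale.
if ! shopt -s globasciiranges 2>/dev/null; then
  LC_ALL=C
  export LC_ALL
fi

# report_upper_range is a hypothetical wrapper around the Demo loop.
report_upper_range() {
  for i in "$@"; do
    case "$i" in
      *[A-Z]*) echo "$i has upper case";;
    esac
  done
}

report_upper_range "$@"
```

Unlike setting LC_ALL=C unconditionally, the shopt route leaves the rest of the locale alone for the script and the programs it runs.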

I think the explanation of why this is NOTABUG is very reasonable.

And this post is very interesting.

So: why not just set the locale to C within your shell script?

By cks at 2014-07-03 17:39:00:

Because it is not one shell script, it is every shell script. Bash's handling of locales here taints every shell script and forces every shell script to work around it. This is a terrible decision and a terrible idea; it puts all of the work in exactly the wrong place and ensures that even one error or omission can have explosive results. When you have a central problem the right place to solve it is in the center, not all over the periphery.

Written on 02 July 2014.


Last modified: Wed Jul 2 23:02:44 2014