GNU sort and -k: a gotcha

November 5, 2010

Sometimes I don't like GNU utilities.

Suppose that you have a file that looks like this:

oxygen     fred
nitrogen   bob
xenon      fred
carbon     cks
iron       jim

Further suppose that you want to sort it by the second field, to group machines by the users that own them. So of course you do 'sort -k2 file', because that's the obvious answer. Except that it doesn't work; it sorts in some peculiar, non-obvious way, and it's not that you need to specify field 1 or field 3 or anything like that. Perhaps you scratch your head, grind your teeth, and move on. (That's what I did until recently.)

Congratulations, you've been hit by a GNU sort gotcha; sort doesn't define fields the way you think it does. Pretty much every other sensible Unix program that deals with multi-field lines says that fields are separated by one or more whitespace characters. GNU sort, just to be different, says that fields are not so much separated by whitespace as created by it: the run of whitespace before a field becomes the leading part of that field.
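Here is a small sketch of the effect on the example file. Two things in it are assumptions added for illustration: the file is rebuilt with printf so that the padding is unambiguous, and LC_ALL=C forces plain byte-order comparison (other locales may treat the spaces differently, which only makes the result more confusing).

```shell
# Rebuild the example file: each name is padded to 10 columns, so the run
# of spaces in front of the owner has a different length on most lines.
data=$(printf '%-10s %s\n' oxygen fred nitrogen bob xenon fred carbon cks iron jim)

# LC_ALL=C is an assumption made here so that keys compare byte by byte.
plain=$(printf '%s\n' "$data" | LC_ALL=C sort -k2)
printf '%s\n' "$plain"

# With GNU sort in the C locale this should print:
#   iron       jim
#   xenon      fred
#   carbon     cks
#   oxygen     fred
#   nitrogen   bob
# The key for each line starts with its padding, and a space sorts before
# any letter, so the lines with the most padding come first.
```

In other words, the output is ordered mostly by how short the machine name is, which is about as far from "grouped by owner" as you can get.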

(This is spelled out in the info document for GNU sort in the section on the -t argument. Read it carefully.)

This works out the way you innocently expect if each line separates fields with the same number of whitespace characters, or if you are using -n even with a variable number of whitespace characters (at least in my testing). It goes off the rails badly in cases like this example, where fields are separated by a variable number of whitespace characters.
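The -n case can be checked with a similar sketch. The data here is made up for illustration (a name followed by a count after a varying run of spaces), and LC_ALL=C is again an assumption to keep the comparison predictable.

```shell
# Hypothetical data: one, three and six spaces before the number.
counts=$(printf '%s\n' 'a   10' 'bb 9' 'ccc      100')

# A numeric key skips leading blanks on its own, so the padding
# in front of each number should not affect the order.
numeric=$(printf '%s\n' "$counts" | LC_ALL=C sort -k2n)
printf '%s\n' "$numeric"

# Expected order of the first column with GNU sort: bb, a, ccc
# (that is, 9, then 10, then 100).
```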

The solution is to add the -b argument, which makes GNU sort work the way you expect it to. I am tempted to make an alias (well, a cover script) that always supplies -b, because I can't think of any situation where I don't want this sane behavior.
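As a sketch of the fix, here is the same example file sorted with -b. The printf reconstruction of the file, the LC_ALL=C setting and the function name bsort are all assumptions made up for this illustration; the function is just one possible shape for the cover script.

```shell
# Rebuild the example file with names padded to 10 columns.
data=$(printf '%-10s %s\n' oxygen fred nitrogen bob xenon fred carbon cks iron jim)

# Global -b: leading blanks are skipped when locating the start of the key.
fixed=$(printf '%s\n' "$data" | LC_ALL=C sort -b -k2)
printf '%s\n' "$fixed"

# With GNU sort this should print the machines grouped by owner:
#   nitrogen   bob
#   carbon     cks
#   oxygen     fred
#   xenon      fred
#   iron       jim

# The same thing spelled as a per-key modifier.
perkey=$(printf '%s\n' "$data" | LC_ALL=C sort -k2b)

# Hypothetical cover function that always supplies -b.
bsort() { sort -b "$@"; }
```

One caveat for such a wrapper: GNU sort only applies global options to keys that have no options of their own, so something like 'bsort -k2r' would not pick up the -b; in that case the b has to go on the key itself.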

(GNU sort's behavior is in fact in violation of the Single Unix Specification for sort; see the description of the -t option.)


Comments on this page:

From 66.134.136.66 at 2010-11-05 02:39:45:

I think this is something GNU picked up from vendor versions of sort, and though it may violate the specification, I think it actually makes sense.

If a maximal sequence of blanks is the default field separator, does this mean that blanks at the start of a line introduce a null initial field? The specification makes it pretty clear they don't, and I agree with this. If blanks at the start of a line are not a separator, are they part of the first field, but not other fields? Are they just ignored? Either requires treating blanks at the start of a line differently from blanks elsewhere, and I hate having to remember exceptions.

Aside from consistency, one functional advantage of treating leading blanks as part of a key field is that it allows sorting blank-padded integer values without having to indicate that the field is numeric. It also works for octal and hexadecimal without having to do anything special.

From 180.0.195.73 at 2010-11-08 07:32:15:

In the Single Unix Specification page that you link to, look at the section "APPLICATION USAGE". I think GNU sort is just conforming to the standard here.

By cks at 2010-11-08 09:39:12:

The APPLICATION USAGE section's example appears to contradict the wording of the description of the -t option. Unless there is some way to reconcile them, the standard text wins; the APPLICATION USAGE section is explicitly labeled as (merely) informative.

From 84.203.137.218 at 2010-12-06 08:22:52:

Note that the global -b is ineffective if you specify per-key types. Also, sort -k1b,1 is different from -k1,1b. To help with this mess (caused by backwards compatibility), you can use the new sort --debug option added in coreutils 8.6, which highlights the part of the line significant in the sort and warns about questionable options.

Thanks for the excellent blog, BTW!

From 84.203.137.218 at 2010-12-06 08:48:28:

Hmm I forgot I wrote a blog post myself about this:

http://www.pixelbeat.org/patches/coreutils/sort-debug/

This shows the key annotation and example warnings.

Written on 05 November 2010.
Last modified: Fri Nov 5 00:54:33 2010