A trick for dealing with irregular multi-word lines in shell scripts

March 1, 2012

Suppose that you have a bunch of lines in a loose 'key=value' format that look like this:

<timestamp> key1=value1 key2=value2 key3=value3 ...

Also, let's suppose that the fields and their ordering aren't constant; for example, some lines omit key2 and its value. If it weren't for this inconsistency, there are lots of Unix tools that you could use; with it, I can't think of a Unix program that naturally deals with this format (one where you can say 'give me key1 and key7' as easily as you can get field 1 and field 7 in awk).

Fortunately, Unix gives us some brute force tricks.

Selecting lines based on field contents is pretty easy:

grep ' key1=[^ ]*example'

(The space before the key name may not be necessary depending on what key names your file uses.)
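To see why the leading space can matter, here is a small sketch with made-up lines, one of which has a hypothetical key called subkey1 that merely ends in 'key1'. It counts matching lines with and without the space.

```shell
#!/bin/sh
# Made-up lines; the second has a key that merely ends in 'key1'.
sample='t1 key1=www.example key2=a
t2 subkey1=www.example key2=b
t3 key1=other key2=www.example'

# Without the leading space, 'subkey1=' also matches.
loose=$(printf '%s\n' "$sample" | grep -c 'key1=[^ ]*example')
# With it, only a field named exactly 'key1' matches.
strict=$(printf '%s\n' "$sample" | grep -c ' key1=[^ ]*example')

printf 'loose=%s strict=%s\n' "$loose" "$strict"
```

The third line matches neither pattern, because `[^ ]*` cannot cross the space into the next field.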

I don't have any clever tricks if you want to aggregate or otherwise process several fields, but if you just want to pull out and analyze one field there is a brute force trick that you can often use. Let me show you a full command example:

egrep ' p=(1|0\.9)' | tr ' ' '\012' | grep '^f=' | sed 's/.*@//' | howmany | sed 20q

The important trick is the tr combined with the grep. The tr breaks each log file line apart so that each 'key=value' pair is on its own line (by turning the spaces that separate fields into newlines). Once each key=value pair is on a separate line, we can select just the field we want and process it. Meanwhile the initial egrep is selecting which whole lines we want to work on before the tr slices everything apart.
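As a concrete illustration, here is a minimal sketch of the same pipeline run against a few made-up log lines. The sample data and its f=, p= and s= values are invented, and since howmany is a local script, `sort | uniq -c | sort -rn` stands in for it on the assumption that it counts and ranks its input lines.

```shell
#!/bin/sh
# Made-up sample lines in '<timestamp> key=value ...' form; note that
# the second line omits the p= field entirely.
sample='2012-03-01T10:00:00 f=alice@example.com p=1 s=ok
2012-03-01T10:00:01 f=bob@example.org s=ok
2012-03-01T10:00:02 f=carol@example.com p=0.9 s=ok
2012-03-01T10:00:03 f=dave@example.net p=0.2 s=ok
2012-03-01T10:00:04 f=erin@example.org p=1 s=ok'

# 1. grep -E picks whole lines by the value of one field.
# 2. tr puts every space-separated word on its own line.
# 3. grep keeps only the f= words; sed strips everything up to the '@'.
# 4. sort | uniq -c | sort -rn is an assumed stand-in for 'howmany'.
result=$(printf '%s\n' "$sample" |
    grep -E ' p=(1|0\.9)' |
    tr ' ' '\012' |
    grep '^f=' |
    sed 's/.*@//' |
    sort | uniq -c | sort -rn |
    sed 20q)

printf '%s\n' "$result"
```

On this sample it reports a count of two for example.com and one for example.org. The lines with no p= field or with p=0.2 are dropped by the first grep, before anything is split apart.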

Of course, you don't necessarily need the lines to be in 'key=value' format. A variant of this 'split words into separate lines' trick can be done to any file format where you can somehow match the individual 'words' that you want to further process. And you don't have to split on spaces; any distinguishing character will do.

(If the field separator is several characters you can split things with sed. I used tr here because it's simpler for single-character splitting.)
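A minimal sketch of the sed variant, assuming a made-up format where the fields are separated by the two characters '; ' and the values themselves may contain spaces:

```shell
#!/bin/sh
# Made-up line whose fields are separated by the two characters '; '.
line='key1=value one; key2=value two; key3=value three'

# Replace every '; ' with a newline. A backslash followed by a literal
# newline in the replacement is the portable way to write this;
# GNU sed also accepts 's/; /\n/g'.
key2=$(printf '%s\n' "$line" |
    sed 's/; /\
/g' |
    grep '^key2=')

printf '%s\n' "$key2"
```

This prints `key2=value two`. Splitting on spaces with tr would have broken that value in half.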

I call this brute force because we're not doing anything particularly clever to extract just the words we care about from inside each line. Instead we're slicing up everything and then throwing most of the pieces away.


Comments on this page:

From 146.6.208.17 at 2012-03-02 14:22:02:

I suppose I don't understand the problem... What's wrong with:

grep -o '[^ ]\+=[^ ]\+

For example, this extracts the key=value pairs and ignores 'the other stuff':

echo "<timestamp> key1=value1 the other stuff key2=value2 key3=value3" \
| grep -o '[^ ]\+=[^ ]\+'

From 146.6.208.17 at 2012-03-02 14:25:16:

Oops, I forgot to paste in the closing quote on the grep regex.

Should be:

grep -o '[^ ]\+=[^ ]\+'

From 70.26.88.56 at 2012-03-02 17:34:08:

It should be possible to use awk:

awk '{
    for (n = 1; n <= NF; n++) {
        # print any field that starts with the key we want
        if ($n ~ /^keyX=/) {
            print $n;
        }
    }
}'

If you want the value of the "keyX=valueX" pair, you could use 'split($n, ARRAY ,"=")', and then print out ARRAY[2].
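Here is a sketch of that split() approach as a complete command, run on made-up input in which the second line has no key2 at all. The awk variable `want` and the array name `parts` are introduced here for illustration.

```shell
#!/bin/sh
# Made-up input; the second line has no key2 field at all.
sample='t1 key1=a key2=b key3=c
t2 key1=d key3=e
t3 key3=f key2=g'

# For each line, scan all fields for the wanted key and print its value.
# 'want' is an awk variable introduced here to hold the key name.
values=$(printf '%s\n' "$sample" | awk -v want=key2 '
{
    for (n = 1; n <= NF; n++) {
        # split() puts the key in parts[1] and the value in parts[2].
        if (split($n, parts, "=") >= 2 && parts[1] == want)
            print parts[2]
    }
}')

printf '%s\n' "$values"
```

This prints `b` and `g`, one per line. Note that a value which itself contains '=' would be cut off at the first one, since only parts[2] is printed.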

I sometimes use something along these lines when dealing with LDIF: change the record separator (RS) from a single newline to a double one ("\n\n"), and the field separator from space to a single newline. Then each LDIF block/stanza is a record, and each line is a field. Searching for a particular attribute-value pair is just a matter of looping through NF as above. (This breaks if you have multiline LDAP attributes though.)
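A rough sketch of the LDIF idea, using two made-up entries. It sets RS to the empty string, which is awk's standard 'paragraph mode', instead of "\n\n", because a multi-character RS is not supported by every awk.

```shell
#!/bin/sh
# Two made-up LDIF entries separated by a blank line.
ldif='dn: uid=alice,ou=people,dc=example,dc=com
uid: alice
mail: alice@example.com

dn: uid=bob,ou=people,dc=example,dc=com
uid: bob
mail: bob@example.org'

# RS="" makes each blank-line-separated block one record;
# FS="\n" makes each line of the block one field.
mails=$(printf '%s\n' "$ldif" | awk '
BEGIN { RS = ""; FS = "\n" }
{
    for (n = 1; n <= NF; n++)
        if ($n ~ /^mail: /)
            print $n
}')

printf '%s\n' "$mails"
```

This prints the mail: line from each of the two entries.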

By cks at 2012-03-02 18:25:24:

The quick answer about 'grep -o' is that I didn't know about the -o option until now. It's clearly a great way to handle this:

egrep -o ' (key1|key2|key3)=[^ ]*'
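A small sketch of this on made-up lines, assuming a grep that supports -o (GNU and BSD grep both do, although it is not in POSIX):

```shell
#!/bin/sh
# Made-up lines; the second one has no key1.
sample='t1 key1=a key2=b key3=c key4=d
t2 key3=e key4=f'

# -o prints each match on its own line; sed then drops the leading
# space that the pattern requires in front of the key name.
picked=$(printf '%s\n' "$sample" |
    grep -E -o ' (key1|key3)=[^ ]*' |
    sed 's/^ //')

printf '%s\n' "$picked"
```

This prints `key1=a`, `key3=c` and `key3=e`, one per line. As with the tr approach, the output no longer shows which input line each pair came from.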

While awk and a bunch of other tools can deal with this format, they can't do so anywhere near as naturally as they work with fixed fields, where you can just do awk '{print $1, $7}'.

Written on 01 March 2012.

Last modified: Thu Mar 1 22:53:18 2012