A trick for dealing with irregular multi-word lines in shell scripts

March 1, 2012

Suppose that you have a bunch of lines in what I'll loosely call a 'key=value' format, which look like this:

<timestamp> key1=value1 key2=value2 key3=value3 ...

Also, let's suppose that the fields and their ordering aren't constant; for example, some lines omit key2 and its value. If it weren't for this inconsistency, there are lots of Unix tools that you could use; with it, I can't think of a Unix program that naturally deals with this format (one where you can say 'give me key1 and key7' in the same easy way you can get field 1 and field 7 in awk).

Fortunately, Unix gives us some brute force tricks.

Selecting lines based on field contents is pretty easy:

grep ' key1=[^ ]*example'

(The space before the key name may not be necessary depending on what key names your file uses.)
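As a small sketch of how this selection behaves, here it is run over some made-up input. The sample lines and the sample_lines helper are invented for illustration; only the grep pattern comes from above.

```shell
# Hypothetical sample lines in the '<timestamp> key=value ...' format.
sample_lines() {
    cat <<'EOF'
2012-03-01T10:00:00 key1=foo.example key2=bar key3=baz
2012-03-01T10:00:01 key1=other key3=example
2012-03-01T10:00:02 key2=bar key1=www.example
EOF
}

# Keep only the lines whose key1 value contains 'example'.
# [^ ]* cannot cross a space, so the match stays inside key1's value
# and the second line (where 'example' is in key3) is not selected.
sample_lines | grep ' key1=[^ ]*example'
```

With this input the first and third lines should be selected, regardless of where key1 appears in the line.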

I don't have any clever tricks if you want to aggregate or otherwise process several fields, but if you just want to pull out and analyze one field there is a brute force trick that you can often use. Let me show you a full command example:

egrep ' p=(1|0\.9)' | tr ' ' '\012' | grep '^f=' | sed 's/.*@//' | howmany | sed 20q

The important trick is the tr combined with the grep. The tr breaks each log file line apart so that each 'key=value' pair is on its own line (by turning the spaces that separate fields into newlines). Once each key=value pair is on a separate line, we can select just the field we want and process it. Meanwhile the initial egrep is selecting which whole lines we want to work on before the tr slices everything apart.
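To make the pipeline concrete, here is a rough, self-contained version run on invented input. The sample lines are hypothetical, and since howmany is not a standard tool, the count_values function below is only my assumption about what it does (count identical lines, most frequent first). I've also written grep -E where the original uses egrep.

```shell
# Hypothetical log lines; the 'p', 'f' and 's' values are made up.
sample_log() {
    cat <<'EOF'
t1 p=1 f=alice@example.com s=100
t2 f=bob@example.org p=0.9
t3 p=0.2 f=carol@example.com
t4 p=1 s=50 f=dave@example.com
EOF
}

# Assumed stand-in for 'howmany': count duplicate lines, biggest first.
count_values() {
    sort | uniq -c | sort -rn
}

# Select whole lines first, then split every word onto its own line,
# keep only the f= field, cut it down to the domain, and count.
sample_log |
    grep -E ' p=(1|0\.9)' |
    tr ' ' '\012' |
    grep '^f=' |
    sed 's/.*@//' |
    count_values |
    sed 20q
```

Note that the position of f= within each line doesn't matter; once the tr has run, every field starts at the beginning of its own line, which is what lets the '^f=' anchor work.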

Of course, you don't necessarily need the lines to be in 'key=value' format. A variant of this 'split words into separate lines' trick can be applied to any file format where you can somehow match the individual 'words' that you want to further process. And you don't have to split on spaces; any distinguishing character will do.
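For instance, the same idea works on a comma-separated list. This is only an illustration; the recipients value is made up.

```shell
# Hypothetical comma-separated list of addresses.
recipients='alice@example.com,bob@example.org,carol@example.com'

# Split on commas instead of spaces, then keep the 'words' we care about.
printf '%s\n' "$recipients" | tr ',' '\012' | grep '@example\.com$'
```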

(If the field separator is several characters you can split things with sed. I used tr here because it's simpler for single-character splitting.)
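A minimal sketch of the sed version, assuming a made-up two-character separator of '; '. The record contents and the split_fields name are hypothetical. The backslash followed by a literal newline is the portable way to put a newline in a sed replacement; some sed versions also accept \n there, but not all.

```shell
# Hypothetical record whose fields are separated by '; '.
record='user=alice; host=mail.example.com; status=ok'

# Turn every '; ' into a newline so each field is on its own line.
split_fields() {
    sed 's/; /\
/g'
}

printf '%s\n' "$record" | split_fields | grep '^host='
```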

I call this brute force because we're not doing anything particularly clever to extract just the words we care about from inside each line. Instead we're slicing up everything and then throwing most of the pieces away.

Written on 01 March 2012.