Part of good awk programming is getting the clause order right

December 19, 2012

Today I wanted to extract IP addresses from a recording file that had them in a particular format, in repeated stanzas that looked like:

<date>
---HTTP---
  <IP>
  [...]
---SOMETHING---
[possibly other IPs]

I wanted all the HTTP IPs. I'm sure that somewhere there is a convenient multiline grep, but since I didn't have one handy, I reached for awk. As it turns out, solving this problem concisely in awk makes a good example of a possibly underappreciated art in awk programming, namely choosing the order of your clauses.

Here's the awk program I came up with:

/^---/     { p = 0; }
p          { print $0; }
/^---HTTP/ { p = 1; }

Because awk evaluates clauses in order, you can exploit the ordering of clauses as a deliberate part of program logic (and you can also blow your foot off with the wrong choice of order).

Printing of the current line is controlled by p, a flag variable. If p is set, the line is printed; if p is unset, we output nothing. We set p when we see the '---HTTP' lead-in, but since we don't want to print the lead-in itself (just the IP addresses that follow it), we make this the last clause. The IP addresses end with a '---SOMETHING---' line (although we only bother to match the leading '---'), which turns p off; since we don't want that line printed either, this clause comes first. Turning off p in the first clause when the line is '---HTTP---' is harmless, because the final clause turns it back on. Since it's harmless, we don't need a longer match or more complex conditional logic (and it means we don't actually care which marker follows the stanza we want, so long as there is one).
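
To see the clause ordering at work, here is a minimal sketch that runs the three-clause program over a small made-up input. The sample data, the '---SSH---' marker and the shell variable names are all assumptions for illustration (the addresses come from the reserved documentation ranges), not anything from the real file.

```shell
# Hypothetical sample input in the stanza format described above.
# The addresses are documentation-range placeholders, not real data.
sample='2012-12-19
---HTTP---
  192.0.2.10
  192.0.2.11
---SSH---
  198.51.100.7
2012-12-20
---HTTP---
  203.0.113.5
---SSH---
  198.51.100.8'

http_ips=$(printf '%s\n' "$sample" | awk '
/^---/     { p = 0 }      # any marker line turns printing off
p          { print $0 }   # print only while the flag is set
/^---HTTP/ { p = 1 }      # turn printing on, starting with the NEXT line
')

printf '%s\n' "$http_ips"
```

With this input only the three indented addresses under the '---HTTP---' markers should come out; the marker lines, the dates and the addresses under '---SSH---' are all skipped.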

This also shows the flaw of awk. This program is too clever by half, since it's just indirectly expressing the simple logic:

if (line is '---HTTP---')
   p = 1
elif (line starts with '---')
   p = 0
elif (p)
   print $0

In the name of golfing the awk program a bit, I've embedded the elif logic in the ordering of my clauses, where it's much harder to see than if it were written out simply and plainly.
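
For comparison, the plain logic can be written directly in awk as a single action with an if/else chain. This is only a sketch of one way to do it; it keeps the looser /^---HTTP/ match from the short program rather than an exact comparison, and the sample input and variable names are made up for illustration.

```shell
# Hypothetical sample input; addresses are documentation placeholders.
sample='2012-12-19
---HTTP---
  192.0.2.10
---SSH---
  198.51.100.7'

explicit_ips=$(printf '%s\n' "$sample" | awk '
{
    if ($0 ~ /^---HTTP/)
        p = 1          # lead-in: start printing after this line
    else if ($0 ~ /^---/)
        p = 0          # any other marker: stop printing
    else if (p)
        print $0       # ordinary line inside an HTTP stanza
}
')

printf '%s\n' "$explicit_ips"
```

Here the order of the tests is spelled out in one place, so a marker line can never be printed no matter how the branches are later rearranged.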

(For even more fun, you can vary the order of the clauses to control whether the start or end markers will be printed. This would be a much more visible change in the clear version of the program.)
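
As a sketch of that reordering, the two variants below use the same three clauses in different orders. The sample input and variable names are assumptions for illustration. Note that the second variant relies on every HTTP stanza being closed by some other marker before the next '---HTTP---' line appears; otherwise that lead-in would be printed as the "end" marker.

```shell
# Hypothetical sample input; addresses are documentation placeholders.
sample='2012-12-19
---HTTP---
  192.0.2.10
---SSH---
  198.51.100.7'

# Print the start marker too: set the flag BEFORE the print clause runs.
with_start=$(printf '%s\n' "$sample" | awk '
/^---/     { p = 0 }
/^---HTTP/ { p = 1 }
p          { print $0 }
')

# Print the end marker instead: print BEFORE the flag is cleared.
with_end=$(printf '%s\n' "$sample" | awk '
p          { print $0 }
/^---/     { p = 0 }
/^---HTTP/ { p = 1 }
')

printf 'with start marker:\n%s\n' "$with_start"
printf 'with end marker:\n%s\n' "$with_end"
```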


Comments on this page:

From 138.246.85.200 at 2012-12-19 11:50:44:

At first thought, I'd have used

 awk '/^---HTTP/,/^---/'

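... but that doesn't work because the flip-flop goes off again instantly: awk tests the end pattern against the same line that matched the start pattern, and '---HTTP---' matches both. Thus, sed to the rescue: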

 sed -n '/^---HTTP/,/^---/{/^---/!p;}'
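
A small sketch of the difference, using a made-up input (the sample data and variable names are assumptions for illustration): the awk range closes on the very line that opened it, while sed only starts checking the end address on the following line.

```shell
# Hypothetical sample input; addresses are documentation placeholders.
sample='2012-12-19
---HTTP---
  192.0.2.10
---SSH---
  198.51.100.7'

# awk range pattern: start and end both match the ---HTTP--- line,
# so the range is just that one line.
awk_range=$(printf '%s\n' "$sample" | awk '/^---HTTP/,/^---/')

# sed range: the end address is first tested on the line AFTER the
# start line, so the range runs through ---SSH---; the inner
# /^---/!p then drops both marker lines.
sed_range=$(printf '%s\n' "$sample" | sed -n '/^---HTTP/,/^---/{/^---/!p;}')

printf 'awk range: %s\n' "$awk_range"
printf 'sed range: %s\n' "$sed_range"
```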

Last modified: Wed Dec 19 01:28:57 2012