Wandering Thoughts archives

2018-02-28

egrep's -o argument is great for extracting unusual fields

In Unix, many files and other streams of text are nicely structured so you can extract bits of them with straightforward tools like awk. Fields are nicely separated by whitespace (or by some simple thing that you can easily match on), the information you want is only in a single field, and the field is at a known and generally fixed offset (either from the start of the line or the end of the line). However, not all text is like this. Sometimes it's because people have picked bad formats. Sometimes it's just because that's how the data comes to you; perhaps you have full file paths and you want to extract one component of the path that has some interesting characteristic, such as starting with a '.'.

For example, recently we wanted to know if people here stored IMAP mailboxes in or under directories whose name started with a dot, and if they did, what directory names they used. We had full paths from IMAP subscriptions, but we didn't care about the whole path, just the interesting directory names. Tools like awk are not a good match for this; even with 'awk -F/' we'd have to dig out the fields that start with a dot.

(There's a UNIX pipeline solution to this problem, of course.)

Fortunately, these days I have a good option for this, and that is (e)grep's -o argument. I learned about it several years ago due to a comment on this entry of mine, and since then it's become a tool that I reach for increasingly often. What -o does is described by the manpage this way (for GNU grep):

Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.

What this really means is extract and print regular expression based field(s) from the line. The straightforward use is to extract a full field, for example:

egrep -o '(^|/)\.[^/]+' <filenames

This extracts just the directory name of interest (or I suppose the file name, if there is a file that starts with a dot). It also shows that we may need to post-process the result of an egrep -o field extraction; in this case, some of the names will have a '/' on the front and some won't, and we probably want to remove that /.

Another trick with egrep -o is to use it to put fields into consistent places. Suppose that our email system's logs have a variety of messages that can be generated when a sending IP address is in a DNS blocklist. The full log lines vary but they all contain a portion that goes 'sender IP <IP> [stuff] in <DNSBL>'. We would like to extract the sender IP address and perhaps the DNSBL. Plain 'egrep -o' doesn't do this directly, but it will put the two fields we care about into consistent places:

egrep -o 'sender IP .* in .*(DNSBL1|DNSBL2)' <logfile |
   awk '{print $3, $(NF)}'

Another option for extracting fields from in the middle of a large message is to use two or more egreps in a pipeline, with each egrep successively refining the text down to just the bits you're interested in. This is useful when the specific piece you're interested in occurs at some irregular position inside a longer portion that you need to use as the initial match.

(I'm not going to try to give an example here, as I don't have any from stuff I've done recently enough to remember.)

Since you can use grep with multiple patterns (by providing multiple -e arguments), you can use grep -o to extract several fields at once. However the limitation of this is that each field comes out on its own line. There are situations where you'd like all fields from one line to come out on the same line; basically you want the original line with all of the extraneous bits removed (all of the bits except the fields you care about). If you're in this situation, you probably want to turn to sed instead.

In theory sed can do all of these things with the s substitution command, regular expression sub-expressions, and \<N> substitutions to give back just the sub-expressions. In practice I find egrep -o to almost always be the simpler way to go, partly because I can never remember the exact way to make sed regular expressions do sub-expression matching. Perhaps I would if I used sed more often, but I don't.

(I once wrote a moderately complex and scary sed script mostly to get myself to use various midlevel sed features like the pattern space. It works and I still use it every day when reading my email, but it also convinced me that I didn't want to do that again and sed was a write-mostly language.)

In short, any time I want to extract a field from a line and awk won't do it (or at least not easily), I turn to 'egrep -o' as the generally quick and convenient option. All I have to do is write a regular expression that matches the field, or enough more than the field that I can then either extract the field with awk or narrow things down with more use of egrep -o.

PS: grep -o is probably originally a GNU-ism, but I think it's relatively widespread. It's in OpenBSD's grep, for example.

PPS: I usually think of this as an egrep feature, but it's in plain grep and even fgrep (and if I think about it, I can actually think of uses for it in fgrep). I just reflexively turn to egrep over grep if I'm doing anything complicated, and using -o definitely counts.

Sidebar: The Unix pipeline solution to the filename problem

In the spirit of the original spell implementation:

tr '/' '\n' <fullpaths | grep '^\.' | sort -u

All we want is path components, so the traditional Unix answer is to explode full paths into path components (done here with tr). Once they're in that form, we can apply the full power of normal Unix things to them.

EgrepOFieldExtraction written at 23:29:47; Add Comment


Page tools: See As Normal.
Search:
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.