egrep's -o argument is great for extracting unusual fields
In Unix, many files and other streams of text are nicely structured
so you can extract bits of them with straightforward tools like
awk. Fields are nicely separated by whitespace (or by some simple
thing that you can easily match on), the information you want is
only in a single field, and the field is at a known and generally
fixed offset (either from the start of the line or the end of the
line). However, not all text is like this. Sometimes it's because
people have picked bad formats. Sometimes
it's just because that's how the data comes to you; perhaps you
have full file paths and you want to extract one component of the
path that has some interesting characteristic, such as starting
with a '.'.
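The easy case is something like pulling home directories out of /etc/passwd, where the field separator is a simple ':' and the field is at a fixed position:
awk -F: '{print $6}' /etc/passwd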
For example, recently we wanted to know if people here stored IMAP
mailboxes in or under directories whose name started with a dot,
and if they did, what directory names they used. We had full paths
from IMAP subscriptions, but
we didn't care about the whole path, just the interesting directory
names. Tools like awk are not a good match for this; even with
'awk -F/' we'd have to dig out the fields that start with a dot,
as shown in the sketch below.
(There's a UNIX pipeline solution to this problem, of course.)
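As a sketch of what that awk digging would look like (using the same '<filenames' input as the egrep version further down):
awk -F/ '{for (i = 1; i <= NF; i++) if ($i ~ /^\./) print $i}' <filenames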
Fortunately, these days I have a good option for this, and that is
(e)grep's -o argument. I learned about it several years ago due
to a comment on this entry of mine, and
since then it's become a tool that I reach for increasingly often.
What -o does is described by the manpage this way (for GNU grep):
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
What this really means is: extract and print regular-expression-based fields from the line. The straightforward use is to extract a full field, for example:
egrep -o '(^|/)\.[^/]+' <filenames
This extracts just the directory name of interest (or I suppose the
file name, if there is a file that starts with a dot). It also shows
that we may need to post-process the result of an egrep -o field
extraction; in this case, some of the names will have a '/' on the
front and some won't, and we probably want to remove that /.
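A small sketch of that post-processing, assuming we also want only the unique names:
egrep -o '(^|/)\.[^/]+' <filenames | sed 's;^/;;' | sort -u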
Another trick with egrep -o is to use it to put fields into
consistent places. Suppose that our email system's logs have a
variety of messages that can be generated when a sending IP address
is in a DNS blocklist. The full log lines vary but they all contain
a portion that goes 'sender IP <IP> [stuff] in <DNSBL>'. We would
like to extract the sender IP address and perhaps the DNSBL. Plain
'egrep -o' doesn't do this directly, but it will put the two fields
we care about into consistent places:
egrep -o 'sender IP .* in .*(DNSBL1|DNSBL2)' <logfile | awk '{print $3, $(NF)}'
Another option for extracting fields from the middle of a large
message is to use two or more egreps in a pipeline, with each
egrep successively refining the text down to just the bits you're
interested in. This is useful when the specific piece you're
interested in occurs at some irregular position inside a longer
portion that you need to use as the initial match.
(I'm not going to try to give an example here, as I don't have any from stuff I've done recently enough to remember.)
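Still, a constructed sketch of the idea, reusing the hypothetical DNSBL log lines from above: if all we care about is the sender IP, each successive egrep can narrow the match down further:
egrep -o 'sender IP .* in (DNSBL1|DNSBL2)' <logfile | egrep -o '^sender IP [0-9.]+' | egrep -o '[0-9.]+$'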
Since you can use grep with multiple patterns (by providing
multiple -e arguments), you can use grep -o to extract several
fields at once. However, the limitation of this is that each field
comes out on its own line. There are situations where you'd like
all fields from one line to come out on the same line; basically
you want the original line with all of the extraneous bits removed
(all of the bits except the fields you care about). If you're in
this situation, you probably want to turn to sed instead.
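To illustrate the multiple-pattern form with the hypothetical DNSBL lines again, where each extracted field comes out on its own output line:
egrep -o -e 'sender IP [0-9.]+' -e 'in (DNSBL1|DNSBL2)' <logfile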
In theory sed can do all of these things with the s substitution
command, regular expression sub-expressions, and \<N> substitutions
to give back just the sub-expressions. In practice I find egrep -o
to almost always be the simpler way to go, partly because I can
never remember the exact way to make sed regular expressions do
sub-expression matching. Perhaps I would if I used sed more often,
but I don't.
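For the record, here is a sketch of the sed version of the DNSBL extraction, using -E for extended regular expressions (which both GNU and BSD sed support):
sed -nE 's/.*sender IP ([0-9.]+).* in (DNSBL1|DNSBL2).*/\1 \2/p' <logfile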
(I once wrote a moderately complex and scary sed script mostly
to get myself to use various midlevel sed features like the pattern
space. It works and I still use it every day when reading my email,
but it also convinced me that I didn't want to do that again and
that sed was a write-mostly language.)
In short, any time I want to extract a field from a line and awk
won't do it (or at least not easily), I turn to 'egrep -o' as the
generally quick and convenient option. All I have to do is write a
regular expression that matches the field, or enough more than the
field that I can then either extract the field with awk or narrow
things down with more use of egrep -o.
PS: grep -o is probably originally a GNU-ism, but I think it's
relatively widespread. It's in OpenBSD's grep, for example.
PPS: I usually think of this as an egrep feature, but it's in
plain grep and even fgrep (and if I think about it, I can
actually come up with uses for it in fgrep). I just reflexively
turn to egrep over grep if I'm doing anything complicated,
and using -o definitely counts.
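For example, 'fgrep -o' lets you count how many times a fixed string occurs in total, rather than how many lines it occurs on:
fgrep -o 'some fixed string' logfile | wc -l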
Sidebar: The Unix pipeline solution to the filename problem
In the spirit of the original spell implementation:
tr '/' '\n' <fullpaths | grep '^\.' | sort -u
All we want is path components, so the traditional Unix answer is
to explode full paths into path components (done here with tr).
Once they're in that form, we can apply the full power of normal
Unix things to them.