2018-02-28
egrep's -o argument is great for extracting unusual fields
In Unix, many files and other streams of text are nicely structured
so you can extract bits of them with straightforward tools like awk.
Fields are nicely separated by whitespace (or by some simple
thing that you can easily match on), the information you want is
only in a single field, and the field is at a known and generally
fixed offset (either from the start of the line or the end of the
line). However, not all text is like this. Sometimes it's because
people have picked bad formats. Sometimes
it's just because that's how the data comes to you; perhaps you
have full file paths and you want to extract one component of the
path that has some interesting characteristic, such as starting
with a '.'.
For example, recently we wanted to know if people here stored IMAP
mailboxes in or under directories whose name started with a dot,
and if they did, what directory names they used. We had full paths
from IMAP subscriptions, but
we didn't care about the whole path, just the interesting directory
names. Tools like awk are not a good match for this; even with
'awk -F/' we'd have to dig out the fields that start with a dot.
(There's a UNIX pipeline solution to this problem, of course.)
Fortunately, these days I have a good option for this, and that is
(e)grep's -o argument. I learned about it several years ago due
to a comment on this entry of mine, and
since then it's become a tool that I reach for increasingly often.
What -o does is described by the manpage this way (for GNU grep):
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
What this really means is extract and print regular expression based field(s) from the line. The straightforward use is to extract a full field, for example:
egrep -o '(^|/)\.[^/]+' <filenames
This extracts just the directory name of interest (or I suppose the
file name, if there is a file that starts with a dot). It also shows
that we may need to post-process the result of an egrep -o field
extraction; in this case, some of the names will have a '/' on the
front and some won't, and we probably want to remove that /.
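One way to do that post-processing is a small sed step after the grep -o. This is a minimal sketch rather than the exact commands we ran; the sample paths are invented for illustration, and it uses grep -E, which is equivalent to egrep.

```shell
# Hypothetical sample paths standing in for the real IMAP subscription list.
paths='mail/.archive/2017
.imap/INBOX
home/user/.mail/lists/.hidden
Mail/work'

# Extract each dot-directory component, strip the leading '/' that
# some matches carry, and collapse duplicates.
dotnames=$(printf '%s\n' "$paths" |
    grep -E -o '(^|/)\.[^/]+' |
    sed 's|^/||' |
    LC_ALL=C sort -u)

printf '%s\n' "$dotnames"
```

The last path has no dot component, so it contributes nothing to the output.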
Another trick with egrep -o is to use it to put fields into
consistent places. Suppose that our email system's logs have a
variety of messages that can be generated when a sending IP address
is in a DNS blocklist. The full log lines vary but they all contain
a portion that goes 'sender IP <IP> [stuff] in <DNSBL>'. We would
like to extract the sender IP address and perhaps the DNSBL. Plain
'egrep -o' doesn't do this directly, but it will put the two fields
we care about into consistent places:
egrep -o 'sender IP .* in .*(DNSBL1|DNSBL2)' <logfile | awk '{print $3, $(NF)}'
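To see why this works, here is the same pipeline run over a few made-up log lines. The log format, hostnames and addresses are assumptions for illustration only; DNSBL1 and DNSBL2 are the same placeholders as above, and grep -E stands in for egrep.

```shell
# Hypothetical log lines; only the 'sender IP <IP> [stuff] in <DNSBL>'
# portion follows the format described above.
log='Feb 27 10:01:02 mx rejected: sender IP 192.0.2.10 is listed in DNSBL1 (see policy)
Feb 27 10:05:40 mx warning: sender IP 198.51.100.7 found in DNSBL2; deferring
Feb 27 10:07:11 mx accepted message from 203.0.113.5'

# After grep -o, the IP is always the third word and the DNSBL is
# always the last one, regardless of what surrounded the match.
hits=$(printf '%s\n' "$log" |
    grep -E -o 'sender IP .* in .*(DNSBL1|DNSBL2)' |
    awk '{print $3, $(NF)}')

printf '%s\n' "$hits"
```

The third line has no 'sender IP' portion, so grep drops it entirely.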
Another option for extracting fields from the middle of a large
message is to use two or more egreps in a pipeline, with each egrep
successively refining the text down to just the bits you're
interested in. This is useful when the specific piece you're
interested in occurs at some irregular position inside a longer
portion that you need to use as the initial match.
(I'm not going to try to give an example here, as I don't have any from stuff I've done recently enough to remember.)
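As a purely hypothetical illustration (this is not from anything I actually ran, and the log line is invented), a two-stage version might look like this. The first grep narrows the line down to the portion we trust, and the second picks the address out of that portion, so the other IP address earlier in the line is never considered.

```shell
# Hypothetical log line that mentions two IP addresses; we only want
# the one that follows 'sender IP'.
line='Feb 27 10:01:02 mx [203.0.113.9] rejected: sender IP 192.0.2.10 is listed in DNSBL1'

# Stage 1 narrows the line to the 'sender IP ... DNSBL1' portion;
# stage 2 pulls the dotted-quad address out of that portion.
sender=$(printf '%s\n' "$line" |
    grep -E -o 'sender IP .* in .*DNSBL1' |
    grep -E -o '[0-9]+(\.[0-9]+){3}')

printf '%s\n' "$sender"
```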
Since you can use grep with multiple patterns (by providing multiple
-e arguments), you can use grep -o to extract several fields at
once. However, the limitation of this is that each field
comes out on its own line. There are situations where you'd like
all fields from one line to come out on the same line; basically
you want the original line with all of the extraneous bits removed
(all of the bits except the fields you care about). If you're in
this situation, you probably want to turn to sed instead.
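A small sketch of the one-line-per-field behavior, using an invented log line with hypothetical user= and ip= fields:

```shell
# Hypothetical log line with two fields of interest.
line='Feb 27 10:01:02 user=alice action=login ip=192.0.2.10'

# Each -e pattern describes one field; every match is printed on its
# own line, in the order the matches appear in the input line.
fields=$(printf '%s\n' "$line" |
    grep -o -e 'user=[^ ]*' -e 'ip=[^ ]*')

printf '%s\n' "$fields"
```

With many input lines, nothing in the output tells you which user= goes with which ip=, which is the limitation in question.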
In theory sed can do all of these things with the s substitution
command, regular expression sub-expressions, and \<N> substitutions
to give back just the sub-expressions. In practice I find egrep -o
to almost always be the simpler way to go, partly because I can
never remember the exact way to make sed regular expressions do
sub-expression matching. Perhaps I would if I used sed more often,
but I don't.
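For comparison, here is roughly what the sed version of the DNSBL extraction could look like. Treat it as a sketch under assumptions: the log lines are invented, and it relies on the -E option for extended regular expressions, which GNU and BSD sed both accept but which your sed may not.

```shell
# Hypothetical log lines in the 'sender IP <IP> [stuff] in <DNSBL>' style.
log='Feb 27 10:01:02 mx rejected: sender IP 192.0.2.10 is listed in DNSBL1 (see policy)
Feb 27 10:05:40 mx warning: sender IP 198.51.100.7 found in DNSBL2; deferring
Feb 27 10:07:11 mx accepted message from 203.0.113.5'

# -n plus the trailing 'p' prints only lines where the substitution
# happened; \1 is the IP address and \2 is the DNSBL name.
pairs=$(printf '%s\n' "$log" |
    sed -E -n 's/.*sender IP ([^ ]+).* in .*(DNSBL1|DNSBL2).*/\1 \2/p')

printf '%s\n' "$pairs"
```

This keeps both fields on one output line per input line, at the cost of a regular expression that has to match the entire line.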
(I once wrote a moderately complex and scary sed script mostly
to get myself to use various midlevel sed features like the pattern
space. It works and I still use it every day when reading my email,
but it also convinced me that I didn't want to do that again and
sed was a write-mostly language.)
In short, any time I want to extract a field from a line and awk
won't do it (or at least not easily), I turn to 'egrep -o' as the
generally quick and convenient option. All I have to do is write a
regular expression that matches the field, or enough more than the
field that I can then either extract the field with awk or narrow
things down with more use of egrep -o.
PS: grep -o is probably originally a GNU-ism, but I think it's
relatively widespread. It's in OpenBSD's grep, for example.
PPS: I usually think of this as an egrep feature, but it's in
plain grep and even fgrep (and if I think about it, I can
actually think of uses for it in fgrep). I just reflexively
turn to egrep over grep if I'm doing anything complicated,
and using -o definitely counts.
Sidebar: The Unix pipeline solution to the filename problem
In the spirit of the original spell implementation:
tr '/' '\n' <fullpaths | grep '^\.' | sort -u
All we want is path components, so the traditional Unix answer is
to explode full paths into path components (done here with tr).
Once they're in that form, we can apply the full power of normal
Unix things to them.
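Here is that pipeline run over a few invented paths (the sample data is hypothetical; the pipeline itself is the one above, with the sort locale pinned so the order is predictable):

```shell
# Hypothetical sample paths standing in for the fullpaths file.
paths='mail/.archive/2017
.imap/INBOX
home/user/.mail/lists/.hidden
Mail/work'

# Explode each path into one component per line, keep only the
# components that start with a dot, and remove duplicates.
components=$(printf '%s\n' "$paths" |
    tr '/' '\n' |
    grep '^\.' |
    LC_ALL=C sort -u)

printf '%s\n' "$components"
```

Unlike the egrep -o version, there is no leading '/' to clean up afterward, because tr has already removed every slash.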
Using Python 3 for example code here on Wandering Thoughts
When I write about Python here, I often wind up having
some example Python code, such as the subCls example in my
recent entry about subclassing a __slots__ class. Mostly this Python code has been
Python 2 by default, with Python 3 as the exception. When I started
writing, Python 3 wasn't even released; then it wasn't really
something you wanted to use; and then I was grumpy about it so I deliberately continued to use Python 2 for
examples here, just as I continued to write programs in it (for
good reasons). Sometimes I explicitly mentioned
that my examples were in Python 2, but sometimes not, and that too
was a bit of my grumpiness in action.
(There was also the small fact that I'm far more familiar with Python 2 than Python 3, so writing Python 2 code is what happens if I don't actively think about it.)
However, things change. Over the past few years I've basically made my peace with Python 3 and these days I'm trying to write new code in Python 3. Although writing my example code here in Python 2 is close to being a reflex, it's one that I want to consciously break. Going forward from now, I'm going to write sample code in Python 3 by default and only use Python 2 if there is some special reason for it (and then mention explicitly that the example is Python 2 instead of 3). This is a small gesture, but I figure it's about time, and it's also probably what more and more readers are just going to expect.
(It looks like I've been doing this inconsistently for a while, or at least testing some of my examples in Python 3 too, and also increasingly linking to the Python 3 version of Python documentation instead of the Python 2 version.)
Actually doing this is going to take me some work and attention.
Since I write Python 2 code by reflex, I'm going to have to
double-check my examples to make sure that they're valid Python 3
(and that they behave the same way in Python 3). Some of the time
this will mean actually testing even small fragments instead of
relying on my Python (2) knowledge to write from memory. Also, when
I'm checking Python's behavior for something (or prototyping some
code), I'll have to remember to run python3 instead of just
python or I'll accidentally wind up testing the wrong Python.
(When I wrote my recent entry I
was quietly careful to make the example code Python 3 code by
including a super() and then using the no-argument version,
which is Python 3 only.)
(I'm writing this entry partly to put a marker in the ground for myself, so that I won't be tempted to let a Python 2 example slide just because I'm feeling lazy and I don't want to work out and verify the Python 3 version.)