Wandering Thoughts archives

2018-02-28

egrep's -o argument is great for extracting unusual fields

In Unix, many files and other streams of text are nicely structured so you can extract bits of them with straightforward tools like awk. Fields are nicely separated by whitespace (or by some simple thing that you can easily match on), the information you want is only in a single field, and the field is at a known and generally fixed offset (either from the start of the line or the end of the line). However, not all text is like this. Sometimes it's because people have picked bad formats. Sometimes it's just because that's how the data comes to you; perhaps you have full file paths and you want to extract one component of the path that has some interesting characteristic, such as starting with a '.'.

For example, recently we wanted to know if people here stored IMAP mailboxes in or under directories whose name started with a dot, and if they did, what directory names they used. We had full paths from IMAP subscriptions, but we didn't care about the whole path, just the interesting directory names. Tools like awk are not a good match for this; even with 'awk -F/' we'd have to dig out the fields that start with a dot.

(There's a UNIX pipeline solution to this problem, of course.)

Fortunately, these days I have a good option for this, and that is (e)grep's -o argument. I learned about it several years ago due to a comment on this entry of mine, and since then it's become a tool that I reach for increasingly often. What -o does is described by the manpage this way (for GNU grep):

Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.

What this really means is 'extract and print regular expression based field(s) from the line'. The straightforward use is to extract a full field, for example:

egrep -o '(^|/)\.[^/]+' <filenames

This extracts just the directory name of interest (or I suppose the file name, if there is a file that starts with a dot). It also shows that we may need to post-process the result of an egrep -o field extraction; in this case, some of the names will have a '/' on the front and some won't, and we probably want to remove that /.
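As a rough sketch of that post-processing: the sample paths below are made up and dotdirs is just a name picked for illustration. It uses grep -E, which is the same thing as egrep.

```shell
# dotdirs: read full paths on stdin, print the unique dot-names found in them.
dotdirs() {
    # The match may carry a leading '/', so sed strips it before
    # sort -u collapses duplicates.
    grep -Eo '(^|/)\.[^/]+' | sed 's;^/;;' | sort -u
}

# Made-up sample paths, purely for illustration.
printf '%s\n' \
    'mail/.archive/inbox' \
    '.private/lists' \
    'IMAP/work/.old/2017' \
    'plain/folder' | dotdirs
```

With these sample paths this prints .archive, .old and .private, one per line; the path with no dot component contributes nothing.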

Another trick with egrep -o is to use it to put fields into consistent places. Suppose that our email system's logs have a variety of messages that can be generated when a sending IP address is in a DNS blocklist. The full log lines vary but they all contain a portion that goes 'sender IP <IP> [stuff] in <DNSBL>'. We would like to extract the sender IP address and perhaps the DNSBL. Plain 'egrep -o' doesn't do this directly, but it will put the two fields we care about into consistent places:

egrep -o 'sender IP .* in .*(DNSBL1|DNSBL2)' <logfile |
   awk '{print $3, $(NF)}'
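To see what this does, here is a small sketch run against invented log lines. The log format, the IP addresses, and the sbl.example.org and xbl.example.org blocklist names are all assumptions made up for the example, not anything from a real mail system.

```shell
# dnsbl_fields: read log lines on stdin, print "<IP> <DNSBL>" per matching line.
dnsbl_fields() {
    # After grep -o, every line starts with 'sender IP <IP>' and ends with
    # the DNSBL name, so the IP is always field 3 and the DNSBL is the last field.
    grep -Eo 'sender IP .* in .*(sbl\.example\.org|xbl\.example\.org)' |
        awk '{print $3, $NF}'
}

# Hypothetical log lines with differing text around the fields we want.
printf '%s\n' \
    '2018-02-27 10:01:02 rejected RCPT: sender IP 192.0.2.10 is listed in sbl.example.org (see URL)' \
    '2018-02-27 10:05:40 warning: sender IP 198.51.100.7 found in xbl.example.org' \
    '2018-02-27 10:06:00 accepted message from 203.0.113.5' | dnsbl_fields
```

The third line has no 'sender IP' portion, so it produces no output at all.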

Another option for extracting fields from the middle of a large message is to use two or more egreps in a pipeline, with each egrep successively refining the text down to just the bits you're interested in. This is useful when the specific piece you're interested in occurs at some irregular position inside a longer portion that you need to use as the initial match.

(I'm not going to try to give an example here, as I don't have any from stuff I've done recently enough to remember.)
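A purely made-up illustration of the idea, not taken from any real system: suppose a size=N field can appear anywhere inside a bracketed section, and also shows up outside the brackets where it means something else. The log format and the size_in_brackets name are invented here.

```shell
# size_in_brackets: print the size=N that sits inside [...] on each line.
size_in_brackets() {
    # The first grep keeps only the bracketed portion; the second pulls the
    # field out of it, wherever inside the brackets it happens to be.
    grep -Eo '\[[^]]*\]' | grep -Eo 'size=[0-9]+'
}

# Hypothetical lines; the size= outside the brackets must be ignored.
printf '%s\n' \
    'queue size=99 delivered [id=abc size=2048 host=mx1]' \
    'delivered [host=mx2 size=512 id=def] queue size=7' | size_in_brackets
```

This prints size=2048 and size=512, skipping the queue sizes that a single 'size=[0-9]+' match would have picked up as well.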

Since you can use grep with multiple patterns (by providing multiple -e arguments), you can use grep -o to extract several fields at once. However, the limitation of this is that each field comes out on its own line. There are situations where you'd like all fields from one line to come out on the same line; basically you want the original line with all of the extraneous bits removed (all of the bits except the fields you care about). If you're in this situation, you probably want to turn to sed instead.
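A small sketch of the multiple -e behaviour, using an invented log line; the patterns and the xbl.example.org name are assumptions for illustration only.

```shell
# ip_and_dnsbl: print every IPv4-looking string and every *.example.org
# name found on stdin, each match on its own output line.
ip_and_dnsbl() {
    grep -Eo -e '[0-9]+(\.[0-9]+){3}' -e '[a-z]+\.example\.org'
}

# Hypothetical log line containing both fields.
printf '%s\n' 'warning: sender IP 198.51.100.7 found in xbl.example.org' | ip_and_dnsbl
```

The IP address and the blocklist name come out as two separate lines, which is exactly the limitation described above: nothing in the output ties the two back to the same input line.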

In theory sed can do all of these things with the s substitution command, regular expression sub-expressions, and \N back-references (\1, \2, and so on) in the replacement to give back just the sub-expressions. In practice I find egrep -o to almost always be the simpler way to go, partly because I can never remember the exact way to make sed regular expressions do sub-expression matching. Perhaps I would if I used sed more often, but I don't.
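For comparison, a sketch of what the sed version might look like for the DNSBL case. It assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed both do), and the log lines and blocklist names are again invented.

```shell
# ip_and_dnsbl_sed: print "<IP> <DNSBL>" on one line per matching input line.
ip_and_dnsbl_sed() {
    # -n plus the p flag prints only lines where the substitution happened.
    # \1 is the word after 'sender IP', \2 is the word after ' in '.
    sed -nE 's/.*sender IP ([^ ]+) .* in ([^ ]+).*/\1 \2/p'
}

# Hypothetical log lines.
printf '%s\n' \
    'rejected RCPT: sender IP 192.0.2.10 is listed in sbl.example.org (see URL)' \
    'warning: sender IP 198.51.100.7 found in xbl.example.org' \
    'accepted message from 203.0.113.5' | ip_and_dnsbl_sed
```

Unlike grep -o with several patterns, both fields from one input line stay together on one output line. The price is a regular expression that has to match the entire line, including the parts to be thrown away.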

(I once wrote a moderately complex and scary sed script mostly to get myself to use various midlevel sed features like the pattern space. It works and I still use it every day when reading my email, but it also convinced me that I didn't want to do that again and sed was a write-mostly language.)

In short, any time I want to extract a field from a line and awk won't do it (or at least not easily), I turn to 'egrep -o' as the generally quick and convenient option. All I have to do is write a regular expression that matches the field, or that matches somewhat more than the field so that I can then either extract the field with awk or narrow things down with more use of egrep -o.

PS: grep -o is probably originally a GNU-ism, but I think it's relatively widespread. It's in OpenBSD's grep, for example.

PPS: I usually think of this as an egrep feature, but it's in plain grep and even fgrep (and if I think about it, I can actually think of uses for it in fgrep). I just reflexively turn to egrep over grep if I'm doing anything complicated, and using -o definitely counts.
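One plausible fgrep use, offered as a guess at the sort of thing that might be meant: counting every occurrence of a literal string, including several on one line, without the string's regular expression metacharacters getting in the way. The count_literal name and the sample text are invented, and grep -F is the same thing as fgrep.

```shell
# count_literal: count occurrences of the fixed string $1 in stdin.
count_literal() {
    # -F treats the pattern as a literal string, so '.' only matches '.'.
    # -o prints one line per occurrence, which wc -l then counts.
    grep -Fo -e "$1" | wc -l
}

# 'axb' would match the regular expression a.b, but not the literal string.
printf '%s\n' 'a.b axb a.b' 'no match here' 'a.b' | count_literal 'a.b'
```

Plain 'grep -c' would only report the number of matching lines (two here), not the three actual occurrences.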

Sidebar: The Unix pipeline solution to the filename problem

In the spirit of the original spell implementation:

tr '/' '\n' <fullpaths | grep '^\.' | sort -u

All we want is path components, so the traditional Unix answer is to explode full paths into path components (done here with tr). Once they're in that form, we can apply the full power of normal Unix things to them.

programming/EgrepOFieldExtraction written at 23:29:47

Using Python 3 for example code here on Wandering Thoughts

When I write about Python here, I often wind up having some example Python code, such as the subCls example in my recent entry about subclassing a __slots__ class. Mostly this Python code has been Python 2 by default, with Python 3 as the exception. When I started writing, Python 3 wasn't even released; then it wasn't really something you wanted to use; and then I was grumpy about it so I deliberately continued to use Python 2 for examples here, just as I continued to write programs in it (for good reasons). Sometimes I explicitly mentioned that my examples were in Python 2, but sometimes not, and that too was a bit of my grumpiness in action.

(There was also the small fact that I'm far more familiar with Python 2 than Python 3, so writing Python 2 code is what happens if I don't actively think about it.)

However, things change. Over the past few years I've basically made my peace with Python 3 and these days I'm trying to write new code in Python 3. Although writing my example code here in Python 2 is close to being a reflex, it's one that I want to consciously break. Going forward from now, I'm going to write sample code in Python 3 by default and only use Python 2 if there is some special reason for it (and then mention explicitly that the example is Python 2 instead of 3). This is a small gesture, but I figure it's about time, and it's also probably what more and more readers are just going to expect.

(It looks like I've been doing this inconsistently for a while, or at least testing some of my examples in Python 3 too, and also increasingly linking to the Python 3 version of the Python documentation instead of the Python 2 version.)

Actually doing this is going to take me some work and attention. Since I write Python 2 code by reflex, I'm going to have to double-check my examples to make sure that they're valid Python 3 (and that they behave the same way in Python 3). Some of the time this will mean actually testing even small fragments instead of relying on my Python (2) knowledge to write from memory. Also, when I'm checking Python's behavior for something (or prototyping some code), I'll have to remember to run python3 instead of just python or I'll accidentally wind up testing the wrong Python.

(When I wrote my recent entry I was quietly careful to make the example code Python 3 code by including a super() and then using the no-argument version, which is Python 3 only.)

(I'm writing this entry partly to put a marker in the ground for myself, so that I won't be tempted to let a Python 2 example slide just because I'm feeling lazy and I don't want to work out and verify the Python 3 version.)

python/Python3ForExamples written at 02:24:08



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.