Wandering Thoughts archives

2018-02-28

egrep's -o argument is great for extracting unusual fields

In Unix, many files and other streams of text are nicely structured so you can extract bits of them with straightforward tools like awk. Fields are nicely separated by whitespace (or by some simple thing that you can easily match on), the information you want is only in a single field, and the field is at a known and generally fixed offset (either from the start of the line or the end of the line). However, not all text is like this. Sometimes it's because people have picked bad formats. Sometimes it's just because that's how the data comes to you; perhaps you have full file paths and you want to extract one component of the path that has some interesting characteristic, such as starting with a '.'.

For example, recently we wanted to know if people here stored IMAP mailboxes in or under directories whose name started with a dot, and if they did, what directory names they used. We had full paths from IMAP subscriptions, but we didn't care about the whole path, just the interesting directory names. Tools like awk are not a good match for this; even with 'awk -F/' we'd have to dig out the fields that start with a dot.
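
For contrast, here's a sketch of what the awk version would have to look like, looping over every field itself to find the ones that start with a dot:

awk -F/ '{ for (i = 1; i <= NF; i++) if ($i ~ /^\./) print $i }' <fullpaths | sort -u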

(There's a UNIX pipeline solution to this problem, of course.)

Fortunately, these days I have a good option for this, and that is (e)grep's -o argument. I learned about it several years ago due to a comment on this entry of mine, and since then it's become a tool that I reach for increasingly often. What -o does is described by the manpage this way (for GNU grep):

Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.

What this really means is 'extract and print regular-expression-based fields from the line'. The straightforward use is to extract a full field, for example:

egrep -o '(^|/)\.[^/]+' <filenames

This extracts just the directory name of interest (or I suppose the file name, if there is a file that starts with a dot). It also shows that we may need to post-process the result of an egrep -o field extraction; in this case, some of the names will have a '/' on the front and some won't, and we probably want to remove that /.
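
For example, one way to do that post-processing (a sketch; the sed expression just strips any leading '/'):

egrep -o '(^|/)\.[^/]+' <filenames | sed 's;^/;;' | sort -u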

Another trick with egrep -o is to use it to put fields into consistent places. Suppose that our email system's logs have a variety of messages that can be generated when a sending IP address is in a DNS blocklist. The full log lines vary but they all contain a portion that goes 'sender IP <IP> [stuff] in <DNSBL>'. We would like to extract the sender IP address and perhaps the DNSBL. Plain 'egrep -o' doesn't do this directly, but it will put the two fields we care about into consistent places:

egrep -o 'sender IP .* in .*(DNSBL1|DNSBL2)' <logfile |
   awk '{print $3, $(NF)}'

Another option for extracting fields from the middle of a large message is to use two or more egreps in a pipeline, with each successive egrep refining the text down to just the bits you're interested in. This is useful when the specific piece you're interested in occurs at some irregular position inside a longer portion that you need to use as the initial match.

(I'm not going to try to give an example here, as I don't have any from stuff I've done recently enough to remember.)

Since you can use grep with multiple patterns (by providing multiple -e arguments), you can use grep -o to extract several fields at once. However, the limitation of this is that each field comes out on its own line. There are situations where you'd like all fields from one line to come out on the same line; basically you want the original line with all of the extraneous bits removed (all of the bits except the fields you care about). If you're in this situation, you probably want to turn to sed instead.
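
To illustrate the limitation with the hypothetical DNSBL log lines from above, a multiple-pattern sketch puts the IP address and the DNSBL name on separate output lines:

egrep -o -e 'sender IP [0-9.]+' -e 'in (DNSBL1|DNSBL2)' <logfile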

In theory sed can do all of these things with the s substitution command, regular expression sub-expressions, and \<N> substitutions to give back just the sub-expressions. In practice I find egrep -o to almost always be the simpler way to go, partly because I can never remember the exact way to make sed regular expressions do sub-expression matching. Perhaps I would if I used sed more often, but I don't.
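
Here's a sketch of the sed version of the DNSBL extraction, using GNU sed's -E so the sub-expressions are written egrep-style; \1 and \2 give back the two sub-expressions on a single output line:

sed -nE 's/.*sender IP ([^ ]+) .* in .*(DNSBL1|DNSBL2).*/\1 \2/p' <logfile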

(I once wrote a moderately complex and scary sed script mostly to get myself to use various midlevel sed features like the pattern space. It works and I still use it every day when reading my email, but it also convinced me that I didn't want to do that again and sed was a write-mostly language.)

In short, any time I want to extract a field from a line and awk won't do it (or at least not easily), I turn to 'egrep -o' as the generally quick and convenient option. All I have to do is write a regular expression that matches the field, or enough more than the field that I can then either extract the field with awk or narrow things down with more use of egrep -o.

PS: grep -o is probably originally a GNU-ism, but I think it's relatively widespread. It's in OpenBSD's grep, for example.

PPS: I usually think of this as an egrep feature, but it's in plain grep and even fgrep (and if I think about it, I can actually think of uses for it in fgrep). I just reflexively turn to egrep over grep if I'm doing anything complicated, and using -o definitely counts.

Sidebar: The Unix pipeline solution to the filename problem

In the spirit of the original spell implementation:

tr '/' '\n' <fullpaths | grep '^\.' | sort -u

All we want is path components, so the traditional Unix answer is to explode full paths into path components (done here with tr). Once they're in that form, we can apply the full power of normal Unix things to them.

EgrepOFieldExtraction written at 23:29:47

2018-02-23

Github and publishing Git repositories

Recently I got into a discussion on Twitter where I mentioned that I'd like a simple way to publish Git repositories on my own web server. You might reasonably ask why I need such a thing, since Github exists and I even use it. For me, a significant part of the answer is social. To put it one way, Github has become a little bit too formal, or at least I perceive it as having done so.

What has done this to Github is that more and more, people will look at your Github presence and form judgements based on what they see. They will go through your list of repositories and form opinions, and then look inside some of the repositories and form more opinions. At least part of this exploration is natural and simply comes from stumbling over something interesting; more than once, I've wound up on someone's repository and wondered what else they work on and if there's anything interesting there. But a certain amount of it is the straightforward and logical consequence of the common view that Github is part of your developer resume. We curate our resumes, and if our Github presence is part of that, well, we're going to curate that too. A public portfolio of work always tries to put your best foot forward, and even if that's not necessarily my goal with my Github presence, I still know that that's how people may take it.

All of this makes me feel uncomfortable about throwing messy experiments and one-off hacks up on Github. If nothing else, they feel like clutter that gets in the way of people seeing (just) the repositories that I'm actively proud of, want to attract attention to, and think that people might find something useful in. Putting something up on Github just so people can get a copy of it feels not so much wrong as out of place; that's not what I use my Github presence for.

(A strongly related issue is the signals that I suspect your Github presence sends when you file issues in other people's Github repositories. Some of the time people are going to look at your profile, your activities, and your repositories to assess your clue level, especially if you're reporting something tangled and complex. If you want people to take your issues seriously, a presence that signals 'I probably know what I'm doing' is pretty useful.)

A separate set of Git repositories elsewhere, in a less formal space, avoids all of these issues. No one is going to mistake a set of repositories explicitly labeled 'random stuff I'm throwing up in case people want to look' for anything more than that, and to even find it in the first place they would have to go on a much more extensive hunt than it takes to get to my Github presence (which I do link in various places because, well, it's my Github presence, the official place where I publish various things).

Sidebar: What I want in a Git repository publishing program

The minimal thing I need is something you can do git clone and git pull from, because that is the very basic start of publishing a Git repository. What I'd like is something that also gave a decent looking web view as well, with a description and showing a README, so that people don't have to clone a repository just to poke around in it. Truly ideal would be also providing tarball or zip archive downloads. All of this should be read-only; accepting git push and other such operations is an anti-feature.

It would be ideal if the program ran as a CGI, because CGIs are easy to manage and I don't expect much load. I'll live with a daemon that runs via FastCGI, but it can't be its own web server unless it can work behind another web server via a reverse proxy, since I already have a perfectly good web server that is serving things I care a lot more about.

(Also, frankly I don't trust random web server implementations to do HTTPS correctly and securely, and HTTPS is no longer optional. Doing HTTPS well is so challenging that not all dedicated, full scale web servers manage it.)

It's possible that git http-backend actually does what I want here, if I can set it up appropriately. Alternately, maybe cgit is what I want. I'll have to do some experimentation.
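
If git http-backend does pan out, its manpage describes an Apache CGI setup roughly like this (a sketch; the paths vary by system and /var/www/git is a hypothetical repository root). By default it only enables fetching for anonymous clients, so without authentication set up it's effectively read-only:

# from the git-http-backend manpage; adjust paths for your system
SetEnv GIT_PROJECT_ROOT /var/www/git
SetEnv GIT_HTTP_EXPORT_ALL
ScriptAlias /git/ /usr/libexec/git-core/git-http-backend/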

GithubAndGitRepoPublishing written at 00:59:49

2018-02-21

Sorting out what exec does in Bourne shell pipelines

Today, I was revising a Bourne shell script. The original shell script ended by running rsync with an exec like this:

exec rsync ...

(I don't think the exec was there for any good reason; it's a reflex.)

I was adding some filtering of errors from rsync, so I fed its standard error to egrep, and in the process I removed the exec. The ending became:

rsync ... 2>&1 | egrep -v '^(...|...)'

Then I stopped to think about this, and realized that I was working on superstition. I 'knew' that combining exec with a pipeline didn't work, and in fact I had a memory that it caused things to malfunction. So I decided to investigate a bit to find out the truth.

To start with, let's talk about what we might think exec does here (and what I hoped it did when I started digging). Suppose that you end a shell script like this:

#!/bin/sh
[...]
rsync ... 2>&1 | egrep -v '...'

When you run this shell script, you'll wind up with a hierarchy of three processes; the shell is the parent process, and then generally the rsync and the egrep are siblings. Linux's pstree will show the shell with the rsync and the egrep as its two children, and my favorite tool shows it like so:

pts/10   |      17346 /bin/sh thescript
pts/10    |     17347 rsync ...
pts/10    |     17348 egrep ...

If exec worked here the way I was sort of hoping it would, you'd get two processes instead of three, with whatever you exec'd (either the rsync or the egrep) taking over from the parent shell process. Now that I think about it, there are some reasonably decent reasons to not do this, but let's set that aside for now.

What I had a vague superstition of exec doing in a pipeline was that it might abruptly truncate the pipeline. When it got to the exec, the shell would just do what you told it to, ie exec the process, and since it had turned itself into that process it wouldn't go on to set up the rest of the pipeline. That would make 'exec rsync ... | egrep' be the same as just 'exec rsync ...', with the egrep effectively ignored. Obviously you wouldn't want that, hence me automatically taking the exec out.

Fortunately this is not what happens. What actually does happen is not quite that the exec is ignored, although that's what it looks like in simple cases. To understand what's going on, I had to start by paying careful attention to how exec is described, for example in Dash's manpage:

Unless command is omitted, the shell process is replaced with the specified program [...]

The important bit is 'the shell process'; the magic trick is what that is in a pipeline. If we write:

exec rsync ... | egrep -v ...

When the shell gets to processing the exec, what it considers 'the shell process' is actually the subshell running one step of the pipeline, here the subshell that exists to run rsync. This subshell is normally invisible here because for simple commands like this, the (sub)shell will immediately exec() rsync anyway; using exec just instructs this subshell to do what it was already going to do.

We can cause the shell to actually materialize a subshell by putting multiple commands here:

(/bin/echo hi; sleep 120) | cat

If you look at the process tree for this, you'll probably get:

pts/9    |      7481 sh
pts/9     |     7806 sh
pts/9      |    7808 sleep 120
pts/9     |     7807 cat

The subshell making up the first step of the pipeline could end by just exec()ing sleep, but it doesn't (at least in Dash and Bash); once the shell has decided to have a real subshell here, it stays a real subshell.

If you use exec in the context of such an actual subshell, it will indeed replace 'the shell process' of the subshell with the command you exec:

$ (exec echo hi; echo ho) | cat
hi
$

The exec replaced the entire subshell with the first echo, and so it never went on to run the second echo.

(Effectively you've arranged for an early termination of the subshell. There are probably times when this is useful behavior as part of a pipeline step, but I think you can generally use exit and what you're actually doing will be clearer.)
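
For comparison, a quick sketch of the exit version; the output is the same, with the second echo never running:

$ (echo hi; exit 0; echo ho) | cat
hi
$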

(I'm sure that I once knew all of this, but it fell out of my mind until I carefully worked it out again just now. Perhaps this time around it will stick.)

Sidebar: some of this behavior can vary by shell

Let's go back to '(/bin/echo hi; sleep 120) | cat'. In Dash and Bash, the first step's subshell sticks around to be the parent process of sleep, as mentioned. Somewhat to my surprise, both the Fedora Linux version of official ksh93 and FreeBSD 10.4's sh do optimize away the subshell in this situation. They directly exec the sleep, as if you wrote:

(/bin/echo hi; exec sleep 120) | cat

There's probably a reason that Bash skips this little optimization.

BourneExecInPipeline written at 22:30:27

2018-02-05

I should remember that sometimes C is a perfectly good option

Recently I found myself needing a Linux command that reported how many CPUs are available for you to use. On Linux, the official way to do this is to call sched_getaffinity and count how many 1 bits are set in the CPU mask that you get back. My default tool for this sort of thing these days is Go and I found some convenient support for this (in the golang.org/x/sys/unix package), so I wrote the obvious Go program:

package main

import (
    "fmt"
    "os"
    "golang.org/x/sys/unix"
)

func main() {
    var cpuset unix.CPUSet
    err := unix.SchedGetaffinity(0, &cpuset)
    if err != nil {
        fmt.Printf("numcpus: cannot get affinity: %s\n", err)
        os.Exit(1)
    }
    fmt.Printf("%d\n", cpuset.Count())
}

This compiled, ran on most of our machines, and then reported an 'invalid argument' error on some of them. After staring at strace output for a while, I decided that I needed to write a C version of this so I understood exactly what it was doing and what I was seeing. I was expecting this to be annoying (because it would involve writing code to count bits), but it turns out that there's a set of macros for this so the code is just:

#define _GNU_SOURCE
#include    <sched.h>
#include    <unistd.h>
#include    <stdio.h>
#include    <stdlib.h>

#define MAXCPUS 0x400

int main(int argc, char **argv) {
    cpu_set_t *cpuset;
    cpuset = CPU_ALLOC(MAXCPUS);

    if (sched_getaffinity(0, CPU_ALLOC_SIZE(MAXCPUS), cpuset) < 0) {
        fprintf(stderr, "numcpus: sched_getaffinity: %m\n");
        exit(1);
    }
    printf("%d\n", CPU_COUNT(cpuset));
}

(I think I have an unnecessary include file in there but I don't care. I spray standard include files into my C programs until the compiler stops complaining. Also, I'm using a convenient glibc printf() extension since I'm writing for Linux.)

This compiled, worked, and demonstrated that what I was seeing was indeed a bug in the x/sys/unix package. I don't blame Go for this, by the way. Bugs can happen anywhere, and they're generally more likely to happen in my code than in library code (that's one reason I like to use library code whenever possible).

The Go version and the C version are roughly the same number of lines and wound up being roughly as complicated to write (although the C version fails to check for an out of memory condition that's extremely unlikely to ever happen).

The Go version builds to a 64-bit Linux binary that is 1.1 Mbytes on disk. The C version builds to a 64-bit Linux binary that is 5 Kbytes on disk.

(This is not particularly Go's fault, lest people think that I'm picking on it. The Go binary is statically linked, for example, while the C version is dynamically linked; statically linking the C version results in an 892 Kbyte binary. Of course, in practice it's a lot easier to dynamically link and run a program written in C than in anything else because glibc is so pervasive.)

When I started writing this entry, I was going to say that what I took from this is that sometimes C is the right answer. Perhaps it is, but that's too strong a conclusion for this example. Yes, the C version is the same size in source code and much smaller as a binary (and that large Go binary does sort of offend my old time Unix soul). But if the Go program had worked I wouldn't have cared enough about its size to write a C version, and if the CPU_SET macros didn't exist with exactly what I needed, the C version would certainly have been more annoying to write. And there is merit in focusing on a small set of tools that you like and know pretty well, even if they're not the ideal fit for every situation.

But still. There is merit in remembering that C exists and is perfectly useful, and that many things, especially low-level operating system things, are probably quite direct to do in C. I could probably write more C than I do, and sometimes it might be no more work than doing it in another language. And I'd get small binaries, which a part of me cares about.

(At the same time, these days I generally find C to be annoying. It forces me to care about things that I mostly don't want to care about any more, like memory handling and making sure that I'm not going to blow my foot off.)

PS: I'm a little bit surprised and depressed that the statically linked C program is so close to the Go program in size, because the Go program includes a lot of complex runtime support in that 1.1 Mbytes (including an entire garbage collector). The C program has no such excuses.

CSometimesGoodAnswer written at 23:34:16

