2018-02-28
egrep's -o
argument is great for extracting unusual fields
In Unix, many files and other streams of text are nicely structured
so you can extract bits of them with straightforward tools like
awk
. Fields are nicely separated by whitespace (or by some simple
thing that you can easily match on), the information you want is
only in a single field, and the field is at a known and generally
fixed offset (either from the start of the line or the end of the
line). However, not all text is like this. Sometimes it's because
people have picked bad formats. Sometimes
it's just because that's how the data comes to you; perhaps you
have full file paths and you want to extract one component of the
path that has some interesting characteristic, such as starting
with a '.'.
For example, recently we wanted to know if people here stored IMAP
mailboxes in or under directories whose name started with a dot,
and if they did, what directory names they used. We had full paths
from IMAP subscriptions, but
we didn't care about the whole path, just the interesting directory
names. Tools like awk
are not a good match for this; even with
'awk -F/
' we'd have to dig out the fields that start with a dot.
(There's a UNIX pipeline solution to this problem, of course.)
Fortunately, these days I have a good option for this, and that is
(e)grep's -o
argument. I learned about it several years ago due
to a comment on this entry of mine, and
since then it's become a tool that I reach for increasingly often.
What -o
does is described by the manpage this way (for GNU grep):
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
What this really means is 'extract and print regular expression based field(s) from the line'. The straightforward use is to extract a full field, for example:
egrep -o '(^|/)\.[^/]+' <filenames
This extracts just the directory name of interest (or I suppose the
file name, if there is a file that starts with a dot). It also shows
that we may need to post-process the result of an egrep -o
field
extraction; in this case, some of the names will have a '/' on the
front and some won't, and we probably want to remove that /.
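One way to do that post-processing is a quick sed on the end of the
pipeline. Here I've substituted some made-up sample paths for the
real input file, since the actual data isn't something I can show:

```shell
# Made-up sample paths standing in for the real input; the sed
# step strips the leading '/' that egrep -o leaves on matches
# that came from the middle of a path.
printf '/home/u/.mail/inbox\n.config/foo\n' |
  egrep -o '(^|/)\.[^/]+' | sed 's;^/;;' | sort -u
```

The `sort -u` on the end also collapses duplicate directory names,
which is usually what you want when surveying this sort of thing.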
Another trick with egrep -o
is to use it to put fields into
consistent places. Suppose that our email system's logs have a
variety of messages that can be generated when a sending IP address
is in a DNS blocklist. The full log lines vary but they all contain
a portion that goes 'sender IP <IP> [stuff] in <DNSBL>
'. We would
like to extract the sender IP address and perhaps the DNSBL. Plain
'egrep -o
' doesn't do this directly, but it will put the two fields
we care about into consistent places:
egrep -o 'sender IP .* in .*(DNSBL1|DNSBL2)' <logfile | awk '{print $3, $(NF)}'
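To see this in action, here is a made-up log line in roughly the
format described above (the exact wording around the fields is an
assumption, not our real log format):

```shell
# The egrep -o narrows the line down to a consistent chunk, so
# awk can then pick out the IP (field 3) and the DNSBL (last field).
echo 'reject: sender IP 192.0.2.10 found in DNSBL1 (listed)' |
  egrep -o 'sender IP .* in .*(DNSBL1|DNSBL2)' |
  awk '{print $3, $(NF)}'
```

This prints '192.0.2.10 DNSBL1' regardless of what came before or
after that chunk in the full log line.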
Another option for extracting fields from the middle of a large
message is to use two or more egrep
s in a pipeline, with each
egrep
successively refining the text down to just the bits you're
interested in. This is useful when the specific piece you're
interested in occurs at some irregular position inside a longer
portion that you need to use as the initial match.
(I'm not going to try to give an example here, as I don't have any from stuff I've done recently enough to remember.)
Since you can use grep
with multiple patterns (by providing
multiple -e
arguments), you can use grep -o
to extract several
fields at once. However, the limitation of this is that each field
comes out on its own line. There are situations where you'd like
all fields from one line to come out on the same line; basically
you want the original line with all of the extraneous bits removed
(all of the bits except the fields you care about). If you're in
this situation, you probably want to turn to sed
instead.
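To illustrate the limitation, here's a sketch using two -e patterns
against the same made-up log line as before:

```shell
# Each pattern's match comes out on its own line, in the order
# the matches occur in the input line, not together on one line.
echo 'sender IP 192.0.2.10 found in DNSBL1' |
  grep -o -e 'IP [0-9.]*' -e 'DNSBL[0-9]'
```

You get 'IP 192.0.2.10' and 'DNSBL1' as two separate output lines,
so reassembling per-line fields takes extra work.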
In theory sed
can do all of these things with the s
substitution
command, regular expression sub-expressions, and \<N>
substitutions
to give back just the sub-expressions. In practice I find egrep
-o
to almost always be the simpler way to go, partly because I can
never remember the exact way to make sed
regular expressions do
sub-expression matching. Perhaps I would if I used sed
more often,
but I don't.
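For the record, the sed version of the DNSBL example looks something
like this (again with a made-up log line):

```shell
# \(...\) marks the two sub-expressions and \1 and \2 give them
# back, so both fields come out together on one line.
echo 'reject: sender IP 192.0.2.10 found in DNSBL1 (listed)' |
  sed -n 's/.*sender IP \([^ ]*\).* in \(DNSBL[12]\).*/\1 \2/p'
```

This works, but I find the backslash-heavy syntax is exactly the
part I can never remember without looking it up.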
(I once wrote a moderately complex and scary sed
script mostly
to get myself to use various midlevel sed
features like the pattern
space. It works and I still use it every day when reading my email,
but it also convinced me that I didn't want to do that again and
sed
was a write-mostly language.)
In short, any time I want to extract a field from a line and awk
won't do it (or at least not easily), I turn to 'egrep -o
' as the
generally quick and convenient option. All I have to do is write a
regular expression that matches the field, or enough more than the
field that I can then either extract the field with awk
or narrow
things down with more use of egrep -o
.
PS: grep -o
is probably originally a GNU-ism, but I think it's
relatively widespread. It's in OpenBSD's grep
, for example.
PPS: I usually think of this as an egrep
feature, but it's in
plain grep
and even fgrep
(and if I think about it, I can
actually think of uses for it in fgrep
). I just reflexively
turn to egrep
over grep
if I'm doing anything complicated,
and using -o
definitely counts.
Sidebar: The Unix pipeline solution to the filename problem
In the spirit of the original spell
implementation:
tr '/' '\n' <fullpaths | grep '^\.' | sort -u
All we want is path components, so the traditional Unix answer is
to explode full paths into path components (done here with tr
).
Once they're in that form, we can apply the full power of normal
Unix things to them.
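With some made-up full paths, the pipeline works out like this:

```shell
# tr explodes each path into one component per line; grep then
# keeps just the components that start with a dot.
printf '/home/a/.mail/inbox\n/home/b/mail/.news/x\n' |
  tr '/' '\n' | grep '^\.' | sort -u
```

The output is just '.mail' and '.news', one per line, which is
exactly the list of dotted directory names we were after.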
2018-02-23
Github and publishing Git repositories
Recently I got into a discussion on Twitter where I mentioned that I'd like a simple way to publish Git repositories on my own web server. You might reasonably ask why I need such a thing, since Github exists and I even use it. For me, a significant part of the answer is social. To put it one way, Github has become a little bit too formal, or at least I perceive it as having done so.
What has done this to Github is that more and more, people will look at your Github presence and form judgements based on what they see. They will go through your list of repositories and form opinions, and then look inside some of the repositories and form more opinions. At least part of this exploration is natural and simply comes from stumbling over something interesting; more than once, I've wound up on someone's repository and wondered what else they work on and if there's anything interesting there. But a certain amount of it is the straightforward and logical consequence of the common view that Github is part of your developer resume. We curate our resumes, and if our Github presence is part of that, well, we're going to curate that too. A public portfolio of work always tries to put your best foot forward, and even if that's not necessarily my goal with my Github presence, I still know that that's how people may take it.
All of this makes me feel uncomfortable about throwing messy experiments and one-off hacks up on Github. If nothing else, they feel like clutter that gets in the way of people seeing (just) the repositories that I'm actively proud of, want to attract attention to, and think that people might find something useful in. Putting something up on Github just so people can get a copy of it feels not so much wrong as out of place; that's not what I use my Github presence for.
(A strongly related issue is the signals that I suspect your Github presence sends when you file issues in other people's Github repositories. Some of the time people are going to look at your profile, your activities, and your repositories to assess your clue level, especially if you're reporting something tangled and complex. If you want people to take your issues seriously, a presence that signals 'I probably know what I'm doing' is pretty useful.)
A separate set of Git repositories elsewhere, in a less formal space, avoids all of these issues. No one is going to mistake a set of repositories explicitly labeled 'random stuff I'm throwing up in case people want to look' for anything more than that, and to even find it in the first place they would have to go on a much more extensive hunt than it takes to get to my Github presence (which I do link in various places because, well, it's my Github presence, the official place where I publish various things).
Sidebar: What I want in a Git repository publishing program
The minimal thing I need is something you can do git clone
and
git pull
from, because that is the very basic start of publishing
a Git repository. What I'd like is something that also gave a decent
looking web view as well, with a description and showing a README
,
so that people don't have to clone a repository just to poke around
in it. Truly ideal would be also providing tarball or zip archive
downloads. All of this should be read-only; accepting git push
and other such operations is an anti-feature.
It would be ideal if the program ran as a CGI, because CGIs are easy to manage and I don't expect much load. I'll live with a daemon that runs via FastCGI, but it can't be its own web server unless it can work behind another web server via a reverse proxy, since I already have a perfectly good web server that is serving things I care a lot more about.
(Also, frankly I don't trust random web server implementations to do HTTPS correctly and securely, and HTTPS is no longer optional. Doing HTTPS well is so challenging that not all dedicated, full scale web servers manage it.)
It's possible that git http-backend
actually does what I
want here, if I can set it up appropriately. Alternately, maybe
cgit is what I want. I'll
have to do some experimentation.
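For reference, the git-http-backend manpage sketches an Apache CGI
setup along these lines (the paths here are illustrative assumptions,
not an actual configuration of mine):

```
# Adapted from the git-http-backend(1) manpage's Apache example;
# GIT_PROJECT_ROOT points at the directory holding the repositories,
# and GIT_HTTP_EXPORT_ALL exports them all without needing
# git-daemon-export-ok files.
SetEnv GIT_PROJECT_ROOT /srv/git
SetEnv GIT_HTTP_EXPORT_ALL
ScriptAlias /git/ /usr/lib/git-core/git-http-backend/
```

As I understand it, git http-backend refuses anonymous pushes unless
you explicitly enable them, so this would be read-only by default,
which is what I want. It doesn't give me the web view, though.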
2018-02-21
Sorting out what exec
does in Bourne shell pipelines
Today, I was revising a Bourne shell script. The original shell script
ended by running rsync
with an exec
like this:
exec rsync ...
(I don't think the exec
was there for any good reason; it's a
reflex.)
I was adding some filtering of errors from rsync
, so I fed its
standard error to egrep
and in the process I removed the exec
,
so it became:
rsync ... 2>&1 | egrep -v '^(...|...)'
Then I stopped to think about this, and realized that I was working
on superstition. I 'knew' that
combining exec
and anything else didn't work, and in fact I had
a memory that it caused things to malfunction. So I decided to
investigate a bit to find out the truth.
To start with, let's talk about what we could think that exec
did
here (and what I hoped it did when I started digging). Suppose that
you end a shell script like this:
#!/bin/sh
[...]
rsync ... 2>&1 | egrep -v '...'
When you run this shell script, you'll wind up with a hierarchy of
three processes; the shell is the parent process, and then generally
the rsync
and the egrep
are siblings. Linux's pstree
will represent this as:

sh─┬─egrep
   └─rsync

and my favorite tool shows it like so:

pts/10 |  17346 /bin/sh thescript
pts/10 |  17347 rsync ...
pts/10 |  17348 egrep ...
If exec
worked here the way I was sort of hoping it would, you'd
get two processes instead of three, with whatever you exec
'd
(either the rsync
or the egrep
) taking over from the parent
shell process. Now that I think about it, there are some reasonably
decent reasons to not do this, but let's set that aside for now.
What I had a vague superstition of exec
doing in a pipeline was
that it might abruptly truncate the pipeline. When it got to the
exec, the shell just did what you told it to, i.e. exec
the process,
and since it had turned itself into a process it didn't go on to
set up the rest of the pipeline. That would make 'exec rsync
... | egrep
' be the same as just 'exec rsync ...
', with the
egrep
effectively ignored. Obviously you wouldn't want that,
hence me automatically taking the exec
out.
Fortunately this is not what happens. What actually does happen is
not quite that the exec
is ignored, although that's what it looks
like in simple cases. To understand what's going on, I had to start
by paying careful attention to how exec
is described, for example
in Dash's manpage:
Unless command is omitted, *the shell process* is replaced with the specified program [...]
I have emphasized the important bit. The magic trick is what 'the shell process' is in a pipeline. If we write:
exec rsync ... | egrep -v ...
When the shell gets to processing the exec
, what it considers
'the shell process' is actually the subshell running one step of
the pipeline, here the subshell that exists to run rsync
. This
subshell is normally invisible here because for simple commands
like this, the (sub)shell will immediately exec()
rsync
anyway;
using exec
just instructs this subshell to do what it was already
going to do.
We can cause the shell to actually materialize a subshell by putting multiple commands here:
(/bin/echo hi; sleep 120) | cat
If you look at the process tree for this, you'll probably get:
pts/9 |   7481 sh
pts/9 |   7806 sh
pts/9 |   7808 sleep 120
pts/9 |   7807 cat
The subshell making up the first step of the pipeline could end by
just exec()
ing sleep
, but it doesn't (at least in Dash and
Bash); once the shell has decided to have a real subshell here, it
stays a real subshell.
If you use exec
in the context of such an actual subshell, it
will indeed replace 'the shell process' of the subshell with the
command you exec
:
$ (exec echo hi; echo ho) | cat
hi
$
The exec
replaced the entire subshell with the first echo
, and
so it never went on to run the second echo
.
(Effectively you've arranged for an early termination of the subshell.
There are probably times when this is useful behavior as part of a
pipeline step, but I think you can generally use exit
and what you're
actually doing will be clearer.)
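The exit version of the same demonstration:

```shell
# exit ends the subshell after the first echo, just as the exec
# version does, so only 'hi' is printed.
( echo hi; exit 0; echo ho ) | cat
```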
(I'm sure that I once knew all of this, but it fell out of my mind until I carefully worked it out again just now. Perhaps this time around it will stick.)
Sidebar: some of this behavior can vary by shell
Let's go back to '(/bin/echo hi; sleep 120) | cat
'. In Dash
and Bash, the first step's subshell sticks around to be the parent
process of sleep
, as mentioned. Somewhat to my surprise, both the
Fedora Linux version of official ksh93
and FreeBSD 10.4's sh
do
optimize away the subshell in this situation. They directly exec
the sleep
, as if you wrote:
(/bin/echo hi; exec sleep 120) | cat
There's probably a reason that Bash skips this little optimization.
2018-02-05
I should remember that sometimes C is a perfectly good option
Recently I found myself needing a Linux command that reported how
many CPUs are available for you to use. On Linux, the official way
to do this is to call sched_getaffinity
and
count how many 1 bits are set in the CPU mask that you get back.
My default tool for this sort of thing these days is Go and I found
some convenient support for this (in the golang.org/x/sys/unix
package), so I wrote the
obvious Go program:
package main
import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	var cpuset unix.CPUSet
	err := unix.SchedGetaffinity(0, &cpuset)
	if err != nil {
		fmt.Printf("numcpus: cannot get affinity: %s\n", err)
		os.Exit(1)
	}
	fmt.Printf("%d\n", cpuset.Count())
}
This compiled, ran on most of our machines, and then reported an
'invalid argument' error on some of them. After staring at strace
output for a while, I decided that I needed to write a C version
of this so I understood exactly what it was doing and what I was
seeing. I was expecting this to be annoying (because it would involve
writing code to count bits), but it turns out that there's a set
of macros for this
so the code is just:
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

#define MAXCPUS 0x400

int main(int argc, char **argv)
{
	cpu_set_t *cpuset;

	cpuset = CPU_ALLOC(MAXCPUS);
	if (sched_getaffinity(0, CPU_ALLOC_SIZE(MAXCPUS), cpuset) < 0) {
		fprintf(stderr, "numcpus: sched_getaffinity: %m\n");
		exit(1);
	}
	printf("%d\n", CPU_COUNT(cpuset));
}
(I think I have an unnecessary include file in there but I don't
care. I spray standard include files into my C programs until the
compiler stops complaining. Also, I'm using a convenient glibc
printf()
extension since I'm writing for Linux.)
This compiled, worked, and demonstrated that what I was seeing was indeed a bug in the x/sys/unix package. I don't blame Go for this, by the way. Bugs can happen anywhere, and they're generally more likely to happen in my code than in library code (that's one reason I like to use library code whenever possible).
The Go version and the C version are roughly the same number of lines and wound up being roughly as complicated to write (although the C version fails to check for an out of memory condition that's extremely unlikely to ever happen).
The Go version builds to a 64-bit Linux binary that is 1.1 Mbytes on disk. The C version builds to a 64-bit Linux binary that is 5 Kbytes on disk.
(This is not particularly Go's fault, lest people think that I'm picking on it. The Go binary is statically linked, for example, while the C version is dynamically linked; statically linking the C version results in an 892 Kbyte binary. Of course, in practice it's a lot easier to dynamically link and run a program written in C than in anything else because glibc is so pervasive.)
When I started writing this entry, I was going to say that what I
took from this is that sometimes C is the right answer. Perhaps it
is, but that's too strong a conclusion for this example. Yes, the
C version is the same size in source code and much smaller as a
binary (and that large Go binary does sort of offend my old time
Unix soul). But if the Go program had worked I wouldn't have cared
enough about its size to write a C version, and if the CPU_SET
macros didn't exist with exactly what I needed, the C version would
certainly have been more annoying to write. And there is merit in
focusing on a small set of tools that you like and know pretty well,
even if they're not the ideal fit for every situation.
But still. There is merit in remembering that C exists and is perfectly useful and many things, especially low level operating system things, are probably quite direct to do in C. I could probably write more C than I do, and sometimes it might be no more work than doing it in another language. And I'd get small binaries, which a part of me cares about.
(At the same time, these days I generally find C to be annoying. It forces me to care about things that I mostly don't want to care about any more, like memory handling and making sure that I'm not going to blow my foot off.)
PS: I'm a little bit surprised and depressed that the statically linked C program is so close to the Go program in size, because the Go program includes a lot of complex runtime support in that 1.1 Mbytes (including an entire garbage collector). The C program has no such excuses.