2012-03-16
Parsing versus rewriting: how to tell them apart
In light of my entry on wikitext transitions, you might ask how you tell when you have a parser instead of a rewriting engine. What is the line? It's not as simple as using or not using regular expressions; there are several wikitext parsers that are strongly based on regular expressions, some of them quite cleverly so.
As it happens, I think there is a simple, easy-to-make distinction: a parser has separate input and output text spaces, at least conceptually, while a rewriter has only a single space that is both input and output. A rewriter transforms in place; a parser moves things from input to output (or more exactly, consumes input and creates output, even if the output is a copy of the input).
(Note that a parser may have multiple levels of examination and the input and output spaces do not necessarily have to be in separate buffers. You could imagine a parsing engine that rewrote things 'in place', with a current position pointer that it always moved forward.)
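To make the distinction concrete, here is a toy awk sketch (mine, purely illustrative, not taken from any real wiki engine). The 'rewriter' version makes two substitution passes over the same buffer, so the second pass happily rewrites text that the first pass produced; the 'parser-ish' version consumes the input left to right and appends replacement text to a separate output string that is never re-examined:

# rewriter style: both gsub()s work on the same text space ($0)
echo 'a foo b' | awk '{ gsub(/foo/, "foofoo"); gsub(/foo/, "X"); print }'
# prints 'a XX b': the second pass matched text the first pass produced

# parser-ish style: a position that only moves forward, output kept separate
echo 'a foo b' | awk '{
    out = ""
    while (match($0, /foo/)) {
        out = out substr($0, 1, RSTART - 1) "foofoo"
        $0 = substr($0, RSTART + RLENGTH)
    }
    print out $0
}'
# prints 'a foofoo b': the replacement text was never rescanned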
This leads to a feverish thought experiment: can you somehow transform a regular expression based rewriter into a parser by making it output all of its transformed text into a new area, instead of replacing the text in place? After all, regexp replacement generally does in-place rewriting because that's the default and easiest way to operate.
I don't think such a rewrite would be easy; you would probably have to have multiple levels of regular expression rewrites, for example, and there are a bunch of tricky issues that in-place rewrites let you ignore (for example, in-place rewrites automatically handle all of the boring text that doesn't need to have anything done to it). But this idea might give some sort of structure for an attack on a particular rewrite engine.
(It also makes me wonder what a regular expression substitution engine that was not built around in-place rewrites would look like. Rob Pike's sam text editor had some support for something that was kind of like this, for doing multiple and potentially conflicting substitutions to the same block of text at once, but I never got it to work particularly well.)
2012-03-09
Why you do not want to patch your source code in place
Protip: if your software's build process involves patching the source in place, sooner or later a sysadmin will hunt you down.
Perhaps it is not obvious why this is a bad idea, so let me go on at more length than a tweet allows.
The problem with patching in place is not so much that patching your source as part of building it is a bad idea in general (sometimes it's necessary), it is that patching in place makes it really hard to clean up after yourself if something goes wrong, especially if the patch is only partially applied. When you patch in place, 'make distclean' or the equivalent needs to revert the patch (and revert the right patch) and make sure it removes any lingering artifacts. Doing this reliably when something has gone wrong with the patching process is somewhere between challenging and almost impossible, and when your 'make distclean' fails all that people can do is delete the entire directory tree and re-extract it from your distribution tarball or VCS repo. This is very annoying, to put it one way.
(That patching in place creates spurious changes in VCS status output is another sign that it is a bad idea. In a DVCS this makes it very easy for people to commit or stash changes that they should not and thus to get their copy of the repo into a completely snarled state.)
When you do not patch in place, cleanup is simple and reliable: delete the build directory that holds the patched source. You do not have to worry about partially applied patches or keep careful track of new files; everything just vanishes. Sysadmins never have to 'clean up' by deleting everything and restarting from scratch. And everything stays clean and clear for people working from your version control system (for example, those who are trying out your latest development version).
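As a minimal sketch of the alternative (all of the directory and file names here are made up for illustration), the build step copies the pristine source into a scratch directory and patches only the copy:

rm -rf build
cp -a src build                # work on a throwaway copy, never on src itself
patch -d build -p1 < patches/local-fixes.patch
(cd build && make)
# cleanup, however badly the patch step went, is just: rm -rf build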
This entry has been brought to you by someone who had to type 'svn revert -R .' a few too many times today.
(PS: once is too many times.)
Sidebar: why what RPM and Debian's package builder do is okay
Superficially, it looks like RPM (and, I believe, Debian's package builder) patches the source code of the packages it's building in place in just this way, in that (eg) RPM directly applies whatever patches are specified on top of the source tree it extracted from your distribution tarball. The difference is that for RPM, the entire extracted source tree is simply a giant temporary build tree (and all of it will be removed to clean things up). The 'source' for RPM packages is the original distribution tarball and the patch files, and these are never touched by the build process.
2012-03-01
A trick for dealing with irregular multi-word lines in shell scripts
Suppose that you have a bunch of lines in what I've sort of described as a 'key=value' format that look like this:
<timestamp> key1=value1 key2=value2 key3=value3 ...
Also, let's suppose that the fields and their ordering aren't constant; for example, some lines omit key2 and its value. If it weren't for this inconsistency, there would be lots of Unix tools you could use; with this inconsistency, I can't think of a Unix program that naturally deals with this format (one where you can say 'give me key1 and key7' in the same easy way you can get field 1 and field 7 in awk).
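For contrast, if the fields were always in a fixed order this would be trivial; assuming key2 always turned up as, say, the third field (the file name here is made up), plain awk would do:

awk '{ print $3 }' logfile | sed 's/^key2=//'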
Fortunately, Unix gives us some brute force tricks.
Selecting lines based on field contents is pretty easy:
grep ' key1=[^ ]*example'
(The space before the key name may not be necessary depending on what key names your file uses.)
I don't have any clever tricks if you want to aggregate or otherwise process several fields, but if you just want to pull out and analyze one field there is a brute force trick that you can often use. Let me show you a full command example:
egrep ' p=(1|0.9)' | tr ' ' '\012' | grep '^f=' | sed 's/.*@//' | howmany | sed 20q
The important trick is the tr combined with the grep. The tr breaks each log file line apart so that each 'key=value' pair is on its own line (by turning the spaces that separate fields into newlines). Once each key=value pair is on a separate line, we can select just the field we want and process it. Meanwhile the initial egrep is selecting which whole lines we want to work on before the tr slices everything apart.
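A generic version of this pattern using only standard tools might look like the following (the field name and file name are made up; the tail of the pipeline just counts the values and shows the ten most common ones):

tr ' ' '\012' < logfile | grep '^key1=' | sed 's/^key1=//' | sort | uniq -c | sort -rn | sed 10q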
Of course, you don't necessarily need the lines to be in 'key=value' format. A variant of this 'split words into separate lines' trick can be done to any file format where you can somehow match the individual 'words' that you want to further process. And you don't have to split on spaces; any distinguishing character will do.
(If the field separator is several characters you can split things with sed. I used tr here because it's simpler for single-character splitting.)
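For example, GNU sed accepts \n in the replacement text, so splitting on a two-character separator such as '; ' could look like this (again with a made-up field name):

sed 's/; /\n/g' logfile | grep '^key1='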
I call this brute force because we're not doing anything particularly clever to extract just the words we care about from inside each line. Instead we're slicing up everything and then throwing most of the pieces away.