A gotcha with command order in pipes

October 11, 2007

Here's a mistake I've made more than once:

tail -f active-log | grep big-filter | grep -v one-thing

(Here the big-filter is something that selects only a small amount of actual interesting logfile output, and then you want to throw away a small bit more. For example, my most recent version of this was monitoring the exim log on our new mail system for failures and other anomalies, and then wanting to throw away one particular known failure.)

Why is this a mistake? Because you probably want to get interesting output as fast as possible, and this order doesn't do that; instead you get periodic large spurts of (delayed) output.

This happens because the big grep filter only produces output periodically, and that happens because feeding grep's output into a pipe makes it block-buffered instead of line-buffered. So instead of immediately writing out each line that makes it past the filter, grep sits around until it has accumulated a full buffer's worth of them, which can take a while when the filter only passes a trickle.
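
To make the effect concrete, here's a small illustrative demo; the slow shell loop just stands in for a tail -f on a quiet log:

while sleep 1; do echo "hit $(date)"; done | grep hit
# one line appears per second: grep's stdout is your terminal,
# so it's line-buffered

while sleep 1; do echo "hit $(date)"; done | grep hit | cat
# nothing for quite a while, then a burst: grep's stdout is now a pipe,
# so it accumulates a full buffer (several kilobytes) before writing anything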

(For GNU grep specifically you can use the --line-buffered option, but not all filtering tools have an equivalent one.)
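
As a sketch, with GNU grep you could keep the original order responsive (same placeholder names as before):

tail -f active-log | grep --line-buffered big-filter | grep -v one-thing

Here the first grep flushes each matching line into the pipe right away, and the second grep is writing to your terminal, so it stays line-buffered.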

The golden rule of pipelines like this is put the small volume reductions first and the big volume reductions last. This keeps as much volume as possible moving through each stage of the pipeline, so each stage flushes its output buffers as fast as possible. So in this case the correct order is:

tail -f active-log | grep -v one-thing | grep big-filter

The tail -f here will produce enough volume that the first grep is constantly sending output to the big filter at the end, and the big filter's output is going to your terminal, so it stays line-buffered.

(You do not need to worry about tail -f's buffering; tail always writes anything new it finds, even if it is going to a pipe. Or at least sane versions of tail do, including the GNU one.)
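
This is easy enough to check for yourself; purely as an illustration:

tail -f active-log | cat
# new lines still show up immediately, because tail writes out each chunk
# of new data as it reads it instead of letting stdio buffer it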


Comments on this page:

By Dan.Astoorian at 2007-10-12 11:51:58:

The golden rule of pipelines like this is put the small volume reductions first and the big volume reductions last.

That should be "put the big volume reduction"--singular--"last." Since every command except the last one is sending its output to a pipe, and therefore block-buffering its output, the input to the last command will be the same regardless of the order of the other commands in the pipeline.

Furthermore, I would say this is, in a broader sense, the exception rather than the rule: it doesn't generalize to other kinds of processing. Case in point: I once encountered a horrible reporting script which did (essentially):

   zcat /var/log/daemon.*.gz /var/log/daemon | sort | grep 'keyword'

with multimegabyte logs, where the keyword had perhaps a few hundred occurrences. The reason I was asked to look at the script was because it was intermittently failing: the logs were large enough that the sort would periodically fail by running out of file descriptors for its temporary files. Putting the grep before the sort sped up the script at least 5x and eliminated the intermittent failures. (Then I rewrote the script. The fact that it was trying to sort the contents of a set of consecutive log files, by date lexicographically, gives one an indication of the scarcity of common sense possessed by the original author.)
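
In other words, the faster ordering was roughly:

   zcat /var/log/daemon.*.gz /var/log/daemon | grep 'keyword' | sort

so that sort only had to handle the few hundred matching lines instead of the full multimegabyte logs.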

--Dan

From 66.243.153.70 at 2007-10-15 11:42:43:

Another tool for the toolkit is "unbuffer" from the expect distribution, or Dan Bernstein's "pty" program. They run a program with I/O connected to a pty, so the normal stdio library switches to line-buffered mode.

In this case you want to do as much data reduction as possible early in the pipeline, to improve overall efficiency.

tail -f file | unbuffer grep big_reduction | grep -v small_reduction

Icarus Sparry

By cks at 2007-10-15 19:39:14:

Furthermore, I would say this is, in a broader sense, the exception rather than the rule: [...]

Yes, of course; hence the 'like this' in my original sentence. Bulk data analysis is much different from trying to interactively follow your logs looking for interesting stuff.

From 203.144.24.214 at 2008-02-10 02:46:46:

I just had a similar issue to this myself recently. I needed to pipe the output of one script into another with line buffering as the final results are monitored live. "sed --unbuffered" and "grep --line-buffered" proved useful but I couldn't get other processes in the first script to flush their output when piped. Everything looked fine when script1 outputted to the pty.

Unfortunately this project had to run on Windows and I couldn't get "unbuffer" or "pty" compiled on Cygwin. Luckily I remembered socat and all was good. With a command line like "socat exec:script1,pty - | script2", I was able to get the desired results.

 - Jason