Unix's pipeline problem (okay, its problem with file redirection too)

May 6, 2015

In a comment on yesterday's entry, Mihai Cilidariu sensibly suggested that I not add timestamp support to my tools but instead outsource it to a separate program in a pipeline. In the process I would get general support for timestamps and complete flexibility in their format. This is clearly the right Unix way to do it.

Unfortunately it's not a good way in practice, because of a fundamental pragmatic problem Unix has with pipelines. This is our old friend block buffering versus line buffering. A long time ago, Unix decided that many commands should change their behavior in the name of efficiency: if they wrote lines of output to a terminal you'd get each line as it was written, but if they wrote lines to anything else you'd only get them in blocks.
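
To make this concrete, here is a minimal sketch of the difference (in Perl, purely because that's handy; the same thing happens in any language that uses stdio-style buffering). Run it on a terminal and the lines trickle out one per second; pipe it through cat or redirect it to a file and they all appear in one lump at the end.

#!/usr/bin/perl
# Print one line a second. To a terminal this is line buffered, so each
# line shows up as it is printed; to a pipe or a file it is block
# buffered, so nothing shows up until the program exits (or the buffer
# fills).
use strict;
use warnings;

for my $i (1 .. 5) {
    print "line $i\n";
    sleep 1;
}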

This is a big problem here because a pipeline like 'monitor | timestamp' obviously requires the monitor process to produce output a line at a time in order to be useful; otherwise you'd get large blocks of lines that all have the same timestamp because they were written to the timestamp process in one block. The sudden conversion from line buffering to block buffering can also affect other sorts of pipeline usage.
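
As an illustration, here is a minimal sketch of what such a 'timestamp' filter might look like (the name and the timestamp format are just my assumptions, not a reference to some existing tool). It can only stamp lines with the time they reached it, which is only meaningful if the program in front of it hands lines over as they're generated.

#!/usr/bin/perl
# A sketch of a 'timestamp' pipeline filter: prefix each line of input
# with the time we read it. If the program upstream block buffers its
# output, whole chunks of lines arrive at once and get the same stamp.
use strict;
use warnings;
use POSIX qw(strftime);

$| = 1;    # flush our own output a line at a time too
while (my $line = <STDIN>) {
    print strftime("%Y-%m-%d %H:%M:%S", localtime), " ", $line;
}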

It's certainly possible to create programs that don't have this problem, ones that always write a line at a time (or explicitly flush after every block of lines in a single report). But this is not the default, which means that if you write a program without thinking about the issue (or without being aware of it at all), you wind up with a program that has this problem. In turn, people like me can't assume that a random program we want to add timestamps to will do the right thing in a pipeline (or keep doing it).
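
For what it's worth, the fix on the program's side is small once you know to do it. Here is a sketch of the idiom, in the shape of a toy monitor that reports something every few seconds and stays useful in a pipeline; the interesting part is the autoflush call.

#!/usr/bin/perl
# A toy 'monitor' that explicitly opts out of block buffering, so its
# output comes out a line at a time even when piped somewhere.
use strict;
use warnings;
use IO::Handle;

# Flush after every print regardless of whether stdout is a terminal.
# The old-school spelling is $| = 1; the C equivalent is setvbuf() or
# an explicit fflush() after each report.
STDOUT->autoflush(1);

while (1) {
    print scalar(localtime), ": still here\n";
    sleep 5;
}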

(Sometimes the buffering can be an accidental property of how a program was implemented. If you first write a simple shell script that runs external commands and then rewrite it as a much better and more efficient Perl script, well, you've probably just added block buffering without realizing it.)

In the end, what all of this does is quietly chip away at the Unix ideal that you can do everything with pipelines and that pipelining is the right way to do lots of stuff. Instead, pipelining becomes mostly something you do for bulk processing. If you use pipelines outside of bulk processing, sometimes it works, sometimes you need to remember odd workarounds so that it's mostly okay, and sometimes it doesn't do what you want at all. And unless you know Unix programming, why things fail is pretty opaque (which doesn't encourage you to try doing things via pipelines).

(This is equally a potential problem with redirecting program output to files, but it usually hits most acutely with pipelines.)


Comments on this page:

By Albert at 2015-05-06 10:34:56:

GNU coreutils has a tool called "stdbuf" which takes care of this very use case. Also expect comes with a script called "unbuffer" which does mostly the same thing.

By mtk@acm.org at 2015-05-06 10:45:35:

it is possible to write an expect script to run a program as a stage in a pipeline but make it think that stdout & stderr are bound to a tty (pseudo-tty). e.g. i had this lying around:

#!/usr/bin/perl
use strict;
use Expect;

die "usage: $0 cmd\n" unless @ARGV;
# don't buffer our own copy of the output either
STDOUT->autoflush (1);
# run the command under a pseudo-tty so it thinks it's writing to a
# terminal and keeps line buffering its output
my $p = Expect->spawn (@ARGV)
  or die "$0: failed to spawn '@ARGV': $!\n";
# wait for EOF; Expect echoes the command's output to stdout as it arrives
$p->expect (undef);

By Bob at 2015-05-10 14:39:10:

You can use stdbuf which will change the buffering semantics (this exists on non-coreutils platforms too)
