2017-06-17
One reason you have a mysterious Unix file called 2 (or 1)
Suppose, one day, that you look at the ls of some directory and you notice that you have an odd file called '2' (just the digit). If you look at the contents of this file, it probably has nothing that's particularly odd looking; in fact, it likely looks like plausible output from a command you might have run.
Congratulations, you've almost certainly fallen victim to a simple typo, one that's easy to make in interactive shell usage and in Bourne shell scripts. Here it is:
    echo hi >&2
    echo oop >2
The equivalent typo to create a file called 1 is very similar:
    might-err 2>&1 | less
    might-oop 2>1 | less
(The 1 files created this way are often empty, although not always, since many commands rarely produce anything on standard error.)
In each case, accidentally omitting the '&' in the redirection converts it from redirecting one file descriptor to another (for instance, forcing echo to report something to standard error) into a plain redirect-to-file redirection where the name of the file is your target file descriptor number.
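The difference is easy to see in a scratch directory. What follows is a minimal sketch, assuming a POSIX-style sh and the common (but not POSIX-mandated) 'mktemp -d'; the directory variable and the message text are made up for illustration.

```shell
#!/bin/sh
# Work in a scratch directory so the stray file is easy to spot and delete.
# (demo_dir is an illustrative name; mktemp -d is assumed to be available.)
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Correct: '>&2' points standard output at whatever fd 2 already is,
# so this message goes to standard error and no file is created.
echo "to stderr" >&2

# Typo: without the '&', the shell opens (and truncates) a file named '2'.
echo "to a file" >2

ls      # the only entry is the file called 2
cat 2   # the file holds the text that was meant for standard error
```

Because '>2' truncates the file each time, only the most recent mistaken message survives in it.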
Some of the time you'll notice the problem right away because you don't get output that you expect, but in other cases you may not notice for some time (or ever notice, if this was an interactive command and you just moved on after looking at the output as it was). Probably the easiest version of this typo to miss is in error messages in shell scripts:
    if [ ! -f "$SOMETHING" ]; then
      echo "$0: missing file $SOMETHING" 1>2
      echo "$0: aborting" 1>&2
      exit 1
    fi
You may never run the script in a way that triggers this error condition, and even if you do you may not realize (or remember) that you're supposed to get two error messages, not just the 'aborting' one.
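One possible way to reduce the number of places this typo can hide is to write the redirection exactly once, in a small helper function. This is offered as an assumption about what might help, not as something the scripts in question did. The names 'warn' and 'check_file' are hypothetical, as is the default path given to SOMETHING so that the sketch runs on its own.

```shell
#!/bin/sh
# warn is a hypothetical helper name, not something from the original scripts.
# It is the only place that has to get the '1>&2' redirection right.
warn() {
    printf '%s\n' "$*" 1>&2
}

# SOMETHING mirrors the variable in the example above; the default path here
# is made up so the sketch can run on its own.
SOMETHING=${SOMETHING:-/nonexistent/example-file}

# check_file reports both messages on standard error and returns non-zero
# if the file is missing. A real script would probably 'exit 1' instead.
check_file() {
    if [ ! -f "$SOMETHING" ]; then
        warn "$0: missing file $SOMETHING"
        warn "$0: aborting"
        return 1
    fi
}
```

With this arrangement a mistyped redirection in 'warn' would affect every message at once, which makes it much harder to overlook than a single bad line in a rarely taken branch.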
(After we stumbled over such a file recently, I grep'd all of my scripts for '>2' and '>1'. I was relieved not to find any.)
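A search along those lines can be sketched as follows. The two sample scripts and the scratch directory are invented for the demonstration, and 'mktemp -d' is assumed to be available. The patterns are the plain '>1' and '>2' strings mentioned above, so they will also flag legitimate redirections to files whose names start with those digits (say, '>2017.log'), and they will miss variants with whitespace such as '> 2'.

```shell
#!/bin/sh
scratch=$(mktemp -d)

# One script with the correct redirection, one with the typo.
cat >"$scratch/good.sh" <<'EOF'
echo "something went wrong" 1>&2
EOF
cat >"$scratch/bad.sh" <<'EOF'
echo "something went wrong" 1>2
EOF

# -l lists only the names of files with at least one matching line;
# each -e adds a pattern. '>&2' does not contain '>2', so good.sh
# is not reported.
suspects=$(grep -l -e '>1' -e '>2' "$scratch"/*.sh)
echo "$suspects"
```

Swapping '-l' for '-n' shows the matching lines and their line numbers instead of just the file names.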
(For more fun with redirection in the Bourne shell, see also how to pipe just standard error.)
2017-06-10
One downside of the traditional style of writing Unix manpages
A while back I wrote about waiting for a specific wall-clock time in Unix, which according to POSIX you can do by using clock_nanosleep with the CLOCK_REALTIME clock and the TIMER_ABSTIME flag. This is fully supported on Linux (cf) and not supported on FreeBSD. But here's a question: is it supported on Illumos-derived systems?
So, let us consult the Illumos clock_nanosleep manpage. This manpage is very much written in the traditional (corporate) style of Unix manpages, high on specification and low on extra frills. This style either invites or actively requires a very close reading, paying very careful attention to both what is said and what is not said. The Illumos manpage does not explicitly say that your sleep immediately ends if the system's wall clock time is adjusted forward far enough; instead it says, well:
If the flag TIMER_ABSTIME is set in the flags argument, the clock_nanosleep() function causes the current thread to be suspended from execution until either the time value of the clock specified by clock_id reaches the absolute time specified by the rqtp argument, [or a signal happens]. [...]
The suspension time caused by this function can be longer than requested because the argument value is rounded up to an integer multiple of the sleep resolution, or because of the scheduling of other activity by the system. [...]
The suspension for the absolute clock_nanosleep() function (that is, with the TIMER_ABSTIME flag set) will be in effect at least until the value of the corresponding clock reaches the absolute time specified by rqtp, [except for signals].
On the surface, this certainly describes a fully-featured implementation of clock_nanosleep that behaves the way we want. Unfortunately, if you're a neurotic reader of Unix manpages, all is not so clear. The potential weasel words are 'the suspension ... will be in effect at least until ...'. If you don't shorten CLOCK_REALTIME timeouts when the system clock jumps forward, you are technically having them wait 'at least until' the clock reaches their timeout value, because you sort of gave yourself room to have them wait (significantly) longer. At the same time this is a somewhat perverse reading of the manpage, partly because the first sentence of that paragraph alleges that the system will only delay waking you up because of scheduling, which would disallow this particular perversity.
To add to my uncertainty, let's look at the Illumos timer_settime manpage, which contains the following eyebrow-raising wording:
If the flag TIMER_ABSTIME is set in the argument flags, timer_settime() behaves as if the time until next expiration is set to be equal to the difference between the absolute time specified by the it_value member of value and the current value of the clock associated with timerid. That is, the timer expires when the clock reaches the value specified by the it_value member of value. [...]
These two sentences do not appear to be equivalent for the case of CLOCK_REALTIME clocks. The first describes an algorithm that freezes the time to (next) expiration when timer_settime is called, which is not proper CLOCK_REALTIME behavior, and then the second broadly describes correct CLOCK_REALTIME behavior where the timer expires if the real time clock advances past it for any reason.
With all that said, Illumos probably fully implements CLOCK_REALTIME, with proper handling of the system time being adjusted while you're suspended or have a timer set. But its manpages never come out and say that explicitly, because that's simply not the traditional style of Unix manpages, and the way they're written leaves me with uncertainty. If I cared about this, I would have to write a test program and then run it on a machine where I could set the system time both forward and backward.
This fault is not really with these specific Illumos manpages, although some elements of their wording aren't helping things. This is ultimately a downside to the terse, specification-like traditional style of Unix manpages. Where every word may count and the difference between 'digit' and 'digits' matters, you sooner or later get results like this, situations where you just can't tell.
(Yes, this would be a perverse implementation and a weird way of writing the manpages, but (you might say) perhaps the original Solaris corporate authors really didn't want to admit in plain text that Solaris didn't have a complete implementation of CLOCK_REALTIME.)
Also, I'm sure that different people will read these manpages differently. My reading is unquestionably biased by knowing that clock_nanosleep support is not portable across all Unixes, so I started out wondering if Illumos does support it. If you start reading these manpages with the assumption that of course Illumos supports it, then you get plenty of evidence for that position and all of the wording that I'm jumpy about is obviously me being overly twitchy.
2017-06-04
Why the popen() API works but more complex versions blow up
Years ago I wrote about a long-standing Unix issue with more sophisticated versions of popen(); my specific example was writing a large amount of stuff to a subprogram through a pipe and then reading its output, where both sides stall trying to write to full pipes. Of course this is not the only way to have this problem bite you, so recently I ran across Andrew Jorgensen's A Tale of Two Pipes (via), where the same problem comes up when a subprogram writes to both standard output and standard error and you consume them one at a time.
Things like Python's subprocess module and many other imitators generally trace their core idea back to the venerable Unix popen(3) library function, which first appeared in V7 Unix. However, popen() itself does not actually have this problem; only more sophisticated and capable interfaces based on it do.
The reason popen() doesn't have the problem is straightforward and points to the core problem with more elaborated versions of the API. popen() doesn't have a problem because it only gives you a single IO stream, either the sub-program's standard input or its standard output. More sophisticated APIs give you multiple streams, and multiple streams are where you get into trouble. You get into trouble because more sophisticated APIs with multiple streams are implicitly pretending that the streams can be dealt with independently and serially, ie that you can fully process one stream before looking at another one at all. As A Tale of Two Pipes makes clear, this is not so. In actuality the streams are inter-dependent and have to be processed together, although Unix pipe buffers can hide this from you for a while.
Of course you can handle the streams properly yourself, resorting to poll() or some similar measure. But you shouldn't have to remember to do that, partly because as long as you have to take additional complex steps to make things work right, people are going to be forgetting this requirement. In the name of looking simple and generic, these APIs have armed a gun that is pointed straight at your feet. A more honest API would make the inter-dependency clear, perhaps by returning a Subprocess object that you registered callbacks on. Callbacks have a bad reputation but they at least make it clear that things can (and will) happen concurrently, instead of one stream being fully handled before another stream is even touched.
(Go has an interesting approach to the problem that is sort of half solution and half not. In its core os/exec API for this, you provide streams which will be read from or written to asynchronously. However there are helper methods that give you a more traditional 'here is a stream' interface and with it the traditional problems.)
Sidebar: Why people keep creating these flawed subprogram APIs on Unix
These APIs keep getting created because they're attractive. How the API appears to behave (ie, without the deadlock issues) is how people often want to deal with subprograms. Most of the time you're not interacting with them step by step, sending in some input and collecting some output; instead you're sending in the input, collecting the output, and maybe collecting standard error as well in case something blew up. People don't want to write poll() based loops or callbacks or anything complicated, because concurrency is at least annoying. They just want the simple API to work.
Possibly libraries should make the straightforward user code work by handling all of the polling and so on internally and being willing to buffer unlimited amounts of standard output and standard error. This would probably blow up less often than the current scheme does, and you could provide various options for how much to buffer and how to deal with overflow for advanced users.
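In a shell script, one rough analogue of 'buffer everything' is to point each stream at its own temporary file instead of a pipe, since writing to a file doesn't stall the way writing to a full pipe does. The sketch below is only a shell-level illustration of the idea, not how any particular library works. The names 'run_capture' and 'noisy' and the variable names are hypothetical, and 'mktemp' is assumed to be available.

```shell
#!/bin/sh
# noisy is a stand-in subprogram: it writes 200000 bytes to standard error
# and then one line to standard output. That is more than a typical pipe
# buffer holds, so reading its stdout pipe to EOF first would stall it.
noisy() {
    dd if=/dev/zero bs=1000 count=200 2>/dev/null | tr '\000' 'e' >&2
    echo "done"
}

# Run a command with stdout and stderr each going to a temporary file.
# Sets out_file, err_file and status for the caller.
run_capture() {
    out_file=$(mktemp) || return 1
    err_file=$(mktemp) || return 1
    status=0
    "$@" >"$out_file" 2>"$err_file" || status=$?
}

run_capture noisy

# The command has finished, so the two files can be read one at a time,
# in either order, without any risk of a stall.
echo "exit status: $status"
echo "stdout: $(cat "$out_file")"
echo "stderr bytes: $(($(wc -c <"$err_file")))"
```

The price is that nothing is visible until the command exits, and the temporary files need cleaning up afterwards (for example with rm -f "$out_file" "$err_file"). That is much the same trade a library would be making with unlimited in-memory buffering.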