2023-08-05
How the rc shell handles whitespace in various contexts
I recently read Mark Jason Dominus's The shell and its crappy handling of whitespace, which is about the Bourne shell and its many issues with whitespace in various places. I'm a long time user of (a version of) Tom Duff's rc shell (here's why I switched), which was written for Research Unix Version 10 and then Plan 9 to (in part) fix various issues with the Bourne shell. You might ask if rc solves these whitespace issues; my answer is that it about half solves them, and the remaining half is a hard to deal with problem area (although the version of rc I use doesn't support some things that would help).
(There's also the Plan 9 from User Space version of rc, which I believe is basically the original Plan 9 rc. The rc reimplementation that I use is mildly augmented and slightly incompatible with the Plan 9 rc.)
As covered in the manual page for the rc I use, shell variables in rc are fundamentally lists made up of one or more items (making the value of a shell variable be a zero length list effectively erases it). Rc draws a distinction between a variable whose value is an empty string (a list with one empty element) and a zero-length (empty) list:
; null='' empty=()
; echo $#null $#empty
1 0
('$#<var>' is how you get how many elements are in a list.)
When shell variables are expanded, they're replaced by their list of items, each of which becomes a separate argument. Given a hypothetical program that reports how many arguments it's been invoked with:
; l=(a b 'space separated')
; numargs $l
3
; v='some space separated thing'
; numargs $v
1
This means that rc needs no special handling for '$*', the shell
variable of arguments to your shell script (or function within the
script). It's a variable that is a list of all of the arguments,
and if a particular argument has internal whitespace, that won't
be expanded into multiple arguments when it's used. So you can
write the following in complete confidence:
for (i in $*) {
  step1 $i
  step2 $i
}
(Rc provides a way to flatten a list into a space separated single value, if you want to do that, but mostly you don't.)
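For comparison, here's a minimal sketch of what the Bourne shell needs in order to get the same effect. The 'numargs' function is a stand-in for the hypothetical argument-counting program above (it's an assumption for illustration, not a real command). As far as I can tell, the Bourne shell only matches rc's behaviour when you remember to write the quoted "$@" form:

```shell
#!/bin/sh
# numargs is a hypothetical helper: it reports how many arguments it received.
numargs() { echo "$#"; }

# Set the positional parameters to three arguments, one with internal spaces.
set -- a b 'space separated'

# Unquoted $* is word-split, so the third argument breaks into two words.
unquoted=$(numargs $*)

# Quoted "$@" keeps each argument intact, which is what rc's $* does by default.
quoted=$(numargs "$@")

echo "unquoted: $unquoted, quoted: $quoted"
```

With the default IFS, this should report four arguments for the unquoted form and three for the quoted one. In rc there is no wrong form to pick.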
Similarly, you can safely use '$*' as a whole, as in the 'yell'
example from the article:
#!/usr/bin/rc
printf 'I am about to run ''%s'' now!!!\n' $^*
exec $*
(Here we see a rare use for flattening a list to one element, right after I said you mostly don't need it.)
When rc expands filename wildcards, the result is a list where each element is a single filename, even if the filename has whitespace in it. You can assign this to a variable or use it directly in a loop, and either works correctly with no whitespace problems:
for (i in *.jpg) {cp $i /tmp}
But this is where the good news ends, because of good old fashioned Unix conventions for how programs produce (or report) multiple results. Consider the example that Dominus gave of changing the suffix of a bunch of files. In rc, the starting version of this is:
for (i in *.jpeg) { mv $i `{suf $i}^.jpg }
However this has the same issue as the Bourne shell version. Rc's
backquote substitution generates a list from the command's output,
and normally it breaks the output into list elements based on
whitespace. So if the 'suf' command prints out a result that has
whitespace in it, rc will error out in the same way. We can see
this in action with:
; l=`{echo 'one two three'}
; echo $#l
3
To step around this we need to use a special version of rc backquote
substitution that specifies the word separator (this is a feature
not in the Plan 9 rc, which requires you to change '$ifs'). But
there is another trap here with the simple version, which is:
; l=`` () {echo one two three}
; echo $#l $l
1 one two three

;
We got an extra newline because rc took us at our word; when we
said there was no separator, it didn't strip off the final newline
that echo added. So to do what we want, we need to have a '$nl'
variable with a newline in it and then write:
for (i in *.jpeg) { mv $i `` $nl {suf $i}^.jpg }
Unfortunately, this won't work if any of the filenames have newlines
in them. Fixing that is theoretically possible but much more complex
(you need an auxiliary function to reassemble the output of 'suf'
into a single variable with newline separation of the components).
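As a point of reference, here is a rough sketch of how the quoted form of command substitution behaves in the Bourne shell for this rename case. The 'suf' function is a hypothetical stand-in for the article's command (assumed to print its argument minus the final suffix), and 'newname' is a helper made up for this illustration. A quoted "$(...)" keeps internal whitespace, including internal newlines, and strips only trailing newlines:

```shell
#!/bin/sh
# suf is a hypothetical stand-in for the article's command: it prints its
# argument with the final .suffix removed, followed by a newline.
suf() { printf '%s\n' "${1%.*}"; }

# newname is a hypothetical helper that builds the target name for one file.
# Quoting the substitution keeps the result as a single word; the shell
# strips the trailing newline that suf added but leaves internal ones alone.
newname() { printf '%s' "$(suf "$1").jpg"; }

# A loop body would then look like: mv "$i" "$(newname "$i")"
newname 'holiday photo.jpeg'; echo
```

This is the Bourne shell equivalent of using `` $nl in rc, except that it also copes with newlines in the middle of a name. A name whose stem ends in a newline would still lose it.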
Incidentally, this means that rc is worse than the Bourne shell for
Dominus's 'lastdl' example, where that program reports the most
recent downloaded file and is ideally used as 'something $(lastdl)'.
In the Bourne shell you can at least force the right interpretation
with a simple 'something "$(lastdl)"'. Since rc doesn't have a
simple syntax for this forced single result case, you have to do
something more verbose. If you have a '$nl' normally defined in
your shell environment, you can write:
something `` $nl {lastdl}
which works but is far from aesthetic or pleasant. Frankly, I'd
use Dominus's workaround of making lastdl rename to a safe name,
which in the version of rc I use would let me write:
something `lastdl
It might be possible to add some features to rc to make this case
a bit easier. For instance, maybe a `^{...} operation could
flatten the backquote substitution list down to a space separated
single argument (by analogy to rc's '$^<var>' that flattens a
shell variable to a single element). Or there could be a special
backquote version that takes everything literally but strips the
trailing newline.
Another change to the version of rc that I use that could help in industrial strength scripts is the ability to define and use a shell variable that contains the null (zero) byte. In a hypothetical version of rc where this worked, you could then write:
z=`{printf '\0'}
names=`` $z {find whatever ... -print0}
Separation with zero bytes is the de facto Unix standard for safely passing around completely arbitrary filenames (well, file paths), since the zero byte is the one thing that can't appear in them.
(Bash doesn't do any better here, but it at least reports that it's ignoring the null byte when it handles the backquote substitution.)
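Since neither shell can hold a zero byte in a variable, the usual workaround is to never capture the names at all and instead stream them through a pipe. Here is a minimal sketch of that, assuming a find that supports -print0, an xargs that supports -0, and a mktemp that supports -d (all common, but not guaranteed everywhere). The scratch directory and file names are made up for the illustration:

```shell
#!/bin/sh
# Build a scratch directory holding names with a space and a newline in them.
dir=$(mktemp -d)
nl='
'
: > "$dir/plain.txt"
: > "$dir/with space.txt"
: > "$dir/with${nl}newline.txt"

# Stream NUL-separated paths straight into xargs, so each path arrives as
# exactly one argument. The inner sh only reports how many arguments it got.
count=$(find "$dir" -type f -print0 | xargs -0 sh -c 'echo "$#"' sh)

echo "files seen: $count"
rm -rf "$dir"
```

This sidesteps the problem rather than solving it; you still can't get the list of names into a shell variable, which is what the hypothetical rc feature would give you.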
In the modern Unix world, one practical workaround for many of these
issues might be to remove space from your default interactive $IFS
(although in a Bourne shell this has consequences for what "$*"
expands to). A lot of the time these days space is effectively not
a word splitting separator you normally want, because filenames and
so on have spaces in them on a regular basis. In a new Unix shell,
I would be quite tempted to make the default backquote substitution
not split on spaces and have a longer form one that did, although
maybe the whole area of backquote substitutions needs some deep
thought.
(In a new Unix shell, wildcard filename expansion should definitely not perform word splitting on the result.)
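To make the $IFS idea concrete, here is a small Bourne shell sketch of dropping space from $IFS while keeping tab and newline. Both 'numargs' and this 'lastdl' are hypothetical stand-ins defined just for the illustration. It also shows the "$*" side effect mentioned earlier, since "$*" joins its arguments with the first character of $IFS:

```shell
#!/bin/sh
# numargs is a hypothetical helper: it prints how many arguments it received.
numargs() { echo "$#"; }

# lastdl is a hypothetical stand-in that reports a filename containing a space.
lastdl() { echo 'my download.pdf'; }

# With the default IFS, the unquoted substitution splits at the space.
with_space=$(numargs $(lastdl))

# Keep tab and newline in IFS but drop the space. The trailing x protects
# the newline from being stripped by command substitution; it is then removed.
IFS=$(printf '\t\nx')
IFS=${IFS%x}

# Now the same unquoted substitution yields a single argument.
without_space=$(numargs $(lastdl))

# Side effect: "$*" now joins the arguments with a tab instead of a space.
set -- a b c
joined="$*"

printf 'with space in IFS: %s, without: %s\n' "$with_space" "$without_space"
```

Filenames with tabs or newlines in them will still be split, so this only narrows the problem to the rarer cases.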
PS: Drew DeVault is working on a shell that is called 'rc', and
this shell may solve some of these whitespace problems, based on
DeVault's post.
However, DeVault's rc doesn't have syntax compatible with the
original Duff rc and its reimplementations. I admit that I don't
understand how DeVault's rc passes all of these cases based on its
current manual page,
because it says it splits the result of `{...} backquote substitution
on (its) '$ifs', which is said to include all of the usual
whitespace.