Wandering Thoughts archives

2023-08-05

How the rc shell handles whitespace in various contexts

I recently read Mark Jason Dominus's The shell and its crappy handling of whitespace, which is about the Bourne shell and its many issues with whitespace in various places. I'm a long-time user of (a version of) Tom Duff's rc shell (here's why I switched), which was written for Research Unix Version 10 and then Plan 9 to (in part) fix various issues with the Bourne shell. You might ask if rc solves these whitespace issues; my answer is that it about half solves them, and the remaining half is a hard-to-deal-with problem area (although the version of rc I use doesn't support some things that would help).

(There's also the Plan 9 from User Space version of rc, which I believe is basically the original Plan 9 rc. The rc reimplementation that I use is mildly augmented and slightly incompatible with the Plan 9 rc.)

As covered in the manual page for the rc I use, shell variables in rc are fundamentally lists made up of zero or more items (making the value of a shell variable be a zero length list effectively erases it). Rc draws a distinction between an empty string and a zero-length (empty) list:

; null='' empty=() echo $#null $#empty
1 0

('$#<var>' is how you get how many elements are in a list.)

When shell variables are expanded, they're replaced by their list of items, each of which becomes a separate argument. Given a hypothetical program that reports how many arguments it's been invoked with:

; l=(a b 'space separated')
; numargs $l
3
; v='some space separated thing'
; numargs $v
1

This means that rc needs no special handling for '$*', the shell variable of arguments to your shell script (or function within the script). It's a variable that is a list of all of the arguments, and if a particular argument has internal whitespace, that won't be expanded into multiple arguments when it's used. So you can write the following in complete confidence:

for (i in $*) {
  step1 $i
  step2 $i
}

(Rc provides a way to flatten a list into a space separated single value, if you want to do that, but mostly you don't.)

Similarly, you can safely use '$*' as a whole, as in the 'yell' example from the article:

#!/usr/bin/rc
printf 'I am about to run ''%s'' now!!!\n' $^*
exec $*

(Here we see a rare use for flattening a list to one element, right after I said you mostly don't need it.)
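For contrast, the Bourne shell needs its special "$@" form to get what rc's plain '$*' gives you, while "$*" is the rough analogue of rc's flattening '$^*'. A minimal sketch of the difference, where 'count' and 'demo' are hypothetical helper names and the default $IFS is assumed:

```shell
# Print how many arguments we were called with.
count() {
    echo "$#"
}

# Expand the positional parameters three different ways.
demo() {
    count $*      # unquoted: each argument is re-split on whitespace
    count "$*"    # quoted $*: all arguments joined into one string
    count "$@"    # quoted $@: one argument per original argument
}

demo 'two words' single
```

With two arguments, one of which contains a space, this reports 3, 1 and 2 in that order; only the "$@" form preserves the original argument boundaries.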

When rc expands filename wildcards, the result is a list where each element is a single filename, even if the filename has whitespace in it. You can assign this to a variable or use it directly in a loop, and either works correctly with no whitespace problems:

for (i in *.jpg) {cp $i /tmp}

But this is where the good news ends, because of good old fashioned Unix conventions for how programs produce (or report) multiple results. Consider the example that Dominus gave of changing the suffix of a bunch of files. In rc, the starting version of this is:

for (i in *.jpeg) {
  mv $i `{suf $i}^.jpg
}

However this has the same issue as the Bourne shell version. Rc's backquote substitution generates a list from the command's output, and normally it breaks the output into list elements based on whitespace. So if the 'suf' command prints out a result that has whitespace in it, rc will error out in the same way. We can see this in action with:

; l=`{echo 'one two three'}
; echo $#l
3

To step around this we need to use a special version of rc backquote substitution that specifies the word separator (this is a feature not in the Plan 9 rc, which requires you to change '$ifs'). But there is another trap here with the simple version, which is:

; l=`` () {echo one two three}
; echo $#l $l
1 one two three

;

We got an extra newline because rc took us at our word; when we said there was no separator, it didn't strip off the final newline that echo added. So to do what we want, we need to have a '$nl' variable with a newline in it and then write:

for (i in *.jpeg) {
  mv $i `` $nl {suf $i}^.jpg
}

Unfortunately, this won't work if any of the filenames have newlines in them. Fixing that is theoretically possible but much more complex (you need an auxiliary function to reassemble the output of 'suf' into a single variable with newline separation of the components).
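For comparison, the Bourne shell version of this loop can be made safe for spaces by quoting every expansion, including the command substitution. This is only a sketch: the 'suf' function here is a hypothetical stand-in that prints its argument minus the last suffix, and 'rename_jpegs' is a made-up wrapper name.

```shell
# Hypothetical stand-in for 'suf': print the name minus its last suffix.
suf() {
    printf '%s\n' "${1%.*}"
}

# Rename every *.jpeg in directory $1 to *.jpg, quoting every expansion
# so that whitespace inside a filename is never used for word splitting.
rename_jpegs() {
    for i in "$1"/*.jpeg; do
        [ -e "$i" ] || continue      # the pattern matched nothing
        mv -- "$i" "$(suf "$i").jpg"
    done
}
```

Bourne shell command substitution strips trailing newlines from the output, so this has its own version of the newline problem: a name whose stem ends in a newline would still be mangled.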

Incidentally, this means that rc is worse than the Bourne shell for Dominus's 'lastdl' example, where that program reports the most recent downloaded file and is ideally used as 'something $(lastdl)'. In the Bourne shell you can at least force the right interpretation with a simple 'something "$(lastdl)"'. Since rc doesn't have a simple syntax for this forced single result case, you have to do something more verbose. If you have a '$nl' normally defined in your shell environment, you can write:

something `` $nl {lastdl}

which works but is far from aesthetic or pleasant. Frankly, I'd use Dominus's workaround of making lastdl rename to a safe name, which in the version of rc I use would let me write:

something `lastdl

It might be possible to add some features to rc to make this case a bit easier. For instance, maybe a `^{...} operation could flatten the backquote substitution list down to a space separated single argument (by analogy to rc's '$^<var>' that flattens a shell variable to a single element). Or there could be a special backquote version that takes everything literally but strips the trailing newline.

Another change to the version of rc that I use that could help in industrial strength scripts is the ability to define and use a shell variable that contains the null (zero) byte. In a hypothetical version of rc where this worked, you could then write:

z=`{printf '\0'}
names=`` $z {find whatever ... -print0}

Separation with zero bytes is the de facto Unix standard for safely passing around completely arbitrary filenames (well, file paths), since the zero byte is the one thing that can't appear in them.

(Bash doesn't do any better here, but it at least reports that it's ignoring the null byte when it handles the backquote substitution.)
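Since a shell variable can't hold the zero byte, the usual Bourne shell approach is to keep zero-separated names in a pipeline instead of a variable. A minimal sketch, assuming a find with '-print0' and an xargs with '-0' (both are in the GNU and BSD tools); 'count_files' and 'count_args' are hypothetical names:

```shell
# Count regular files under directory $1 by counting the NUL terminators
# that find emits, so a name with spaces or newlines is counted once.
count_files() {
    find "$1" -type f -print0 | tr -cd '\0' | wc -c | tr -d ' '
}

# Hand every name to a command as one intact argument via xargs -0.
# The inner sh just reports how many arguments it received; for a very
# large tree xargs may split the names across several invocations.
count_args() {
    find "$1" -type f -print0 | xargs -0 sh -c 'echo "$#"' sh
}
```

Both functions agree on the number of files even when a name contains a newline, which is exactly the case where counting or splitting on newlines goes wrong.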

In the modern Unix world, one practical workaround for many of these issues might be to remove space from your default interactive $IFS (although in a Bourne shell this has consequences for what "$*" expands to). A lot of the time these days space is effectively not a word splitting separator you normally want, because filenames and so on have spaces in them on a regular basis. In a new Unix shell, I would be quite tempted to make the default backquote substitution not split on spaces and have a longer form one that did, although maybe the whole area of backquote substitutions needs some deep thought.

(In a new Unix shell, wildcard filename expansion should definitely not perform word splitting on the result.)
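The $IFS workaround can be sketched in a Bourne shell as follows. The 'lastdl' function is a hypothetical stand-in that reports a filename with spaces in it, and 'count' and 'count_with_ifs' are made-up helper names; the '$nl' variable plays the same role as the rc '$nl' used earlier.

```shell
# Print how many arguments we were called with.
count() {
    echo "$#"
}

# Hypothetical stand-in for 'lastdl': report a filename containing spaces.
lastdl() {
    echo 'holiday photo 01.jpg'
}

# A variable holding a single newline.
nl='
'

# Run an unquoted command substitution with $IFS set to $1. The subshell
# keeps the IFS change from leaking into the rest of the script.
count_with_ifs() {
    (
        IFS=$1
        count $(lastdl)
    )
}

count_with_ifs " $nl"   # space and newline split: three arguments
count_with_ifs "$nl"    # newline only: the filename stays one argument
```

With space removed from $IFS, the unquoted 'something $(lastdl)' form does what you'd want for filenames with spaces, although filenames with newlines in them are still a problem.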

PS: Drew DeVault is working on a shell that is called 'rc', and this shell may solve some of these whitespace problems, based on DeVault's post. However, DeVault's rc doesn't have syntax compatible with the original Duff rc and its reimplementations. I admit that I don't understand how DeVault's rc passes all of these cases based on its current manual page, because it says it splits the result of `{...} backquote substitution on (its) '$ifs', which is said to include all of the usual whitespace.

unix/RcShellWhitespaceHandling written at 23:05:57

