2023-11-23
Unix's 'test' program and the V7 Bourne shell
Recently I read Julio Merino's test, [, and [[ (via), which is in part
about there being a real '[' binary and a 'test' binary to go
along with it, and as part of that, Merino wonders why the name
'test' exists at all. I don't have any specific insight into this,
but I can talk a bit about the history, which turns out to be more
tangled and peculiar than I thought.
The existence of 'test' goes back to V7 Unix, which is also where
the Bourne shell was introduced. In V7, the manual page for the
program is test(1), which
has no mention of the '[' alternate name, and the source is cmd/test.c,
which has a comment at the start about the '[' usage and code to
support it. While 'test' is a much easier name to deal with in Unix
than '[', there seems to be more to this than just convenience.
There are a number of shell scripts, Makefiles, and so on in V7
Unix, and as far as I can tell all of them use 'test' and none
of them use '['.
(For example, bin/nohup, bin/calendar, bin/lookbib, and usr/src/cmd/learn/makefile.)
Another source of information is S. R. Bourne's An Introduction
to the Unix Shell (also
PDF version
and the V7 troff sources). In
section 2.5, Bourne introduces the 'test' command under that name,
and then goes on to use it with 'while' (section 2.6) and 'if'
(section 2.7). As far as I can see there's no mention of the '['
alternate name.
In trawling through various sources of information, I can't actually
find any clear sign that V7 ever had a '[' hard link for 'test'.
The test source code is definitely ready for this, but such a hard
link doesn't exist. 4BSD has a src/cmd/DESTINATIONS
file that suggests that /usr/bin/[ existed at this point (alongside
/usr/bin/test), but that's the earliest trace I could find.
In 4.1c BSD we finally have clear evidence of /usr/bin/[ in the
form of src/bin/Makefile,
which explicitly creates it as a hard link to /usr/bin/test.
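As a modern aside, both names are still real executables on typical Linux systems today (coreutils provides /usr/bin/test and /usr/bin/[), which is easy to check. A small sketch, assuming both programs are on $PATH:

```python
import shutil
import subprocess

# On most modern Unixes, 'test' and '[' are both real executables;
# '[' additionally insists on a closing ']' argument.
test_path = shutil.which("test")
bracket_path = shutil.which("[")
print(test_path, bracket_path)

# Both forms check the same condition; only '[' needs the trailing ']'.
r1 = subprocess.run([test_path, "-d", "/tmp"])
r2 = subprocess.run([bracket_path, "-d", "/tmp", "]"])
print(r1.returncode, r2.returncode)  # 0 and 0 if /tmp is a directory
```

(Whether these are hard links to the same file or separate binaries varies by system; on many Linux systems they are separate coreutils binaries.)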
However, there's something rather interesting in the V7 Bourne shell
source code, in the form of vestigial, disabled support for a
'[' builtin. In msg.c,
there is a commented out section toward the bottom:
[...]
SYSTAB  commands {
        {"cd", SYSCD},
        {"read", SYSREAD},
/*      {"[", SYSTST}, */
        {"set", SYSSET},
[...]
Then in xec.c
there's commented out code that would have handled SYSTST in the
execute()
function:
[...]
case SYSREAD:
        exitval=readvar(&com[1]);
        break;
/*
case SYSTST:
        exitval=testcmd(com);
        break;
*/
[...]
There's no actual 'testcmd()' function in the V7 Bourne shell source
code, but we can guess what it might have done.
Given this disabled code and that the V7 'test' itself supported
being used as '[', it seems possible that this syntax was Bourne's
preference. It's possible that the builtin '[' was implemented and
then removed in favor of '[' being a hardlink to 'test', and then
for whatever reason other people in Bell Labs didn't use it and V7
wasn't distributed with such a hardlink set up (although individual
installs could make it themselves and it appears that the result
would work). However, this may have been the other way around, per
this HN comment,
with Bourne preferring the 'test' form over the '[' form.
As it happens, I don't think the 'test' command (and its syntax)
appeared from nowhere in V7; instead I believe we can trace it to
antecedents in V6 Unix. But that's going to take another entry to
discuss, since this one is already long enough.
2023-11-07
The Vim features that make me a Vim user instead of a Vi user
Over on the Fediverse there was a little Vim versus Vi discussion, and in response to seeing it I posted something:
I used to be a minimal vi user. Over the years I've drifted to being a not so minimal vim user, and I think the vim features that I'm now addicted to are:
- infinite undo and redo (and a tree view of undo)
- unlimited backspacing in insert mode (true vi only lets you backspace so far)
- vim windows, which let me have multiple files on screen at once (this used to be vi's big limit versus emacs)
- recently, visual mode, both line and character.
(I use other vim things but these matter to me.)
(For example, vim settings for YAML and incrementing and decrementing numbers.)
Back in 2020 I wrote about realizing that I was now a Vim user, citing Vim's powerful undo and Vim windows; the other things I mentioned are new in my awareness since then. Unlimited backspacing in insert mode is one of those Vim features that are so instinctively right that I didn't realize (or remember) that classical Vi is rather more restricted that way, much like unlimited undo.
(OpenBSD vi only lets you backspace in insert mode within the current insertion, and my vague memory is that classical Vi may not have let you back up to previous lines even within a single insertion.)
Vim's visual mode is more specialized and limited, but for the kind of editing that I do it's turned out to be quite convenient, enough so that I use it regularly and would miss it if I had to do without it.
OpenBSD's vi is probably the closest I come today to a pure old fashioned Vi experience. I can definitely edit files in it without problems (I do every so often), and I often don't notice any difference from Vim if I'm editing a single file for straightforward changes (where I only need to undo simple mistakes immediately), which is the typical case for what I do on our OpenBSD machines. However, if I only had Vi and not Vim I probably wouldn't use vi(m) as much as I do today; I'd be much more likely to reach for other editors for multi-level undo and split screen editing of multiple files (with the ability to move text from one file to the other).
(I'd probably still use vi a lot, because the forces pushing me to it are fairly strong and I was using 'vim as vi' well before I started using 'vim as vim'.)
PS: I know that there are people who like the appeal of the simple (and BSD-pure) original Vi, but I'm not a Unix purist these days. I'm lazy; unlimited (and sophisticated) undo, backspacing as much as I want, multiple windows, and so on are all quite convenient (with very little effort on my part, and they work in all the many environments I use vim in). I use Vim instead of Vi for much the same reason that I now have file and command completion in my shell.
(I might feel differently about this if I'd been a heavy Vi user and was very used to its specific quirks, but I only started seriously using vi(m) in the Vim era.)
2023-10-24
dup()'s shared file IO offset is a necessary part of Unix
In a recent entry I noted dup()'s somewhat weird-seeming behavior: the new file descriptor you get from dup() (and also from dup2(), its sometimes better sibling) shares the file's IO offset with the original file descriptor. This behavior is different from open()'ing the same file again, where you get a file descriptor with an independent file IO offset (sometimes called the seek offset or the seek position). In discussing this on the Fediverse, I wondered if this was only a convenient implementation choice. The answer, which I should have realized even at the time, is that dup()'s shared IO offset is a necessary part of Unix pipelines (especially in the context of older Unixes, such as V7 Unix).
Consider the following illustrative shell pipeline:
$ (cmd1 | cmd2 | cmd3) 2>/tmp/errors-log
Here we want to redirect any errors from these commands (and any sub-things they run) into /tmp/errors-log. We want all of the errors, with them in errors-log in the order they were printed by the various commands (which is not necessarily pipeline order; cmd3 could write some complaints before cmd2 did, for example).
If the shell opens /tmp/errors-log once and dup()'s the resulting
file descriptor to standard error for cmd1, cmd2, and cmd3, this
is exactly what you get, and it's because of that shared file IO
offset. Every time any of the commands writes to standard error,
they advance the offset of the next write() for all of the commands
at once. Today you could get the same effect for writes with
O_APPEND, but that wasn't in V7 Unix.
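You can see this shared-offset behavior (and what goes wrong without it) directly with the dup() and open() system calls. A small illustrative sketch in Python, whose os module exposes these calls nearly one-to-one:

```python
import os
import tempfile

# a scratch file to play with
tmp_fd, path = tempfile.mkstemp()
os.close(tmp_fd)

fd = os.open(path, os.O_WRONLY)
fd2 = os.dup(fd)               # fd2 shares fd's IO offset

os.write(fd, b"first\n")       # the shared offset is now 6
os.write(fd2, b"second\n")     # continues at 6, not back at 0

# A separate open() of the same file gets an independent offset,
# starting at 0, and so stomps on what's already there.
fd3 = os.open(path, os.O_WRONLY)
os.write(fd3, b"CLOBBER")

with open(path, "rb") as f:
    data = f.read()
print(data)                    # b'CLOBBERecond\n'
```

The dup()'d descriptors append after each other, just as dup()'d standard error descriptors in a pipeline do, while the independently open()'d descriptor overwrites the start of the file, which is exactly what would happen to your errors-log without the shared offset.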
The shared offset also makes setting up standard input easier in some shell situations. Consider this:
$ (cmd1; cmd2; cmd3) <input-file
Implementing this without dup()'s shared IO offset would require that the parent shell set up standard input once, before it started forking children, so that it could pass the same file descriptor to all of them. With dup(), the parent can merely open input-file and then leave it to each child to dup() it on to standard input at an appropriate time.
There's a closely related idiom that also requires these dup() semantics even in a single process. Consider:
$ command >/tmp/out 2>&1
You want both standard output and standard error in the same file, interleaved in the order they were written, but in the child process these are necessarily two different file descriptors. You need them to share the IO offset anyway, which is achieved by dup()'ing one onto the other (and doing so in a specific order).
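As a sketch of what the shell does for 'command >/tmp/out 2>&1' (using a temporary file instead of /tmp/out), again in Python with the relevant syscalls used directly:

```python
import os
import tempfile

tmp_fd, path = tempfile.mkstemp()
os.close(tmp_fd)

pid = os.fork()
if pid == 0:
    # child: set up redirections the way the shell would
    fd = os.open(path, os.O_WRONLY | os.O_TRUNC)
    os.dup2(fd, 1)             # '>/tmp/out'
    os.dup2(1, 2)              # '2>&1': fd 2 now shares fd 1's offset
    os.write(1, b"out1\n")
    os.write(2, b"err1\n")
    os.write(1, b"out2\n")
    os._exit(0)

os.waitpid(pid, 0)
with open(path) as f:
    content = f.read()
print(content)                 # out1, err1, out2, in write order
```

Because the dup2(1, 2) makes fd 2 share fd 1's open file description, the writes land in the file in the order they happened, rather than overwriting each other.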
Even without these dup() semantics, sharing the file IO offset of the same (inherited) file descriptor between processes is basically essential. Consider:
$ make >/tmp/output
Make will write to standard output and it will pass its own standard output file descriptor on to children (ie, all of the commands that get run from your Makefile) unchanged. All of the writes by all of the various processes to each individual file descriptor 1 have to all share an IO offset, or they'd repeatedly write over each other at the start of the file.
(You can create similar but more contrived examples with standard input coming from a file.)
Before I started writing this entry, I don't think I appreciated how important Unix's separation of the file IO offset from file descriptors is, or how deep it goes.
2023-10-22
Unix /dev/fd and dup(2)
I recently read Amber Screen's A tale of /dev/fd (via) which notes an odd behavior of Linux's /dev/fd and in fact of FreeBSD's /dev/fd as well, where if you try to open a /dev/fd/N name for a file descriptor, it does permissions checks and may refuse a process permissions to re-open a file descriptor it already has open. Screen writes:
To the dispassionate hacker (or a reader of Stevens), it's pretty clear that a syscall like
open("/dev/fd/5", O_RDONLY)
should be similar to dup(5). [...]
'Similar' is an extremely important word here, because you probably don't actually want this to act as if you called dup() on the original file descriptor. But we'll get to that.
A narrow Linux-specific reason that opening /dev/fd/N forces permissions re-checks is probably partly because /dev/fd is a symbolic link to /proc/self/fd, which is a self-referential instance of a general Linux /proc facility that allows you to open what any process's file descriptor points to, provided that you have the necessary permissions. This includes file descriptors for things that are now deleted (I've used this to rescue a file that I accidentally removed but a process still had open). You obviously need to do permissions checks on this in general, so it would take extra code to special case re-opening your own file descriptors.
Somewhat more broadly, you also need some permissions checks no matter what, because the re-open of /dev/fd/N (aka /proc/self/fd/N) could be for a different mode than the file descriptor is currently open for. If your process currently has a file descriptor opened read only and you try to re-open it for write through /dev/fd/N (or /proc/self/fd/N), that had better check if you can actually write to the file. At the same time you probably do want to allow mode changes like this for /dev/fd/N provided that the underlying file permissions (and other situation) allow it, partly because programs may require it for some files they open.
(This isn't just for re-opening files with write permissions when you started with read permission. You might also have file descriptors opened only for write that you now want to read.)
The reason that you don't want /dev/fd/N to do a dup() is that dup()'d file descriptors share more things with each other than separately opened file descriptors on the same file. For one prominent example (noted in both Linux dup(2) and FreeBSD dup(2)), the file offset is shared between all dup()'d file descriptors. If one such descriptor reads, writes, or lseeks, all file descriptors now act on the new file offset. This is fine if a program expects this behavior (or reasonably should), because it obtained the file descriptor through a dup() or dup2(). It's quite possibly not fine if the program gets gifted this behavior simply because it open()'d what it sees as a separate file name; in fact, this behavior would be more or less in contradiction to open()'s normal specification (which promises you an independent file object). So you don't want to implement /dev/fd/N by directly doing a dup() on the given file descriptor if the modes match; you need to do something more complicated.
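The difference is easy to demonstrate with read offsets. In this sketch, dup() shares the offset while a second open() of the same name gets an independent one; on Linux, open()'ing /dev/fd/N (ie /proc/self/fd/N) behaves like the second open(), giving you a new open file description:

```python
import os
import tempfile

tmp_fd, path = tempfile.mkstemp()
os.write(tmp_fd, b"abcdef")
os.close(tmp_fd)

fd = os.open(path, os.O_RDONLY)
first = os.read(fd, 3)           # b'abc'; the offset is now 3

dup_fd = os.dup(fd)
via_dup = os.read(dup_fd, 3)     # b'def': dup() shares the offset

# A fresh open() of the same name creates a new open file description
# with an independent offset, which is the behavior you want from
# opening /dev/fd/N.
reopen = os.open(path, os.O_RDONLY)
via_reopen = os.read(reopen, 3)  # b'abc' again, from offset 0

print(first, via_dup, via_reopen)
```

A program that open()s what it thinks is a separate file name would be badly surprised to find its reads starting at offset 3 because some other descriptor had already read that far.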
(To re-iterate Amber Screen's own quote, they aren't advocating for /dev/fd/N literally being a dup() of the file descriptor. As I read it, they are advocating for similar behavior with minimal permissions checking if required because you're changing the mode. Otherwise this /dev/fd/N re-open would make a new and separate copy of what POSIX calls the 'open file description' (see Linux open(2)'s discussion of this).)
2023-10-13
OpenBSD PF-based firewalls suffer differently from denial of service attacks
Suppose, hypothetically, that you have some DNS servers that are exposed to the Internet behind an OpenBSD PF-based firewall. Since you're a sensible person, you have various rate limits set in your DNS servers to prevent or at least mitigate various forms of denial of service attacks. One day, your DNS servers become extremely popular for whatever reason, your rate limits kick in, and your firewall abruptly stops allowing new connections in or out. What on earth happened?
The answer is that you ran out of room in the PF state table.
OpenBSD PF mostly works through state table entries,
and when a rule that normally would create a new state table entry
is unable to do so, the packet is dropped. This is somewhat documented
in places like the max
option for stateful rules:
Limits the number of concurrent states the rule may create. When this limit is reached, further packets that would create state are dropped until existing states time out.
(That this is more or less explicitly documented is better than it once was.)
One of the reasons that you can run out of state table entries despite your DNS servers dutifully rate-limiting their responses is that DNS is primarily UDP based and so PF doesn't really know if a given UDP 'connection' is 'closed' and so should have its state table entries cleaned up more aggressively. Instead, all PF does for UDP is guess timeouts based on packet counts, and those packet counts are for each unique set of source IP, source port, destination IP, and destination port. If your DNS query sources vary their source port for each query, this can add up fast.
(As we've seen, even TCP connections can linger in the state table for some time after they're closed.)
The current OpenBSD 7.3 manual page for pf.conf says that the default maximum size of the state table is only 100,000 entries, which is often effectively 50,000 'connections' (it's not uncommon for each connection to create two state table entries). It doesn't take a huge amount of bandwidth or a huge packets per second rate to exhaust that many state table entries, and it mostly doesn't matter whether or not your DNS servers actually respond to the queries.
That may sound odd so let's cover it explicitly. PF has three states for UDP traffic: 'first' if the source has only sent one packet; 'multiple' if both ends have sent packets, ie your DNS server responded; and 'single' if the source has sent multiple packets (with the same source port) without a response, ie your DNS server is dropping their queries and they're retrying. The first two states default to 60 second timeouts and the third defaults to a 30 second timeout, and that's after packets stop flowing. A DNS query source that keeps re-sending its query every fifteen seconds (with the same source port) will keep even a 'single' state entry alive forever.
As far as I can see, the only really good way to limit states created
by UDP traffic is to set a max
option on the rules involved. Often
this will cover only half of the states created by this traffic
(for reasons covered in my entry on state table entries). You can try to limit the number of source
IPs and states per IP that can be created (and do so across relevant
rules), but it's hard to come up with sensible numbers for both
that won't block legitimate traffic while also not letting people
blow out your state table.
(I assume without checking that you can set all of max,
max-src-nodes, and max-src-states, and then have the total number
of state entries limited by max instead of the product of the latter
two. This could be useful if you want some per-IP firewall limits in
addition to the total state limit, perhaps to ensure that one or a few
IPs can't eat up all of the total allowed states.)
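As a sketch, such a rule might look like this in pf.conf (the numbers and the $dns_servers macro here are made-up illustrations, not recommendations; max caps the total states the rule can create, while the max-src-* options add per-source limits):

```
pass in on egress inet proto udp from any to $dns_servers port domain \
        keep state (max 20000, max-src-nodes 2000, max-src-states 50)
```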
All of this is surprising if you're thinking of rate limiting and denial of service issues from the normal perspective of services on your hosts (such as DNS servers, or even web servers). In the host services world, if you reject or drop traffic through rate limiting, you're done with the traffic and you don't need to worry further (okay, yes, SYN cookies for TCP connection attempt traffic floods, but most things do that automatically today). But your OpenBSD PF firewall is still keeping state for that traffic your host rate-limited or dropped, and that state can (and will) add up, especially for UDP traffic.
2023-10-05
X's two ways to send events to X clients (more or less)
Once upon a time, the X11
protocol was shiny and new. One small part of the protocol (and the
client library) was a way for one client program to send X events,
such as key or mouse presses (or more routinely expected things),
to another program. In Xlib, this
is XSendEvent()
(also).
When the target X client receives this event, it will have a special
flag set in the X event structure to
signal that it comes from a SendEvent request
from another client. Such events are normally called synthetic
events, because they were created synthetically by another X client,
instead of naturally by the X server.
X11 wasn't (and isn't) the most secure windowing system (in an access control sense), with it being pretty easy for people to connect X clients to your X server session. Partly because of this, X programs like xterm either started out being able to ignore such synthetic events (for key and mouse events, at least) or soon added this feature. As covered in the xterm manual page, this is allowSendEvents and is described this way:
Specifies whether or not synthetic key and button events (generated using the X protocol SendEvent request) should be interpreted or discarded. The default is "false" meaning they are discarded. Note that allowing such events would create a very large security hole, therefore enabling this resource forcefully disables the allowXXXOps resources. The default is "false".
If you hang around people who automate things in their X session, you may have heard of xdotool. If you've tried it, you may have noticed that xdotool seems pretty successful in manipulating the windows of X programs, despite the general feelings about SendEvents, and so you might wonder what's going on here. The answer is that xdotool (and other automation programs) use a second mechanism to inject synthetic events, the XTEST extension (protocol). The original purpose of this extension is, to quote its documentation:
This extension is a minimal set of client and server extensions required to completely test the X11 server with no user intervention.
Events injected through XTEST don't carry the 'SendEvents mark of shame', and so programs like xterm won't automatically reject them. However, due to its origins (and also probably security concerns), XTEST has certain limitations, and so in some circumstances xdotool has to fall back to (X)SendEvent(s) and suffer the mark of shame and perhaps having things not work.
A nice description of the situation is in the xdotool manual page's SENDEVENT NOTES section:
If you are trying to send key input to a specific window, and it does not appear to be working, then it's likely your application is ignoring the events xdotool is generating. This is fairly common.
Sending keystrokes to a specific window uses a different API than simply typing to the active window. If you specify 'xdotool type --window 12345 hello' xdotool will generate key events and send them directly to window 12345. However, X11 servers will set a special flag on all events generated in this way (see XEvent.xany.send_event in X11's manual). Many programs observe this flag and reject these events.
It is important to note that for key and mouse events, we only use XSendEvent when a specific window is targeted. Otherwise, we use XTEST.
If you don't target a specific window and use XTEST, it's like you typed the keys at your keyboard. The 'typed' keys go to whatever window has keyboard focus at the time xdotool runs, or into the void if no window has keyboard focus at the time. With SendEvent you can type the keys to a specific identified window no matter what else is going on, but interesting programs will probably ignore you.
(Even if programs don't ignore you because of the SendEvent mark, they may ignore you for other reasons. For example, Gnome Terminal appears to accept SendEvent keyboard input, but only if it currently has keyboard focus.)
(This entry is a variation on part of something I wrote recently on the fvwm mailing list.)
2023-09-29
Understanding the NMH repl command's '-cc me' and '-nocc me' options
Suppose, not hypothetically, that you use NMH as your mail client and that you would
like to cc: yourself on all of the mail you send; this is what I
do. It's relatively easy to set this up for the NMH comp
command,
which creates new messages. There are a number of approaches and
it's easy to understand all except the most complex ones, and since
you have to create the complex ones yourself, presumably anyone who
can set it up knows what they're doing. However, understanding what
you can do and how it works with the NMH repl
command
for replying to mail is not so straightforward or helpful, and for
years I've been not really understanding what I was doing with it.
(Conventionally NMH people who want to keep a copy of all of their email use 'Fcc:' to automatically file a copy in an NMH folder, but this has various issues similar to the IMAP Sent folder situation.)
When repl replies to a message, it has a set of options to control what additional addresses are included in the message; these are the '-cc all/to/cc/me' and their -nocc versions. What '-cc me' does seems conceptually simple (and it sounds like what I want), but the actual reality of both what it does and how it works is not. If you look at the repl 'replcomps' or 'replgroupcomps' file (the default one is normally in /etc/nmh), you will see a complex mh-format tangle. What this tangle does for the cc: header is it unconditionally takes a bunch of different (potential) sets of addresses and puts them into a bucket, de-duplicating addresses as it goes:
%(formataddr{to}) %(formataddr{cc}) %(formataddr(localmbox))
In effect, what repl does to implement -cc and -nocc is that it filters the resulting collection of addresses (I don't know if this is the literal implementation). An address will be in the actual reply's cc: only if it was in To:, cc: or your local address and the relevant command line switch was given (an option not selected defaults to off, so a bare '-cc to' is normally equivalent to '-cc to -nocc cc -nocc me'). So the repl components file is always trying to cc you (that's the 'localmbox' bit), but whether or not it succeeds depends on if '-cc me' is in effect.
If you specify '-cc me', then your local address (your primary address) is guaranteed to be included in the cc: list. Normally it will be included only once, even if it already appears in To:, cc:, or both; the addresses will be de-duplicated as they're added to the collection, including the addition at the end of 'localmbox' (your local address). If '-nocc me' is in effect, both your local address and any Alternate-Mailboxes addresses from your mh-profile will be filtered out of cc: (and in fact To: too).
Unfortunately, repl provides no way to either remove only your alternate mailboxes (while still cc'ing your primary one) or not cc: your primary mailbox if any of your alternate mailboxes are already cc'd. You can either cc: yourself and any alternate mailboxes that are already present, or remove everything; alternate mailboxes effectively only get used for '-nocc me', not for '-cc me'. If you almost never want to use '-nocc me', this means that it may not be too useful to list your alternate mailboxes in your .mh_profile. There's also no way to leave any existing cc's of yourself intact while not adding a new one if there wasn't one already; either '-nocc me' will remove any existing ones, or '-cc me' will add one if it's not already there.
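As I understand it, the overall effect is 'collect and de-duplicate, then filter'. The following is a toy model of the behavior described above, not nmh's actual implementation (all names here are made up):

```python
def reply_cc(to_addrs, cc_addrs, localmbox, alternates, cc_me):
    """Toy model of how repl appears to build a reply's cc: list."""
    # The replcomps %(formataddr ...) steps: gather To:, cc:, and
    # your local mailbox into one bucket, de-duplicating as we go.
    seen = set()
    bucket = []
    for addr in list(to_addrs) + list(cc_addrs) + [localmbox]:
        key = addr.lower()
        if key not in seen:
            seen.add(key)
            bucket.append(addr)

    # '-cc me' keeps everything (your primary address is guaranteed
    # to be present, once); '-nocc me' filters out your primary
    # address *and* your Alternate-Mailboxes.
    if cc_me:
        return bucket
    mine = {localmbox.lower()} | {a.lower() for a in alternates}
    return [a for a in bucket if a.lower() not in mine]


# '-cc me': you get cc'd once, even though you're already in cc:
print(reply_cc(["alice@example.com"], ["me@example.com"],
               "me@example.com", ["me@alt.example.org"], cc_me=True))
# '-nocc me': your primary and alternate addresses are stripped
print(reply_cc(["alice@example.com"], ["me@alt.example.org"],
               "me@example.com", ["me@alt.example.org"], cc_me=False))
```

In this model you can see the asymmetry the entry complains about: alternates only matter for the filtering step ('-nocc me'), never for the collecting step.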
(If you try to strip your alternate mailboxes with '-nocc me -cc me', the '-nocc me' has no effect; it turns off '-cc me' and then you turn it back on.)
Because 'repl -group' implies '-cc all' as the default, which includes '-cc me', normally repl will explicitly add a cc: to you in all such group replies, in addition to whoever the original message was explicitly to or cc'd to. As mentioned earlier, there's no real neutral option even without this default '-cc all', since anything not explicitly mentioned as a -cc is (explicitly) filtered out. Repl just doesn't want to leave this alone.
(If you want repl to leave this alone and never want to cc yourself, you need a custom components file that removes the '%(formataddr(localmbox))' bit, and then always supply '-cc me' either implicitly or explicitly to keep repl from removing your addresses from the to and cc address lists.)
If you just want to get an exact copy of the same email everyone else got but don't care about your address appearing in the headers, you can probably use the under-documented NMH 'Dcc:' header, which does what normal mail agents use 'Bcc:' for. NMH has a 'Bcc:', but that doesn't get you a verbatim copy of the message because NMH has opinions there (see send and especially post). You'll need custom 'replcomps' and 'replgroupcomps' files that add 'Dcc: ...' headers. If you do this you probably want to use '-nocc me' all the time, to strip your addresses from the explicit headers.
(I recently did a bunch of experimentation to understand this as part of trying to improve my MH-E environment (cf), so I want to write it down while I remember it. In the end I was able to hack together some elisp magic for MH-E to mostly do what I want, although the result is imperfect.)
2023-09-19
How Unix shells used to be used as an access control mechanism
Once upon a time, one of the ways that system administrators controlled who could log in to what server was by assigning special administrative shells to logins, either on a particular system or across your entire server fleet. Today, special shells (mostly) aren't an effective mechanism for this any more, so modern Unix people may not have much exposure to this idea. However, vestiges of this live on in typical Unix configurations, in the form of /sbin/nologin (sometimes in /usr/sbin) and how many system accounts have this set as their shell in /etc/passwd.
The normal thing for /sbin/nologin to do when run is to print
something like 'This account is currently not available.' and exit
with status 1 (in a surprising bit of cross-Unix agreement, all of
Linux, FreeBSD, and OpenBSD nologin
appear to print exactly the
same message). By making this the shell of some account, anything
that executes an account's shell as part of accessing it will fail,
so login (locally or over SSH) and a normal su
will both fail.
Typical versions of su
usually have special features to keep you
from overriding this by supplying your own shell (often involving
/etc/shells, or deliberately running the
/etc/passwd shell for the user). Otherwise, there
is nothing that prevents processes from running under the login's
UID, and in fact it's extremely common for such system accounts to
be running various processes.
Unix system administrators have long used this basic idea for their own purposes, creating their own fleet of administrative shells to, for example, tell you that a machine was only accessible by staff. You would then arrange for all non-staff logins on the machine to have that shell as their login shell (there might be such logins if, for example, the machine is a NFS fileserver). Taking the idea one step further, you might suspend accounts before deleting them by changing the account's shell to an administrative shell that printed out 'your account is suspended and will be deleted soon, contact <X> if you think this is a terrible mistake' and then exited. In an era when everyone accessed your services by logging in to your machines through SSH (or earlier, rlogin and telnet), this was an effective way of getting someone's attention and a reasonably effective way of denying them access (although even back then, the details could be complex).
(Our process for disabling accounts gives such accounts a special shell, but it's mostly as a marker for us for reasons covered in that entry.)
You could also use administrative shells to enforce special actions when people logged in. For example, newly created logins might be given a special shell that would make them agree to your usage policies, force them to change their password, and then through magic change their shell to a regular shell. Some of this could be done through existing system features (sometimes there was a way to mark a passwd entry so that it forced an immediate password change), but generally not all of it. Again, this worked well when you could count on people starting using your systems by logging in at the Unix level (which generally is no longer true).
Sensible system administrators didn't try to use administrative shells to restrict what people could do on a machine, because historically such 'restricted shells' had not been very successful at being restrictive. Either you let someone have access or you didn't, and any 'restriction' was generally temporary (such as forcing people to do one time actions on their first login). Used this way, administrative shells worked well enough that many old Unix environments accumulated a bunch of local ones, customized to print various different messages for various purposes.
PS: One trick you could do with some sorts of administrative shells was make them trigger alarms when run. If some people were not really supposed to even try to log in to some machine, you might want to know if someone tried. One reason this is potentially an interesting signal is that anyone who gets as far as running a login shell definitely knows the account's password (or otherwise can pass your local Unix authentication).
(These days I believe this would be considered a form of 'canary token'.)
2023-09-10
The roots of an obscure Bourne shell error message
Suppose that you're writing Bourne shell code that involves using some commands in a subshell to capture some information into a shell variable, 'AVAR=$(....)', but you accidentally write it with a space after the '='. Then you will get something like this:
$ AVAR= $(... | wc -l)
sh: 107: command not found
So, why is this an error at all, and why do we get this weird and obscure error message? In the traditional Unix and Bourne shell way, this arises from a series of decisions that were each sensible in isolation.
To start with, we can set shell variables and their grown-up friends, environment variables, with 'AVAR=value' (note the lack of spaces). You can erase the value of a shell variable (but not unset it) by leaving the value out, 'AVAR='. Let's illustrate:
$ export FRED=value
$ printenv | fgrep FRED
FRED=value
$ FRED=
$ printenv | fgrep FRED
FRED=
$ unset FRED
$ printenv | fgrep FRED
$ # ie, no output from printenv
Long ago, the Bourne shell recognized that you might want to only temporarily set the value of an environment variable for a single command. It was decided that this was a common enough thing that there should be a special syntax for it:
$ PATH=/special/bin:$PATH FRED=value acommand
This runs 'acommand' with $PATH changed and $FRED set to a value, without changing (or setting) either of them for anything else. We have now armed one side of our obscure error, because if we write 'AVAR= ....' (with the space), the Bourne shell will assume that we're temporarily erasing the value of $AVAR (or setting it to a blank value) for a single command.
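As a quick illustrative sketch of this temporary-setting behavior (using 'sh -c' as a stand-in child command so it can report what it sees in its environment):

```shell
#!/bin/sh
FRED=outer
# Set $FRED only for this one command; the child process sees 'inner'.
FRED=inner sh -c 'echo child sees: $FRED'
# Our own $FRED is untouched by the temporary setting.
echo parent sees: $FRED
```

The child prints 'child sees: inner' while the parent still prints 'parent sees: outer'.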
The second part is that the Bourne shell allows commands to be run to be named through indirection, instead of having to be written out directly and literally. In Bourne shell, you can do this:
$ cmd=echo; $cmd hello world
hello world
$ cmd="echo hi there"; $cmd
hi there
The Bourne shell doesn't restrict this indirection to direct expansion of environment variables; any and all expansion operations can be used to generate the command to be run and some or all of its arguments. This includes subshell expansion, which is written either as $(...) in the modern way or as `...` in the old way (those are backticks, which may be hard to see in some fonts). Doing this even for '$(...)' is reasonably sensible, probably sometimes useful, and definitely avoids making $(...) a special case here.
So now we have our perfect storm. If you write 'AVAR= $(....)', the Bourne shell first sees 'AVAR=' (with the space) and interprets it as you running some command with $AVAR set to a blank value. Then it takes the '$(...)' and uses it to generate the command to run (and its command line). When your subshell prints out its results, for example the number of lines reported by 'wc -l', the Bourne shell will try to use that as a command and fail, resulting in our weird and obscure error message. What you've accidentally written is similar to:
$ cmd=$(... | wc -l)
$ AVAR= $cmd
(Assuming that the $(...) subshell doesn't do anything different based on $AVAR, which it probably doesn't.)
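To see concretely that the subshell's output becomes the command that gets run, we can make it print the name of a real command. This is just a sketch; 'echo echo hello' here is an arbitrary stand-in for the accidental '$(... | wc -l)':

```shell
#!/bin/sh
# The subshell expands to the words 'echo hello', which the shell then
# runs as a command, with $AVAR set to empty in its environment.
AVAR= $(echo echo hello)
# prints: hello
```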
It's hard to see any simple change in the Bourne shell that could avoid this error, because each of the individual parts is sensible in isolation. It's only when they combine together like this that a simple mistake compounds into a weird error message.
(The good news is that shellcheck warns about both parts of this, in SC1007 and SC2091.)
2023-09-07
(Unix) Directory traversal and symbolic links
If and when you set out to traverse through a Unix directory hierarchy, whether to inventory it or to find something, you have a decision to make. I can put this decision in technical terms, about whether you use stat() or lstat() when identifying subdirectories in your current directory, or put it non-technically, about whether or not you follow symbolic links that happen to point to directories. As you might guess, there are two possible answers here and neither is unambiguously wrong (or right). Which answer programs choose depends on their objectives and their assumptions about their environment.
The safer decision is to not follow symbolic links that point to directories, which is to say to use lstat() to find out what is and isn't a directory. In practice, a Unix directory hierarchy without symbolic links is a finite (although possibly large) tree without loops, so traversing it is going to eventually end and not have you trying to do infinite amounts of work. Partly due to this safety property, most standard language and library functions to walk (traverse) filesystem trees default to this approach, and some may not even provide for following symbolic links to directories. Examples are Python's os.walk(), which defaults to not following symbolic links, and Go's filepath.WalkDir(), which doesn't even provide an option to follow symbolic links.
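You can see both choices in action with find(1): by default it doesn't follow symbolic links to directories (the lstat() behavior), while 'find -L' does. A small sketch (the tree built under mktemp is just for illustration):

```shell
#!/bin/sh
# Build a small tree with a symlink pointing at a directory.
d=$(mktemp -d)
mkdir "$d/real"
touch "$d/real/file"
ln -s real "$d/link"

# Default find: 'link' is listed but not descended into (4 entries).
find "$d" | wc -l
# find -L: 'link' is treated as a directory and descended into (5 entries).
find -L "$d" | wc -l

rm -rf "$d"
```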
(In theory you can construct both concrete and virtual filesystems that either have loops or, for virtual filesystems, are simply endless. In practice it is a social contract that filesystems don't do this, and if you break the social contract in your filesystem, it's considered your fault when people's programs and common Unix tools all start exploding.)
If a program follows symbolic links while walking directory trees, it can be for two reasons. One of them is that the program wrote its own directory traversal code and blindly used stat() instead of lstat(). The other is that it deliberately decided to follow symbolic links for flexibility. Following symbolic links is potentially dangerous, since they can create loops, but it also allows people to assemble a 'virtual' directory tree where the component parts of it are in different filesystems or different areas of the same filesystem. These days you can do some of this with various sorts of 'bind' or 'loopback' mounts, but they generally have more limitations than symbolic links do and often require unusual privileges to set up. Anyone can make symbolic links to anything, which is both their power and their danger.
(Except that sometimes Linux and other Unixes turn off your ability to make symlinks in some situations, for security reasons. These days the Linux sysctl is fs.protected_symlinks, and your Linux probably has it turned on.)
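On a Linux machine you can check this setting directly (assuming /proc is mounted; a value of 1 means the protection is on):

```shell
# Read the symlink-protection sysctl; typically prints 1 on modern
# distributions, 0 if the protection is disabled.
cat /proc/sys/fs/protected_symlinks
```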
Programs that follow symbolic links during directory traversal aren't wrong, but they are making a more dangerous choice and one hopes they did it deliberately. Ideally such a program might have some safeguards, even optional ones, such as aborting if the traversal gets too deep or appears to be generating too many results.
PS: You may find the OpenBSD symlink(7) manual page interesting reading on the general topic of following or not following symbolic links.
2023-08-31
The technical merits of Wayland are mostly irrelevant
Today I read Wayland breaks your bad software (via), which is in large part an inventory of how Wayland is technically superior to X. I don't particularly disagree with Wayland's general technical merits and improvements, but at this point I think that they are mostly irrelevant. As such, I don't think that talking about them will do much to shift more people to Wayland.
(Of course, people have other reasons to talk about Wayland's technical merits. But a certain amount of this sort of writing seems to be aimed at persuading people to switch.)
I say that the technical merits are irrelevant because I don't believe that they're a major factor any more in most people moving or not moving to Wayland. At this point in time (and from my vantage point), there are roughly four groups of people still in the X camp:
- People on Unix environments that don't have Wayland support. They have
to use X or not have a graphical experience.
(Suggesting that these people change to a Linux environment with Wayland support is a non-starter; they are presumably using their current environment for good reasons.)
- People using mainstream desktop environments that already support
Wayland, primarily GNOME and KDE, in relatively stock ways.
Switching to Wayland is generally transparent for these people
and happens when their Linux distribution decides to change the
default for their hardware. If their Linux distribution has not
switched the default, there is often good reason for it.
Most of these people will switch over time as their distribution changes defaults, and they're unlikely to switch before then.
- People using desktop environments
or custom X setups that don't (currently) support Wayland. Switching
to Wayland is extremely non-transparent for these people because
they will have to change their desktop environment (so far, to
GNOME or KDE) or reconstruct a Wayland version of it.
Back in 2021, this included XFCE and Cinnamon,
and based on modest Internet searches I believe it still does.
One can hope that some of these desktop environments will get Wayland support over time, moving people using them up into the previous category (and probably moving them to Wayland users). However the primary bottleneck for this is probably time and attention from developers (who by now probably have heard lots about why people think they should add support for Wayland and its technical merits).
- People who could theoretically switch to Wayland and who might gain benefits from doing so, but who have found good reasons (often related to hardware support) that X works better for them (cf some of the replies to my Fediverse post).
(There are other smaller groups not included here, such as people who have a critical reliance on X features not yet well supported in Wayland.)
With only a slight amount of generalization, none of these people will be moved by Wayland's technical merits. The energetic people who could be persuaded by technical merits to go through switching desktop environments or in some cases replacing hardware (or accepting limited features) have mostly moved to Wayland already. The people who remain on X are there either because they don't want to rebuild their desktop environment, they don't want to do without features and performance they currently have, or their Linux distribution doesn't think their desktop should switch to Wayland yet.
There are still some people who would get enough benefit from what Wayland improves over X that it would be worth their time and effort to switch, even at the cost of rebuilding their desktop environment (and possibly losing some features, because there are things that X does better than Wayland today). But I don't think there are very many of them left by now, and if they're out there, they're hard to reach, since the Wayland people have been banging this drum for quite a while now.
(My distant personal view of Wayland hasn't changed since 2021, since Cinnamon still hasn't become Wayland enabled as far as I know.)
PS: The other sense that Wayland's technical merits are mostly irrelevant is that everyone agrees that Wayland is the future of Unix graphics and development of the X server is dead. Unless and until people show up to revive X server development, Wayland is the only game in town, and when you have a monopoly, your technical merits don't really matter.