Wandering Thoughts

2017-09-30

The origin of POSIX as I learned the story (or mythology)

I recently wound up rereading Jeremy Allison's A Tale of Two Standards (via Everything you never wanted to know about file locking), which tells an origin story for the POSIX standard for Unix in which the standard was driven by ISVs who wanted a common 'Unix' API that they could write their products to, so those products would be portable across all the various Unix versions. It's quite likely that this origin story is accurate, and certainly the divergence in Unixes irritated ISVs (and everyone else) at the time. However, it is not the origin mythology for POSIX that I learned during my early years with Unix, so here is the version I learned.

During the mid and late 1980s, the US government had a procurement problem; it wanted to buy Unix systems, but perfectly sensible procurement rules made this rather hard. If it tried to simply issue a procurement request to buy from, say, Sun, companies like SGI and DEC and so on would naturally object and demand answers for how and why the government had decided that their systems wouldn't do. If the government expanded the procurement request to include other Unix vendors so they could also bid on it (saying 'any workstation with these hardware specifications' or the like), people like IBM or DEC would demand answers for why their non-Unix systems wouldn't do. And if the government said 'fine, we want Unix systems', it was faced with the problem of actually describing what Unix was in the procurement request (ideally in a form that was vendor neutral, since procurement rules frown on supposedly open requests that clearly favour one vendor or a small group of them).

This government procurement problem is not unique to Unix, and the usual solution to it is a standard. Once the government has a standard, either of its own devising or specified by someone else, it can simply issue a procurement request saying 'we need something conforming to standard X', and in theory everyone with a qualifying product can bid and people who don't have such a product have no grounds for complaint (or at least they have less grounds for complaint; they have to try to claim you picked the wrong standard or an unnecessary standard).

Hence, straightforwardly, POSIX, and also why Unix vendors cared about POSIX as much as they did at the time. It wasn't just to make the life of ISVs easier; it was also because the government was going to be specifying POSIX in procurement bids, and most of the Unix vendors didn't want to be left out. In the process, POSIX painstakingly nailed down a great deal of what the 'Unix' API is (not just at the C level but also for things like the shell and command environment), invented some genuinely useful things, and pushed towards creating and standardizing some new ideas (POSIX threading, for example, was mostly not standardizing existing practice).

PS: You might wonder why the government didn't just say 'must conform to the System V Interface Definition version N' in procurement requests. My understanding is that procurement rules frown on single-vendor standards, and that was what the SVID was; it was created by and for AT&T. Also, at the time requiring the SVID would have left out Sun and various other people that the government probably wanted to be able to buy Unixes from.

(See also the Wikipedia entry on the Unix wars, which has some useful chronology.)

POSIXOriginStory written at 20:51:12

2017-09-29

Shell builtin versions of standard commands have drawbacks

I'll start with a specific illustration of the general problem:

bash# kill -SIGRTMIN+22 1
bash: kill: SIGRTMIN+22: invalid signal specification
bash# /bin/kill -SIGRTMIN+22 1
bash#

The first thing to note is that yes, this is Linux being a bit unusual. Linux has significantly extended the usual range of Unix signal numbers to include the POSIX.1-2001 realtime signals, and what SIGRTMIN actually is can vary depending on how a system is set up. Once Linux had these extra signals (defined the way they are), people sensibly added support for them to versions of kill. All of this is perfectly in accord with the broad Unix philosophy; of course if you add a new facility to the system, you want to expose it to shell scripts when that's possible.

Then along came Bash. Bash is cross-Unix, and it has a builtin kill command, and for whatever reason the Bash people didn't modify Bash so that on Linux it would support the SIGRTMIN+<n> syntax (some possible reasons for that are contained in this sentence). The result is a divergence between the behavior of Bash's kill builtin and the real kill program, one that has become increasingly relevant now that programs like systemd are taking advantage of the extra signals to allow you to control more of their operations by sending them more signals.

Of course, this is a generic problem with shell builtins that shadow real programs in any (and all) shells; it's not particularly specific to Bash (zsh also has this issue on Linux, for example). There are advantages to having builtins, including builtins of things like kill, but there are also drawbacks. How best to fix or work around them isn't clear.
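
As a practical matter, you can always bypass a builtin when you need the features of the real command. A quick sketch of the usual options in Bash (the path to the external kill may differ on your system, and 'enable -n' is Bash-specific):

bash# type -a kill
kill is a shell builtin
kill is /bin/kill
bash# /bin/kill -SIGRTMIN+22 1
bash# env kill -SIGRTMIN+22 1
bash# enable -n kill        # turn off the builtin for this session
bash# kill -SIGRTMIN+22 1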

(kill is often a builtin in shells with job control, Bash included, so that you can do 'kill %<n>' and the like. Things like test are often made builtins for shell script speed, although Unixes can take that too far.)

PS: Certainly one answer is 'have Bash implement the union of all special kill, test, and so on features from all Unixes it runs on', but I'm not sure that's going to work in practice. And Bash is just one of several popular shells, all of which would need to keep up with things (or at least people probably want them to do so).

BashKillBuiltinDrawback written at 21:40:28

2017-09-23

A clever way of killing groups of processes

While reading parts of the systemd source code that handle late stage shutdown, I ran across an oddity in the code that's used to kill all remaining processes. A simplified version of the code looks like this:

void broadcast_signal(int sig, [...]) {
   [...]
   kill(-1, SIGSTOP);

   killall(sig, pids, send_sighup);

   kill(-1, SIGCONT);
   [...]
}

(I've removed error checking and some other things; you can see the original here.)

This is called to send signals like SIGTERM and SIGKILL to everything. At first the use of SIGSTOP and SIGCONT puzzled me, and I wondered if there was some special behavior in Linux if you SIGTERM'd a SIGSTOP'd process. Then the penny dropped; by SIGSTOPing processes first, we're avoiding any thundering herd problems when processes start dying.

Even if you use kill(-1, <signal>), the kernel doesn't necessarily guarantee that all processes will receive the signal at once before any of them are scheduled. So imagine you have a shell pipeline that's remained intact all the way into late-stage shutdown, and all of the processes involved in it are blocked:

proc1 | proc2 | proc3 | proc4 | proc5

It's perfectly valid for the kernel to deliver a SIGTERM to proc1, immediately kill the process because it has no signal handler, close proc1's standard output pipe as part of process termination, and then wake up proc2 because now its standard input has hit end-of-file, even though either you or the kernel will very soon send proc2 its own SIGTERM signal that will cause it to die in turn. This and similar cases, such as a parent waiting for children to exit, can easily lead to highly unproductive system thrashing as processes are woken up unnecessarily. And if a process has a SIGTERM signal handler, the kernel will of course schedule it to wake up and may start it running immediately, especially on a multi-core system.

Sending everyone a SIGSTOP before the real signal completely avoids this. With all processes suspended, all of them will get your signal before any of them can wake up from other causes. If they're going to die from the signal, they'll die on the spot; if they're not going to die (because you're starting with SIGTERM or SIGHUP and they block or handle it), they'll only get woken up at the end, after most of the dust has settled. It's a great solution to a subtle issue.

(If you're sending SIGKILL to everyone, most or all of them will never wake up; they'll all be terminated unless something terrible has gone wrong. This means this SIGSTOP trick avoids ever having any of the processes run; you freeze them all and then they die quietly. This is exactly what you want to happen at the end of system shutdown.)
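
On a much smaller scale you can play with the same stop-first pattern yourself, aimed at a single process group instead of at every process on the system (a kill -1 broadcast from a login shell would take down far more than you want). A sketch, assuming an interactive shell with job control so the background pipeline gets its own process group:

sleep 300 | cat &
pgid=$(ps -o pgid= -p $! | tr -d ' ')

# freeze the whole pipeline, signal it, then let the survivors (if any) run
kill -STOP -- -$pgid
kill -TERM -- -$pgid
kill -CONT -- -$pgid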

ProcessKillingTrick written at 02:42:54

2017-09-13

System shutdown is complicated and involves policy decisions

I've been a little harsh lately on how systemd has been (not) shutting down our systems, and certainly it has some issues and it could be better. But I want to note that in general and in practice, shutting down a Unix system is a complicated thing that involves tradeoffs and policy decisions; in fact I maintain that it's harder than booting the system. Further, the more full-featured you attempt to make system shutdown, the more policy decisions and tradeoffs you need to make.

(The only way to make system shutdown simple is to have a very minimal view of it and to essentially crash the running system, as original BSD did. This is a valid choice and certainly systems should be able to deal with abrupt crashes, since they do happen, but it isn't necessarily a great one. Your database can recover after a crash-stop, but it will probably be happier if you let it shut down neatly and it may well start faster that way.)

One of the problems that makes shutdown complicated is that on the one hand, stopping things can fail, and on the other hand, when you shut down the system you want (and often need) it to actually go down, so overall system shutdown can't fail. Reconciling these conflicting facts requires policy decisions, because there is no clear universal technical answer for what you do if a service shutdown fails (ie the service process or processes remain running), or a filesystem can't be unmounted, or some piece of hardware says 'no, I am not detaching and shutting down'. Do you continue on with the rest of the shutdown process and try again later? Do you start killing processes that might be holding things busy? What do you do about your normal shutdown ordering requirements, for example do you block further services and so on from shutting down just yet, or do you continue on (and perhaps let them make their own decisions about whether they can shut down)?

There are no one size fits all answers to these questions and issues, especially if the init system is essentially blind to the specific nature of the services involved and treats them as generic 'services' with generic 'shutdown' actions. Even in an init system where the answers to these questions can be configured on a per-service or per-item basis, someone has to do that configuration and get it right (which may be complicated by an init system that doesn't distinguish between the different contexts of stopping a specific service, which means that you get to pick your poison).

While it's not trivial, it's not particularly difficult for an init system to reliably shut down machines if and when all of the individual service and item shutdowns go fine and all of the dependencies are fully expressed (and correct), so that everything is stopped in the right order. But this is the easy case. The hard case for all init systems is when something goes wrong, and many init systems have historically had issues here.

(Many implementations of System V init would simply stall the entire system shutdown if an '/etc/init.d/<whatever> stop' operation hung, for example.)

PS: One obvious pragmatic question and problem is how and when you give up on an orderly shutdown of a service and (perhaps) switch over to things like killing processes. Services may legitimately take some time to shut down, in order to flush out data, close databases properly, and so on, but they can also hang during shutdown for all sorts of reasons. This is especially relevant in any init system that shuts down multiple services in parallel, because each service being shut down could suddenly want a bunch of resources.

(One of the fun cases is where you have heavyweight daemons that are all inactive and paged out of RAM, and you ask them to do an orderly shutdown, which suddenly causes everything to try to page back in to your limited RAM. I've been there in a similar situation.)

ShutdownComplicated written at 01:47:11

2017-09-12

The different contexts of stopping a Unix daemon or service

Most Unix init systems have a single way of stopping a daemon or a service, and on the surface this feels correct. And mostly it is, and mostly it works. However, I've recently come around to believing that this is a mistake and an over-generalization. I now believe that there are three different contexts and you may well want to stop things somewhat differently in each, depending on the daemon or service. This is especially the case if the daemon spawns multiple and somewhat independent processes as part of its operation, but it can happen in other situations as well, such as the daemon handling relatively long-running requests. To make this concrete I'm going to use the case of cron and long-running cron jobs, as well as Apache (or the web server of your choice).

The first context of stopping a daemon is a service restart, for example if the package management system is installing an updated version. Here you often don't want to abruptly stop everything the daemon is running. In the case of cron, you probably don't want a daemon restart to kill and perhaps restart all currently running cron jobs; for Apache, you probably want to let current requests complete, although this depends on what you're doing with Apache and how you have it configured.

The second context is taking down the service with no intention to restart it in the near future. You're stopping Apache for a while, or perhaps shutting down cron during a piece of delicate system maintenance, or even turning off the SSH daemon. Here you're much more likely to want running cron jobs, web requests, and even SSH logins to shut down, although you may want the init system to give them some grace time. This may actually be two contexts, one where you want a relatively graceful stop versus one where you really want an emergency shutdown with everything screeching to an immediate halt.

The third context is stopping the service during system shutdown. Here you unambiguously want everything involved with the daemon to stop, because everything on the system has to stop sooner or later. You almost always want everything associated with the daemon to stop as a group, more or less at the same time; among other reasons this keeps shutdown ordering sensible. If you need Apache to shut down before some backend service, you likely don't want lingering Apache sub-processes hanging around just because their request is taking a while to finish. Or at a minimum you don't want Apache to be considered 'down' for shutdown ordering until the last little bits die off.

As we see here, the first and the third context can easily conflict with each other; what you want for service restart can be the complete opposite of what you want during system shutdown. And an emergency service stop might mean you want an even more abrupt halt than you do during system shutdown. In hindsight, trying to treat all of these different contexts the same is over-generalization. The only time when they're all the same is when you have a simple single-process daemon, at which point there's only ever one version of shutting down the daemon; if the daemon process isn't running, that's it.

(As you might suspect, these thoughts are fallout from our Ubuntu shutdown problems.)

PS: While not all init systems are supervisory, almost all of them include some broad idea of how services are stopped as well as how they're started. System V init is an example of a passive init system that still has a distinct and well defined process for shutting down services. The one exception that I know of is original BSD, where there was no real concept of 'shutting down the system' as a process; instead reboot simply terminated all processes on the spot.

ThreeTypesOfServiceStop written at 01:12:41

2017-08-20

The surprising longevity of Unix manpage formatting

As part of writing yesterday's entry, I wound up wanting to read the 4.3 BSD ifconfig manpage, which is online as part of the 4.3 BSD source tree at tuhs.org. More exactly, I wanted to see more or less how it had originally looked in formatted form, because in source form the bit I was interested in wasn't too readable:

.TP 15
.BI netmask " mask"
(Inet only)
Specify how much of the address to reserve for subdividing
networks into sub-networks.
[...]

If I wrote and dealt with Unix manpages more than occasionally, perhaps I could read this off the top of my head, but as it is, I'm not that familiar with the raw *roff format of manpages. So I decided to start with the simplest, most brute force way of seeing a formatted version. I copied the raw text into a file on my Linux machine and then ran 'man -l' on it. What I hoped for was something that wasn't too terribly mangled, so that I could more or less guess at the original formatting. What I got was a manpage that was almost completely intact (and possibly it's completely intact).
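
If you want to do the same thing, it's basically a one-liner; here's a sketch, assuming you've saved the raw source as ifconfig.8 (the filename is arbitrary, 'man -l' is the GNU man-db spelling, and you can always go through nroff directly):

man -l ifconfig.8
nroff -man ifconfig.8 | less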

To me, this is surprising and impressive longevity in manpage formatting. The 4.3 BSD ifconfig manpage dates from 1986, so that's more than 30 years of compatibility, and we can go back even further; it appears that V7 manpages (such as the one for ls) still format fine.

(V6 manpages are where things break, because apparently the man *roff macros were changed significantly between V6 and V7. One of the visible signs of this is that many of the macros were upper-cased; the V6 ls manpage has things like .th instead of the V7 .TH.)

I'm not going to speculate on reasons for why the man macros have been so stable for so long, but one thing it suggests to me is that the initial V7 set probably didn't have anything particularly wrong with it. On the other hand, the BSD people did build a completely new set of manpage macros in 4.3 Reno and Net/2, the mdoc macros, which have been carried forward into the current *BSDs.

(For more on this, see both The History of Unix Manpages and Wikipedia's history of manpages.)

ManpageMacroLongevity written at 01:36:44

2017-08-19

Subnets and early Unix implementations of TCP/IP networking

If you've been involved in networking (well, Internet and IP networking at least), you've probably heard and used the terms 'subnet' and 'subnets'. As a term, subnet has a logical and completely sensible definition, and certainly the direct meaning is probably part of why we wound up with the term. If you've been around networking a while, you've probably also heard of 'CIDR' notation for networks, for example 192.168.1.0/24, and you may know that CIDR stands for Classless Inter-Domain Routing. You may even have heard of 'class C' and 'class B' networks, and had people refer to /24 CIDRs and /16 CIDRs as 'class C' and 'class B' respectively.

Back in the early days of IP, the entire IP network address space was statically divided up into a number of chunks of different sizes, and these different sizes were called the class of the particular chunk or network. When I say 'statically divided', I mean that what sort of network you had was determined by what your IP address was. If your IP address was, say, 8.10.20.30, you were in a class A network and everything in 8.*.*.* was (theoretically) on the same network as your machine. You can read the full details and the history on Wikipedia.
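
As an illustration of just how static this was, the class of an address is determined purely by its first octet. A little sketch of the standard classful table (this is the general rule, not any particular implementation):

classof() {
    first=${1%%.*}
    if   [ "$first" -lt 128 ]; then echo "$1: class A (net/8)"
    elif [ "$first" -lt 192 ]; then echo "$1: class B (net/16)"
    elif [ "$first" -lt 224 ]; then echo "$1: class C (net/24)"
    else                            echo "$1: class D or E"
    fi
}

classof 8.10.20.30     # class A: all of 8.*.*.* is 'your' network
classof 128.100.3.30   # class B: all of 128.100.*.* is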

At the very beginning of Unix's support for IP networking, in 4.2 BSD, this approach was fine (and anyway it was the standard for how IP address space was divided up, so it was just how you did IP networking at the time). There were Ethernet LANs using IP (RFC 826 on ARP dates from that time), but there weren't many machines on them regardless of what class the network was. As a result of this, 4.2 BSD had no concept of network masks for interfaces. Really, it's right there in the 4.2 BSD ifconfig manpage; in 4.2 BSD, network interfaces were configured purely by specifying the interface's IP address. If the IP address was in a class A network, or a class B network, or a class C one, that was what you got; the 4.2 BSD kernel directly hard-coded how to split an arbitrary IP address into the standard (classful) network and host portions.

(See in_netof and in_lnaof in the 4.2 BSD sys/netinet/in.c.)

Naturally, this didn't last very long. The universities, research organizations, and so on that started using 4.2 BSD were also the places that got large class A (/8) and class B (/16) networks in the early days of the ARPANET, and so pretty soon they had far too many hosts to have them all on a single 10 MBit/sec LAN (especially once they had hosts in several different buildings). As a result, Unix networking (ie BSD networking) gained the concept of a netmask and subnets. The 4.3 BSD ifconfig manpage describes it this way:

netmask mask
(Inet only) Specify how much of the address to reserve for subdividing networks into sub-networks. The mask includes the network part of the local address and the subnet part, which is taken from the host field of the address. [...]

The idea here was that your university might have a class B, but your department would have what we would now call a /24 from that class B. The normal class B netmask is 255.255.0.0, but on your machines you'd set the netmask as 255.255.255.0 so they'd know that addresses outside that were not on the local network and had to be sent through the router instead of ARP'd for.
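
To make that concrete, a department machine on such a subnetted class B might have been configured something like this (the interface name and addresses here are invented, and the 4.3 BSD manpage also allowed the mask to be given in hex):

ifconfig le0 128.100.3.30 netmask 255.255.255.0
ifconfig le0 128.100.3.30 netmask 0xffffff00    # the same thing in hex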

However, the bad news is that the non-netmask version of 4.2 BSD IP networking did last long enough to get out into the field, both in real 4.2 BSD machines and in early versions of commercial Unixes like SunOS. Some of these early SunOS workstations and servers were bought by universities with class B networks that were starting to subnet them. This wound up causing the obvious fun problems, where some of your department's machines might not be able to talk to the rest of the university because they were grimly determined that they were on a class B network and so could reach every host in 128.100.*.* on the local LAN.

(They could reach hosts that weren't on 128.100.*.* just fine, at least if you configured the right gateway.)

It turns out that this history is visible in an interesting series of RFCs. RFC 917 from 1984 begins the conversation on subnets, then RFC 925 suggests extending ARP to work across multiple interconnected LANs. Subnets are formalized in RFC 950, "proxy ARP" appears in RFC 1009, and finally RFC 1027 describes how the authors used proxy ARP at the University of Texas at Austin to implement transparent subnet gateways, where hosts on your (sub)net don't have to be aware that they are on a subnet instead of the full class A or class B network that they think they're on. Transparent subnet gateways are also known as 'how you get your 4.2 BSD and SunOS 2.x hosts to talk to the rest of the university'.

(Since IP networking started out by talking about 'networks', not 'subnets', it seems highly likely that our current use of 'subnet' comes from this terminology invention and growth in the early to mid 1980s. I find it interesting that the 1986 4.3 BSD ifconfig manpage is still talking about 'sub-networks' instead of shortening it to 'subnets'.)

SubnetsAndEarlyUnixIPv4 written at 01:31:32

2017-07-17

Why I think Emacs readline bindings work better than Vi ones

I recently saw a discussion about whether people used the Emacs bindings for readline editing or the Vi bindings (primarily in shells, although there are plenty of other places that use readline). The discussion made me realize that I actually had some opinions here, and that my view was that Emacs bindings are better.

(My personal background is that vim has been my primary editor for years now, but I use Emacs bindings in readline and can't imagine switching.)

The Emacs bindings for readline aren't better because Emacs bindings are better in general (I have no opinion on that for various reasons). Instead, they're better here because the nature of Emacs bindings makes going back and forth between entering text and editing text easier, especially without errors. This is because Emacs bindings don't reuse normal characters. Vi gains a certain amount of its power and speed from reusing normal letters for editing commands (especially lower case letters, which are the easiest to type), while Emacs exiles all editing commands to special key sequences. Vi's choice is fine for large scale text editing, where you generally spend substantial blocks of time first entering text and then editing it, but it is not as great if you're constantly going back and forth over short periods of time, which is much more typical of how I do things in a single command line. The vi approach also opens you up to destructive errors if you forget that you're in editing mode. With Emacs bindings there is no such back and forth switching or confusion (well, mostly no such, as there are still times when plain letters are special or control and meta characters aren't).

Another way of putting this is that Emacs bindings at least feel like they're optimized for quickly making small edits, while vi ones feel more optimized for longer, larger-scale edits. Since typo-fixes and the like are most of what I do with command line editing, it falls into the 'small edits' camp where Emacs bindings shine.

Sidebar: Let's admit to the practical side too

Readline defaults to Emacs style bindings. If you only use a few readline programs on a few systems, it's probably no big deal to change the bindings (hopefully they all respect $HOME/.inputrc). But I'm a sysadmin, and I routinely use many systems (some of them not configured at all) as many users (me, root, and others). Trying to change all of those readline configurations is simply not feasible, plus some programs use alternate readline libraries that may not have switchable bindings.
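
For what it's worth, when you do control the environment the switch itself is trivial; readline reads $HOME/.inputrc, and Bash (like other POSIX shells) can also be flipped on the fly:

echo 'set editing-mode vi' >> ~/.inputrc
set -o vi        # current shell session only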

In this overall environment, sticking with the default Emacs bindings is far easier and thus I may be justifying to myself why it 'makes sense' to do so. I do think that Emacs bindings make quick edits easier, but to really be sure of that I'd have to switch a frequently used part of my environment to vi bindings for long enough to give it a fair shake, and I haven't ever tried that.

As a related issue, my impression is that Emacs bindings have become the default in basically anything that offers command line editing, even when it isn't using readline at all and has reimplemented line editing from scratch. This provides its own momentum for sticking with Emacs bindings, since you're going to run into them sooner or later no matter how you set your shell et al.

EmacsForReadline written at 00:24:26

2017-07-11

The BSD r* commands and the history of privileged TCP ports

Once upon a time, UCB was adding TCP/IP to (BSD) Unix. They had multiple Unix machines, and one obvious thing to want when you have networking and multiple Unix machines is a way to log in and transfer files from one machine to another. Fortunately for the UCB BSD developers, TCP/IP already had well-specified programs (and protocols) to do this, namely telnet and FTP. So all they had to do was implement telnet and FTP clients and servers and they were done, right?

The UCB BSD people did implement telnet and FTP, but they weren't satisfied with just that, apparently because neither was convenient and flexible enough. In particular, telnet and FTP have baked into them the requirement for a password. Obviously you need to ask people to authenticate themselves (with a password) when you're accepting remote connections from who knows where, but the UCB people were just logging in and transferring files and so on between their own local collection of Vaxes. So the BSD people wound up creating a better way, in the form of the r* commands: rlogin, rsh, and rcp. The daemons that implemented these were rlogind and rshd.

(See also rcmd(3).)

The authentication method was pretty simple; it relied on checking /etc/hosts.equiv to see if the client host was trusted in general, or ~/.rhosts to see if you were allowing logins from that particular remote user on the remote host. As part of this, these daemons obviously relied on the client not lying about who the remote user was (and what their login name was). How did they have some assurance about this? The answer is that the BSD developers added a hack to their TCP/UDP implementation, namely a new idea of 'privileged ports'.

Privileged ports were ports under 1024 (aka IPPORT_RESERVED), and the hack was that the kernel only allowed them to be used by UID 0. If you asked for 'give me any port', they were skipped, and if you weren't root and tried to bind(2) to such a port, the kernel rejected you. Rlogind and rshd insisted that client connections come from privileged ports, at which point they knew that it was a root-owned process talking to them from the client and they could trust its claims of what the remote login name was. Ordinary users on the client couldn't make their own connection and claim to be someone else, because they wouldn't be allowed to use a privileged port.

(Sun later reused this port-based authentication mechanism as part of NFS security.)

Based on the Unix versions available on tuhs.org, all of this appears to have been introduced in 4.1c BSD. This is the version that adds IPPORT_RESERVED to netinet/in.h and has the TCP/UDP port binding code check it in in_pcbbind in netinet/in_pcb.c. In case you think the BSD people thought that this was an elegant idea, let me show you the code:

if (lport) {
  u_short aport = htons(lport);
  int wild = 0;

  /* GROSS */
  if (aport < IPPORT_RESERVED && u.u_uid != 0)
    return (EACCES);
  [...]

The BSD people knew this was a hack; they just did it anyway, probably because it was a very handy hack in their trusted local network environment. Unix has quietly inherited it ever since.

(Privileged ports are often called 'reserved ports', as in 'reserved for root only'. Even the 4.1c BSD usage here is inconsistent; the actual #define is IPPORT_RESERVED, but things like the rlogind manpage talk about 'privileged port numbers'. Interestingly, in 4.1c BSD the source code for the r* programs is hanging out in an odd place, in /usr/src/ucb/netser, along with several other things. By the release of 4.2 BSD, they had all been relocated to /usr/src/ucb and /usr/src/etc, where you'd expect.)

PS: Explicitly using a privileged port when connecting to a server is one of the rare cases when you need to call bind() for an outgoing socket, which is usually something you want to avoid.
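
As a sketch of the privilege check in action (the netcat option spelling varies between versions, and the host and service here are made up), asking for a reserved source port on an outgoing connection works as root and fails for ordinary users:

# as root: the connection goes out from source port 1023
nc -p 1023 server.example.com 514

# as an ordinary user: binding to port 1023 fails with 'Permission denied'
nc -p 1023 server.example.com 514

This is also why the classic rsh and rlogin client programs were installed setuid root; they needed root only long enough to bind a reserved source port.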

Sidebar: The BUGS section of rlogind and rshd is wise

In the style of Unix manpages admitting big issues that they don't have any good way of dealing with at the moment, the manpages of both rlogind and rshd end with:

.SH BUGS
The authentication procedure used here assumes the
integrity of each client machine and the connecting
medium.  This is insecure, but is useful in an ``open''
environment.

.PP
A facility to allow all data exchanges to be encrypted
should be present.

These issues would stay for a decade or so, getting slowly more significant over time, until they were finally fixed by SSH in 1995. Well, starting in 1995, the switch wasn't exactly instant and even now the process of replacing rsh and rlogin is sort of ongoing (and also).

BSDRcmdsAndPrivPorts written at 00:38:16

2017-06-17

One reason you have a mysterious Unix file called 2 (or 1)

Suppose, one day, that you look at the ls of some directory and you notice that you have an odd file called '2' (just the digit). If you look at the contents of this file, it probably has nothing that's particularly odd looking; in fact, it likely looks like plausible output from a command you might have run.

Congratulations, you've almost certainly fallen victim to a simple typo, one that's easy to make in interactive shell usage and in Bourne shell scripts. Here it is:

echo hi  >&2
echo oop >2

The equivalent typo to create a file called 1 is very similar:

might-err 2>&1 | less
might-oop 2>1  | less

(The 1 files created this way are often empty, although not always, since many commands rarely produce anything on standard error.)

In each case, accidentally omitting the '&' in the redirection converts it from redirecting one file descriptor to another (for instance, forcing echo to report something to standard error) into a plain redirect-to-file redirection where the name of the file is your target file descriptor number.

Some of the time you'll notice the problem right away because you don't get output that you expect, but in other cases you may not notice for some time (or ever notice, if this was an interactive command and you just moved on after looking at the output as it was). Probably the easiest version of this typo to miss is in error messages in shell scripts:

if [ ! -f "$SOMETHING" ]; then
  echo "$0: missing file $SOMETHING" 1>2
  echo "$0: aborting" 1>&2
  exit 1
fi

You may never run the script in a way that triggers this error condition, and even if you do you may not realize (or remember) that you're supposed to get two error messages, not just the 'aborting' one.

(After we stumbled over such a file recently, I grep'd all of my scripts for '>2' and '>1'. I was relieved not to find any.)
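
If you want to do a similar audit, a rough version is below; it can produce false positives and misses a typo at the very start of a line, but it's good enough to eyeball (point it at wherever your scripts actually live):

grep -n '[^&]>[12]' /path/to/scripts/*.sh

Legitimate redirections such as '>&2' and '2>&1' don't match, because in them the '>' is immediately followed by an '&' rather than a bare digit.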

(For more fun with redirection in the Bourne shell, see also how to pipe just standard error.)

ShellStderrRedirectionOops written at 23:58:57
