2015-07-17
Your standard input is a tty in a number of surprising cases
Every once in a while, someone writing a program decides that
checking to see whether standard input is a tty (via isatty())
is a great way of determining 'am I being run interactively or
not?'. This certainly sounds like a good way to do this check if
you aren't familiar with Unix and don't actually test it in a range
of situations, but in fact it is wrong almost all of the time.
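As a concrete illustration, a minimal C sketch of the check in question looks like this; the rest of this entry is about why its answer doesn't mean what people hope it means.

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* The check this entry is about: it only tells you whether
           file descriptor 0 is a terminal, not whether a human is
           actually interacting with this particular program. */
        if (isatty(STDIN_FILENO))
            printf("stdin is a tty\n");
        else
            printf("stdin is not a tty\n");
        return 0;
    }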
For a start, this is wrong if your command is just being run in a shell script. Commands run from a shell script inherit the script's standard input; if you just ran the script itself from a shell command line, well, that's your tty. No Unix shell can behave differently because passing stdin to script commands is what lets shell scripts work in the middle of pipelines. But plain commands are the obvious case, so let's go for an odder one:
var=$(/some/command ....)
You guessed it: /some/command inherits the shell's standard input
and thus may have its standard input connected to your tty. Its
standard output is not a tty, of course; it's being collected by
the shell instead.
Now let's talk about GNU Make. Plain commands in Makefiles are like
plain commands in shell scripts; make gets your standard input
and passes it to commands being run. In my opinion this is far less
defensible than with shell scripts, although I'm sure someone has
a setup that uses make and a Makefile in the middle of a pipeline
and counts on the commands run from the Makefile being able to read
standard input. Still, I suspect a certain number of people would
be surprised by that.
GNU Make has a feature where it can run a shell command as it parses the Makefile in order to do things like set up the value of Makefile variables. This looks like (in the simple version):
AVAR := $(shell /some/command ...)
This too can have isatty(stdin) be true. Like the shell, GNU Make
passes its standard input down even to things being run via command
substitution.
The short form version of this is that almost anything that's run
even indirectly by a user from their shell prompt may have standard
input be a tty. Run from a shell script that's run from three levels
of Makefiles (and makes) that are started from a shell script
that's spawned from a C program that does a system()? Unless
there's a pipeline somewhere in there, you probably still have
standard input connected to the user's tty.
It follows that checking isatty(stdin) is a terrible way of seeing
whether or not your program is being run interactively, unless the
version of 'interactively' you care about is whether you're being
run from something that's totally detached from the user, like a
crontab or a ssh remote command execution (possibly an automated
one). Standard input not being a tty doesn't guarantee this, of
course, but if standard input is a tty you can be pretty sure that
you aren't being run from crontab et al.
(The corollary of this is that if you're writing shell scripts and
so on, you may sometimes want to deliberately disconnect standard
input from what it normally would be. This doesn't totally stop
people from talking to the user (they can always explicitly open
/dev/tty), but at least it makes it less likely to happen more
or less by accident.)
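In a shell script this is typically just 'exec </dev/null'. As a C sketch of the same idea, here's a hypothetical detach_stdin() helper (not from any particular program) that points standard input at /dev/null before spawning children:

    #include <fcntl.h>
    #include <unistd.h>

    /* Point our standard input (and thus the standard input that any
       child processes will inherit) at /dev/null instead of the tty. */
    void detach_stdin(void)
    {
        int fd = open("/dev/null", O_RDONLY);
        if (fd >= 0) {
            dup2(fd, STDIN_FILENO);
            if (fd != STDIN_FILENO)
                close(fd);
        }
    }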
2015-07-13
My personal view of OpenBSD
I have nothing against OpenBSD in situations where it works well; we run it for firewalls and a few other narrow purposes which it does well at. But I have no love for it either and left to my own devices I probably wouldn't use it for anything. Certainly I can't imagine running OpenBSD on a personal machine.
Some of this is an extension of my pragmatic and technical views on FreeBSD versus Linux, with a bit of the cultural bad blood thrown in as well. Some of it is all of the sober, calm impacts of OpenBSD's culture, since I have good reasons not to run systems where I don't think I'm going to have very much fun trying to get support or help or report bugs. But that's the easy to write about and incomplete version.
The core truth is that I don't want to associate myself with the OpenBSD culture that I described. I no longer want to be anywhere near a community that is abrasive in general and hostile to newcomers (either openly or simply by being 'clever'), one where people abusing each other on mailing lists is a perfectly okay event, and so on. My hands are not clean here, because I have been one of those clever people in the past (and part of the appreciative audience of such clever people, too). But looking back at that part of my past conduct now mostly makes me wince. Today, I try to be better and do better.
(I'm not entirely consistent, given that Linux has its own issues with this. But I feel that they are less pervasive because Linux is a much more split up system; there is no one group of people that is the entire main system the way there is with OpenBSD.)
Even if I never experienced or even saw the dark side of OpenBSD, I would know that it was there. And these days I simply don't want to hang around that sort of a place; it's not something that I find pleasant any more. And in turn that taints OpenBSD itself, because it is the heart of that culture.
PS: I don't know if what I hear about OpenBSD's culture is actually true (or if it's still true). That's why I've called it folklore. But if it isn't true, well, the OpenBSD people have a problem, because it is very pervasive folklore (and historically it clearly has some basis in fact) and I'm not sure people are going to easily believe that it's false.
(Not that I expect that the people in the OpenBSD community care about this issue or my views. Rather the contrary; it would be surprising if they were not perfectly fine with the current state of their community, and maybe rather enjoy it just the way it is.)
2015-06-30
The probable and prosaic explanation for a socket() API choice
It started on Twitter:
@mjdominus: Annoyed today that the BSD people had socket(2) return a single FD instead of a pair the way pipe(2) does. That necessitated shutdown(2).
@thatcks: I suspect they might have felt forced to single-FD returns by per-process and total kernel-wide FD limits back then.
I came up with this idea off the cuff and it felt convincing at the
moment that I tweeted it; after all, if you have a socket server
or the like, such as inetd, moving to a two-FD model for sockets
means that you've just more or less doubled the number of file
descriptors your process needs. Today we're used to systems that
let processes have a lot of open file descriptors at once, but
historically Unix had much lower limits and it's not hard to imagine
inetd running into them.
It's a wonderful theory but it immediately runs aground on the
practical reality that socket() and accept() were introduced
no later than 4.1c BSD, while inetd only appeared in 4.3 BSD (which was years later). Thus it seems
very unlikely that the BSD developers were thinking ahead to processes
that would open a lot of sockets at the time that the socket()
API was designed. Instead I think that there are much simpler and
more likely explanations for why the API isn't the way Mark Jason
Dominus would like.
The first is that it seems clear that the BSD people were not
particularly concerned about minimizing new system calls; instead
BSD was already adding a ton of new system features and system
calls. Between 4.0 BSD and 4.1c BSD, they went from 64 syscall table
entries (not all of them real syscalls) to 149 entries. In this
atmosphere, avoiding adding one more system call is not likely to have
been a big motivator or in fact even very much on people's minds. Nor
was networking the only source of additions; 4.1c BSD added rename(),
mkdir(), and rmdir(), for example.
The second is that C makes multi-return APIs more awkward than
single-return APIs. Contrast the pipe() API, where you must construct
a memory area for the two file descriptors and pass a pointer to it,
with the socket() API, where you simply assign the return value. Given
a choice, I think a lot of people are going to design a socket()-style
API rather than a pipe()-style API.
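To make the contrast concrete, here's a minimal sketch of both calls side by side (error handling pared down to the bare minimum):

    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int pipefd[2];

        /* pipe(): you supply memory for both descriptors and pass a
           pointer; the return value only signals success or failure. */
        if (pipe(pipefd) == -1)
            return 1;

        /* socket(): the single descriptor is the return value itself,
           which is the natural shape for a C function. */
        int sock = socket(AF_INET, SOCK_STREAM, 0);
        if (sock == -1)
            return 1;

        close(pipefd[0]);
        close(pipefd[1]);
        close(sock);
        return 0;
    }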
There's also the related issue that one reason the pipe() API
works well returning two file descriptors is that the file
descriptors involved almost immediately go in different 'directions'
(often one goes to a sub-process); there aren't very many situations
where you want to pass both file descriptors around to functions
in your program. This is very much not the case in network related
programs, especially programs that use select(); if socket()
et al returned two file descriptors, one for read and one for write,
I think that you'd find they were often passed around together.
Often you'd prefer them to be one descriptor that you could use
either for reading or writing depending on what you were doing at
the time. Many classical network programs (and protocols) alternate
reading and writing from the network, after all.
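This is also where shutdown(2) (from the original tweet) fits in: with a single bidirectional descriptor, half-closing a conversation takes an explicit call. A minimal sketch, with a hypothetical finish_sending() that assumes fd is an already-connected socket:

    #include <sys/socket.h>

    /* Tell the peer we're done sending while we keep reading its
       replies; with one bidirectional descriptor this needs
       shutdown() rather than close(). */
    void finish_sending(int fd)
    {
        shutdown(fd, SHUT_WR);
    }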
(Without processes that open multiple sockets, you might wonder
what select() is there for. The answer is programs like telnet
and rlogin (and their servers), which talk to both the network
and the tty at the same time. These were already present in 4.1c
BSD, at the dawn of the socket() API.)
Sidebar: The pipe() user API versus the kernel API
Before I actually looked at the 4.1c BSD kernel source code, I was
also going to say that the kernel to user API makes returning more
than one value awkward because your kernel code has to explicitly
fish through the pointer that userland has supplied it in things
like the pipe() system call. It turns out that this is false.
Instead, as far back as V7 and
probably further, the kernel to user API could return multiple
values; specifically, it could return two values. pipe() used
this to return both file descriptors without having to fish around
in your user process memory, and it was up to the C library to write
these two return values to your pipefd array.
I really should have expected this; in a kernel, no one wants to have to look at user process memory if they can help it. Returning two values instead of one just needs an extra register in the general assembly level syscall API and there you are.
2015-06-29
BSD Unix developed over more time than I usually think
Left to myself, I tend to sloppily think of 4.2 BSD as where all of the major development of BSD Unix took place and the point in time where what we think of as 'BSD Unix' formed. Sure, there were BSDs before and after 4.2 BSD, but I think of the before releases as just the preliminaries and the releases after 4.2 BSD as just polishing and refining things a bit. As I was reminded today, this view is in fact wrong.
If you'd asked me what 4.x BSD release inetd first appeared in, I
would have confidently told you that it had to have appeared in 4.2 BSD
along with all of the other networking stuff. Inetd is such a pivotal
bit of the BSD networking (along with the services that it enables,
like finger) that of course it would be there from the start in 4.2,
right?
Wrong. It turns out that inetd only seems to have appeared in 4.3
BSD. In fact a number of related bits of 4.2 BSD are surprisingly
under-developed and different from what I think of as 'the BSD way'.
Obviously, finger in 4.2 BSD is not network enabled, but a more
fundamental thing is that 4.2 BSD limits processes to only 20 open
file descriptors at once (by default, and comments in the source
suggest that this cannot be raised above 30 no matter what).
Instead it is 4.3 BSD that introduced not just inetd but a higher
limit on the number of open file descriptors (normally 64).
With that higher limit came the modern FD_* set of macros used
to set, check, and clear bits in the select() file descriptor
bitmaps; 4.2 BSD didn't need these since the file descriptor masks
fit into a single 32-bit word.
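For reference, the FD_* idiom (written here with modern headers) looks like this; wait_for_input() is a hypothetical sketch of the telnet/rlogin pattern, assuming netfd is an already-connected socket:

    #include <sys/select.h>
    #include <unistd.h>

    void wait_for_input(int netfd)
    {
        fd_set rfds;

        FD_ZERO(&rfds);
        FD_SET(STDIN_FILENO, &rfds);
        FD_SET(netfd, &rfds);

        /* select() wants the highest descriptor number plus one. */
        if (select(netfd + 1, &rfds, NULL, NULL, NULL) > 0) {
            if (FD_ISSET(STDIN_FILENO, &rfds))
                ;   /* read from the tty, write to the network */
            if (FD_ISSET(netfd, &rfds))
                ;   /* read from the network, write to the tty */
        }
    }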
(I discovered this due to a Twitter conversation with Mark Jason Dominus. I now think my initial answer is almost certainly wrong, but that's going to be another entry.)
Sidebar: dup2() and BSD's low file descriptor limit
Given the existence of the dup2() system call, which in theory
lets you create a file descriptor with any FD number, you might
wonder how 4.2 BSD got away with a 32-bit word for the select()
bitmask. The answer turns out to be that 4.2 BSD simply forbade you
from dup2()'ing to a file descriptor number bigger than 19 (or
in general the NOFILE constant).
(You can see the code for this in the dup2() implementation.
In general a lot of the early Unix kernel source code is quite simple
and readable, which is handy at times like this.)
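As a user-level illustration of what that restriction meant, a program that tried something like the following would have gotten an error back from dup2() on 4.2 BSD, while on a modern system it normally succeeds:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* On 4.2 BSD, a target descriptor at or above NOFILE (20)
           was simply rejected; today this normally works because
           per-process descriptor limits are much higher. */
        if (dup2(STDIN_FILENO, 25) == -1)
            perror("dup2 to fd 25");
        else
            printf("dup2 to fd 25 succeeded\n");
        return 0;
    }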
2015-06-22
Modern *BSDs have a much better init system than I was expecting
For a long time, the *BSDs (FreeBSD, OpenBSD, and NetBSD) had what
was essentially the classical BSD init system, with all of its weaknesses. They made
things a little bit simpler by having things like a configuration
file where you could set whether standard daemons were started or
not (and what arguments they got), instead of having to hand edit
your /etc/rc, but that was about the extent of their niceness.
When I started being involved with OpenBSD on our firewalls here, that was the 'BSD init system' that
I got used to (to the extent that I had anything to do with it at
all).
Well, guess what. While I wasn't looking, the *BSDs have introduced
a much better system called rc.d. The rc.d system is basically
a lightweight version of System V init; it strips out all of the
runlevels, rcN.d directories, SNN and KNN symlinks, and so on to
wind up with just shell scripts in /etc/rc.d and some additional
support stuff.
As far as I can tell from some quick online research, this system originated in NetBSD back in 2001 or so (see the bottom). FreeBSD then adopted it in FreeBSD 5.0, released in January 2003, although they may not have pushed it widely initially (their Practical rc.d scripting in BSD has an initial copyright date of 2005). OpenBSD waited for quite a while (in the OpenBSD way), adopting it only in OpenBSD 4.9 (cf), which came out in May of 2011.
Of course what this really means is that I haven't looked into the state of modern *BSDs for quite a while. Specifically, I haven't looked into FreeBSD (I'm not interested in OpenBSD for anything except its specialist roles). For various reasons I haven't historically been interested in FreeBSD, so my vague impressions of it basically froze a long time ago. Clearly this is somewhat of a mistake and FreeBSD has moved well forward from what I naively expected. Ideally I should explore modern FreeBSD at some point.
(The trick with doing this is finding something real to use FreeBSD for. It's not going to be my desktop and it's probably not going to be any of our regular servers, although it's always possible that FreeBSD would be ideal for something and we just don't know it because we don't know FreeBSD.)
2015-06-21
Why System V init's split scripts approach is better than classical BSD
Originally, Unix had very simple startup and shutdown processes. The System V init system modernized them, resulting in important improvements over the classical BSD one. Although I've discussed those improvements in passing, today I want to talk about why the general idea behind the System V init system is so important and useful.
The classical BSD approach to system init is that there are /etc/rc
and /etc/rc.local shell scripts that are run on boot. All daemon
starting and other boot time processing is done from one or the
other. There is no special shutdown processing; to shut the machine
down you just kill all of the processes (and then make a system
call to actually reboot). This has the positive virtue that it's
really simple, but it's got some drawbacks.
This approach works fine starting the system (orderly system shutdown
was out of scope originally). It also works fine for restarting
daemons, provided that your daemons are single process things that
can easily be shut down with 'kill' and then restarted with more
or less 'daemon &'. Initially this was the case in 4.xBSD, but
as time went on and Unix vendors added complications like NFS, more
and more things departed from this simple 'start a process; kill a
process; start a process again' model of starting and restarting.
The moment people started to have more complicated startup and
shutdown needs than 'kill' and 'daemon &', we started to have
problems. Either you carefully memorized all of this stuff or you
kept having to read /etc/rc to figure out what to do to restart
or redo thing X. Does something need a multi-step startup? You're
going to be entering those multiple steps yourself. Does something
need you to kill four or five processes to shut it down properly?
Get used to doing that, and don't forget one. All of this was a
pain even in the best cases (which were single daemon processes that
merely required the right magic command line arguments).
(In practice people not infrequently wrote their own scripts that
did all of this work, then ran the scripts from /etc/rc or
/etc/rc.local. But there was always a temptation to skip that
step because after all your thing was so short, you could put it
in directly.)
By contrast, the System V init approach of separate scripts puts
that knowledge into reusable components. Need to stop or start or
restart something? Just run '/etc/init.d/<whatever> <what>' and
you're done. What the init.d scripts are called is small enough
knowledge that you can probably keep it in your head, and if you
forget it's usually easy enough to look it up with an ls.
(Separate scripts are also easier to manage than a single monolithic file.)
Of course you don't need the full complexity of System V init in
order to realize these advantages. In fact, back in the long ago
days when I dealt with a classical BSD init system I decided that
the split scripts approach was such a big win that I was willing
to manually split up /etc/rc into separate scripts just to get a
rough approximation of it. The result was definitely worth the
effort; it made my sysadmin life much easier.
(This manual split of much of /etc/rc is the partial init system
I mentioned here.)
2015-06-16
NFS writes and whether or not they're synchronous
In the original NFS v2, the situation with
writes was relatively simple. The protocol specified that the server
could only acknowledge write operations when it had committed them
to disk, both for file data writes and for metadata operations such
as creating files and directories, renaming files, and so on.
Clients were free to buffer writes locally before sending them to
the server and generally did, just as they buffered writes before
sending them to local disks. As usual, when a client program did
a sync() or a fsync(), this caused the client kernel to flush
any locally buffered writes to the server, which would then commit
them to disk and acknowledge them.
(You could sometimes tell clients not to do any local buffering and to immediately send all writes to the server, which theoretically resulted in no buffering anywhere.)
This worked and was simple (a big virtue in early NFS), but didn't really go very fast under a lot of circumstances. NFS server vendors did various things to speed writes up, from battery backed RAM on special cards to simply allowing the server to lie to clients about their data being on disk (which results in silent data loss if the server then loses that data, eg due to a power failure or abrupt reboot).
In NFS v3 the protocol was revised to add asynchronous writes and
a new operation, COMMIT, to force the server to really flush your
submitted asynchronous writes to disk. An NFS v3 server is permitted
to lose submitted asynchronous writes up until you issue a successful
COMMIT operation; this implies that the client must hang on to a
copy of the written data so that it can resend it if needed. Of
course, the server can start writing your data earlier if it wants
to; it's up to the server. In addition clients can specify that
their writes are synchronous, reverting NFS v3 back to the v2
behavior.
(See RFC 1813 for the gory details. It's actually surprisingly readable.)
In the simple case the client kernel will send a single COMMIT
at the end of writing the file (for example, when your program
closes it or fsync()s it). But if your program writes a large
enough file, the client kernel won't want to buffer all of it in
memory and so will start sending COMMIT operations to the server
every so often so it can free up some of those write buffers. This
can cause unexpected slowdowns under some circumstances, depending on a lot of factors.
(Note that just as with other forms of writeback disk IO, the client
kernel may do these COMMITs asynchronously from your program's
activity. Or it may opt to not try to be that clever and just force
a synchronous COMMIT pause on your program every so often. There
are arguments either way.)
If you write NFS v3 file data synchronously on the client, either
by using O_SYNC or by appropriate NFS mount options, the client
will not just immediately send it to the server without local
buffering (the way it did in NFS v2), it will also insist that the
server write it to disk synchronously. This means that forced
synchronous client IO in NFS v3 causes a bigger change in performance
than it does in NFS v2; basically you reduce NFS v3 down to NFS v2's
end-to-end synchronous writes. You're not just eliminating client
buffering, you're eliminating all buffering and increasing how many
IO operations the server must do (well, compared to normal NFS v3 write IO).
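As a client-side sketch of what this looks like at the program level, opening with O_SYNC is one way to ask for it (the equivalent NFS mount options vary by system); write_sync() here is a hypothetical helper, not any particular program's code:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Every write() on an O_SYNC descriptor must reach stable storage
       before it returns; on an NFS v3 mount that means an immediate,
       synchronous write on the server as well. */
    int write_sync(const char *path, const char *data)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);
        if (fd == -1)
            return -1;
        ssize_t n = write(fd, data, strlen(data));
        close(fd);
        return (n == (ssize_t)strlen(data)) ? 0 : -1;
    }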
All of this is just for file data writes. NFS v3 metadata operations
are still just as synchronous as they were in NFS v2, so things
like 'rm -rf' on a big source tree are just as slow as they used
to be.
(I don't know enough about NFS v4 to know how it handles synchronous and asynchronous writes.)
2015-06-15
My view of NFS protocol versions
There are three major versions of the NFS protocol that you may encounter or hear about, NFS v2, v3, and v4. Today I feel like running down my understanding of the broad and general differences between them.
NFS v2 is the original version of NFS. It dates from 1985 and boy does it show in the protocol. NFS v2 is obsolete today and should not be used, partly because it's a 32-bit protocol that doesn't allow access to large files. You might wonder why we care about NFS v2 in the modern era, and the answer to that is that a great deal of practical system administration folklore about NFS is based on NFS v2 behavior. Knowing what NFS v2 did can let you understand why people still often believe various things about NFS in general (or the NFS implementations on specific Unixes). NFS v2 was originally UDP only, although I think you can use it over TCP these days if you really want to.
NFS v3 is the 'modern' version, specified in 1995 and adopted steadily since then. Besides being 64-bit and so being able to deal with large files, it added a bunch of important performance improvements. Support for NFS over TCP was generally added (and made to work well) with NFS v3, although systems made it available for NFS v2 as well. NFS v3 is fundamentally the same as NFS v2; it could be described as 'NFS v2 with obvious tweaks'. NFS v2 environments could generally be easily moved to NFS v3 when the client and server support materialized and they'd generally see better performance.
For most people, the biggest performance difference between NFS v2 and NFS v3 is that in NFS v2 all writes are synchronous and in NFS v3 they're not necessarily so. This is a sufficiently complicated subject that it needs its own entry.
NFS v4 dates from the early 2000s and is a major change from previous versions of NFS. The core NFS protocol got much more complex (partly because it swallowed a number of what had previously been side protocols for things like mounting and locking) and a bunch of core assumptions changed. Most important for many people running real NFS servers (us included) is that NFS v4 is (primarily) designed to be a real distributed filesystem with real security, and it's often described as requiring that security. However, you can apparently run it with traditional NFS 'we trust clients' security if you want and things may even work decently that way these days.
(NFS v4 is apparently not supported on OpenBSD, although it is on Linux, OmniOS, Solaris, and FreeBSD.)
Initial NFS v4 server implementations put various restrictions on how you could arrange your NFS exports; for example, they might have to all be located under a single directory on the server. Current NFS v4 server implementations on at least Linux and OmniOS seem to have removed this requirement, although writeups on the Internet haven't necessarily caught up with this. As a result it's now common for such servers to export everything for both NFS v3 and NFS v4 if you don't do anything special.
My personal experience with NFS v4 is minimal. We very much don't want its security improvements and nothing else we've heard has sounded particularly compelling, so we run NFS v3. The few times I've wound up using NFS v4 it's been because a new out of the box server (still) allowed clients to do NFS v4 mounts, the clients defaulted to it, and the mounts had odd things going on with them that caused me to notice this. I suspect that we could make NFS v4 transparently equivalent to NFS v3 for us with more configuration work, but we haven't so far and I'm not sure we'd really get anything from it.
(Because I've primarily associated NFS v4 with its (undesired for us) security improvements (partly because that's what a lot of people talk about), I've historically had a bad view of it and of modern NFS protocol development. This is probably a mild mistake by now.)
(Note that going to NFS v4 with AUTH_SYS authentication wouldn't
get us around the 16 groups limitation.)
2015-05-31
Unix has been bad before
These days it's popular to complain about the terrible state of software on modern Linux machines, with their tangle of opaque DBus services, weird Gnome (or KDE) software, and the requirement for all sorts of undocumented daemons to do anything. I've written a fair number of entries like this myself. But make no mistake, Linux is not uniquely bad here and is not some terrible descent from a previous state of Unix desktop grace.
As I've alluded to before, the reality is that all of the old time Unix workstation vendors did all sorts of similarly terrible things themselves, back in the days when they were ongoing forces. No Unix desktop has ever been a neat and beautiful thing under the hood; all of them have been ugly and generally opaque conglomerations of wacky ideas. Sometimes these ideas spilled over into broader 'server' software and caused the expected heartburn in sysadmins there.
To the extent that the Unixes of the past were less terrible than the present, my view is that this is largely because old time Unix vendors were constrained by more limited hardware and software environments. Given modern RAM, CPUs, and graphics hardware and current software capabilities, they probably would have done things that are at least as bad as Linux systems are doing today. Instead, having only limited RAM and CPU power necessarily limited their ability to do really bad things (at least usually).
(One of the reasons that modern Linux stuff is better than it could otherwise be is that at least some of the people creating it have learned from the past and are thereby avoiding at least some of the mistakes people have already made.)
Also, while most of the terrible things have been confined to desktop Unix, not all of them were. Server Unix has seen its own share of past bad mistakes from various Unix vendors. Fortunately they tended to be smaller mistakes, if only because a lot of vendor effort was poured into desktops (well, most of the time; let's not talk about how the initial SunOS 4 releases ran on servers).
The large scale lesson I take from all of this is that Unix (as a whole) can and will recover from things that turn out to be mistakes. Sometimes it's a rocky road that's no fun while we're on it, but we get there eventually.
2015-05-15
The pending delete problem for Unix filesystems
Unix has a number of somewhat annoying filesystem semantics that
tend to irritate designers and implementors of filesystems. One of
the famous ones is that you can delete a file without losing access
to it. On at least some OSes, if your program open()s a file and
then tries to delete it, either the deletion fails with 'file is
in use' or you immediately lose access to the file; further attempts
to read or write it will fail with some error. On Unix your program
retains access to the deleted file and can even pass this access
to other processes in various ways. Only when the last process using
the file closes it will the file actually get deleted.
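The classic demonstration of this is the 'anonymous scratch file' pattern; a minimal sketch:

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* Create a file, open it, then delete its name. The open
           descriptor keeps the file's contents alive; the space is
           only freed when the last descriptor on it is closed. */
        int fd = open("scratchfile", O_RDWR | O_CREAT, 0600);
        if (fd == -1)
            return 1;
        unlink("scratchfile");

        /* We can still read and write through fd here. */
        if (write(fd, "still here\n", 11) == -1)
            return 1;

        close(fd);   /* now the file's space is actually released */
        return 0;
    }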
This 'use after deletion' presents Unix and filesystem designers
with the problem of how you keep track of this in the kernel. The
historical and generic kernel approach is to keep both a link count
and a reference count for each active inode; an inode is only marked
as unused and the filesystem told to free its space when both counts
go to zero. Deleting a file via unlink() just lowers the link
count (and removes a directory entry); closing open file descriptors
is what lowers the reference count. This historical approach ignored
the possibility of the system crashing while an inode had become
unreachable through the filesystem and was only being kept alive
by its reference count; if this happened the inode became a zombie,
marked as active on disk but not referred to by anything. To fix
it you had to run a filesystem checker, which would
find such no-link inodes and actually deallocate them.
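In other words (a purely conceptual sketch, not any real kernel's code), the rule being applied is roughly:

    struct inode {
        int i_nlink;   /* directory entries pointing at this inode */
        int i_count;   /* in-kernel references (open files and so on) */
    };

    /* An inode's space is only released when it has no names left
       *and* nothing in the kernel still holds it open. */
    static void maybe_free_inode(struct inode *ip)
    {
        if (ip->i_nlink == 0 && ip->i_count == 0) {
            /* mark the inode unused on disk and tell the filesystem
               to deallocate its data blocks */
        }
    }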
(When Sun introduced NFS they were forced to deviate slightly from this model, but that's an explanation for another time.)
Obviously this is not suitable for any sort of journaling or 'always
consistent' filesystem that wants to avoid the need for a fsck
after unclean shutdowns. All such filesystems must keep track of
such 'deleted but not deallocated' files on disk using some mechanism
(and the kernel has to support telling filesystems about such
inodes). When the filesystem is unmounted in an orderly way, these
deleted files will probably get deallocated. If the system crashes,
part of bringing the filesystem up on boot will be to apply all of
the pending deallocations.
Some filesystems will do this as part of their regular journal; you journal, say, 'file has gone to 0 reference count', and then you know to do the deallocation on journal replay. Some filesystems may record this information separately, especially if they have some sort of 'delayed asynchronous deallocation' support for file deletions in general.
(Asynchronous deallocation is popular because it means your process
can unlink() a big file without having to stall while the kernel
frantically runs around finding all of the file's data blocks and
then marking them all as free. Given that finding out what a file's
data blocks are often requires reading things from disk, such deallocations can be relatively
slow under disk IO load (even if you don't have other issues there).)
PS: It follows that a failure to correctly record pending deallocations or properly replay them is one way to quietly lose disk space on such a journaling filesystem. Spotting and fixing this is one of the things that you need a filesystem consistency checker for (whether it's a separate program or embedded into the filesystem itself).