Wandering Thoughts archives

2015-07-17

Your standard input is a tty in a number of surprising cases

Every once in a while, someone writing a program decides that checking to see whether standard input is a tty (via isatty()) is a great way of determining 'am I being run interactively or not?'. This certainly sounds like a good way to do the check if you aren't familiar with Unix and don't actually test it in very many situations, but in fact it is wrong almost all of the time.

For a start, this is wrong if your command is just being run in a shell script. Commands run from a shell script inherit the script's standard input; if you just ran the script itself from a shell command line, well, that's your tty. No Unix shell can behave differently because passing stdin to script commands is what lets shell scripts work in the middle of pipelines. But plain commands are the obvious case, so let's go for an odder one:

var=$(/some/command ....)

You guessed it: /some/command inherits the shell's standard input and thus may have its standard input connected to your tty. Its standard output is not a tty, of course; it's being collected by the shell instead.

Now let's talk about GNU Make. Plain commands in Makefiles are like plain commands in shell scripts; make gets your standard input and passes it to commands being run. In my opinion this is far less defensible than with shell scripts, although I'm sure someone has a setup that uses make and a Makefile in the middle of a pipeline and counts on the commands run from the Makefile being able to read standard input. Still, I suspect a certain number of people would be surprised by that.

GNU Make has a feature where it can run a shell command as it parses the Makefile in order to do things like set up the value of Makefile variables. This looks like (in the simple version):

AVAR := $(shell /some/command ...)

This too can have isatty(stdin) be true. Like the shell, GNU Make passes its standard input down even to things being run via command substitution.

The short form version of this is that almost anything run even indirectly by a user from their shell prompt may have a tty as its standard input. Run from a shell script that's run from three levels of Makefiles (and makes) that are started from a shell script that's spawned from a C program that does a system()? Unless there's a pipeline somewhere in there, you probably still have standard input connected to the user's tty.
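
If you want to see this for yourself, a tiny checker makes it easy to test in all of these contexts. This is just an illustrative sketch (nothing standard); it reports on standard error so that you can see its output even when it's run inside $(...) or $(shell ...):

#include <stdio.h>
#include <unistd.h>

/* Report whether stdin, stdout, and stderr are currently ttys. */
int main(void)
{
    const char *names[] = { "stdin", "stdout", "stderr" };
    for (int fd = 0; fd < 3; fd++)
        fprintf(stderr, "%s %s a tty\n", names[fd],
                isatty(fd) ? "is" : "is not");
    return 0;
}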

It follows that checking isatty(stdin) is a terrible way of seeing whether or not your program is being run interactively, unless the version of 'interactively' you care about is whether you're being run from something that's totally detached from the user, like a crontab or a ssh remote command execution (possibly an automated one). Standard input not being a tty doesn't guarantee this, of course, but if standard input is a tty you can be pretty sure that you aren't being run from crontab et al.

(The corollary of this is that if you're writing shell scripts and so on, you may sometimes want to deliberately disconnect standard input from what it normally would be. This doesn't totally stop people from talking to the user (they can always explicitly open /dev/tty), but at least it makes it less likely to happen by more or less accident.)
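
If you're on the other side of this and you're a C program running commands via system(), one minimal sketch of doing the same deliberate disconnection (the command here is only a placeholder) is to re-point standard input at /dev/null before you run anything:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Re-point our stdin (and so the child's) at /dev/null, so commands
       we run can't quietly read from the user's tty through fd 0. */
    if (freopen("/dev/null", "r", stdin) == NULL) {
        perror("freopen /dev/null");
        return 1;
    }
    return system("/some/command ...") == 0 ? 0 : 1;
}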

StdinIsOftenATty written at 00:58:36

2015-07-13

My personal view of OpenBSD

I have nothing against OpenBSD in situations where it works well; we run it for firewalls and a few other narrow purposes which it does well at. But I have no love for it either and left to my own devices I probably wouldn't use it for anything. Certainly I can't imagine running OpenBSD on a personal machine.

Some of this is an extension of my pragmatic and technical views on FreeBSD versus Linux, with a bit of the cultural bad blood thrown in as well. Some of it is the sober, practical impact of OpenBSD's culture, since I have good reasons not to run systems where I don't think trying to get support or help or to report bugs is going to be much fun. But that's the easy to write about and incomplete version.

The core truth is that I don't want to associate myself with the OpenBSD culture that I described. I no longer want to be anywhere near a community that is abrasive in general and hostile to newcomers (either openly or simply by being 'clever'), one where people abusing each other on mailing lists is a perfectly okay event, and so on. My hands are not clean here, because I have been one of those clever people in the past (and part of the appreciative audience of such clever people, too). But looking back at that part of my past conduct now mostly makes me wince. Today, I try to be better and do better.

(I'm not entirely consistent, given that Linux has its own issues with this. But I feel that they are less pervasive because Linux is a much more split up system; there is no one group of people that is the entire main system the way there is with OpenBSD.)

Even if I never experienced or even saw the dark side of OpenBSD, I would know that it was there. And these days I simply don't want to hang around that sort of a place; it's not something that I find pleasant any more. And in turn that taints OpenBSD itself, because it is the heart of that culture.

PS: I don't know if what I hear about OpenBSD's culture is actually true (or if it's still true). That's why I've called it folklore. But if it isn't true, well, the OpenBSD people have a problem, because it is very pervasive folklore (and historically it clearly has had some basis in fact) and I'm not sure people are going to easily believe that it's false.

(Not that I expect that the people in the OpenBSD community care about this issue or my views. Rather the contrary; it would be surprising if they were not perfectly fine with the current state of their community, and maybe rather enjoy it just the way it is.)

MyOpenBSDView written at 00:38:19

2015-06-30

The probable and prosaic explanation for a socket() API choice

It started on Twitter:

@mjdominus: Annoyed today that the BSD people had socket(2) return a single FD instead of a pair the way pipe(2) does. That necessitated shutdown(2).

@thatcks: I suspect they might have felt forced to single-FD returns by per-process and total kernel-wide FD limits back then.

I came up with this idea off the cuff and it felt convincing at the moment I tweeted it; after all, if you have a socket server or the like, such as inetd, moving to a two-FD model for sockets means that you've just more or less doubled the number of file descriptors your process needs. Today we're used to systems that let processes have a lot of open file descriptors at once, but historically Unix had much lower limits and it's not hard to imagine inetd running into them.

It's a wonderful theory but it immediately runs aground on the practical reality that socket() and accept() were introduced no later than 4.1c BSD, while inetd only arrived in 4.3 BSD (which was years later). Thus it seems very unlikely that the BSD developers were thinking ahead to processes that would open a lot of sockets at the time that the socket() API was designed. Instead I think that there are much simpler and more likely explanations for why the API isn't the way Mark Jason Dominus would like.

The first is that it seems clear that the BSD people were not particularly concerned about minimizing new system calls; instead BSD was already adding a ton of new system features and system calls. Between 4.0 BSD and 4.1c BSD, they went from 64 syscall table entries (not all of them real syscalls) to 149 entries. In this atmosphere, avoiding adding one more system call is not likely to have been a big motivator or in fact even very much on people's minds. Nor was networking the only source of additions; 4.1c BSD added rename(), mkdir(), and rmdir(), for example.

The second is that C makes multi-return APIs more awkward than single-return APIs. Contrast the pipe() API, where you must construct a memory area for the two file descriptors and pass a pointer to it, with the socket() API, where you simply assign the return value. Given a choice, I think a lot of people are going to design a socket()-style API rather than a pipe()-style API.
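
To make the contrast concrete, here's a minimal sketch of the two call styles next to each other (AF_INET and SOCK_STREAM are just example arguments):

#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* pipe(): you must hand it a two-element array and then look inside it. */
    int pfd[2];
    if (pipe(pfd) < 0) {
        perror("pipe");
        return 1;
    }

    /* socket(): the new file descriptor is simply the return value. */
    int sfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sfd < 0) {
        perror("socket");
        return 1;
    }

    close(pfd[0]);
    close(pfd[1]);
    close(sfd);
    return 0;
}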

There's also the related issue that one reason the pipe() API works well returning two file descriptors is that the file descriptors involved almost immediately go in different 'directions' (often one goes to a sub-process); there aren't very many situations where you want to pass both file descriptors around to functions in your program. This is very much not the case in network related programs, especially programs that use select(); if socket() et al returned two file descriptors, one for read and one for write, I think that you'd find they were often passed around together. Often you'd prefer them to be one descriptor that you could use either for reading or writing depending on what you were doing at the time. Many classical network programs (and protocols) alternate reading and writing from the network, after all.

(Without processes that open multiple sockets, you might wonder what select() is there for. The answer is programs like telnet and rlogin (and their servers), which talk to both the network and the tty at the same time. These were already present in 4.1c BSD, at the dawn of the socket() API.)

Sidebar: The pipe() user API versus the kernel API

Before I actually looked at the 4.1c BSD kernel source code, I was also going to say that the kernel to user API makes returning more than one value awkward because your kernel code has to explicitly fish through the pointer that userland has supplied it in things like the pipe() system call. It turns out that this is false. Instead, as far back as V7 and probably further, the kernel to user API could return multiple values; specifically, it could return two values. pipe() used this to return both file descriptors without having to fish around in your user process memory, and it was up to the C library to write these two return values to your pipefd array.

I really should have expected this; in a kernel, no one wants to have to look at user process memory if they can help it. Returning two values instead of one just needs an extra register in the general assembly level syscall API and there you are.

SocketReturnAPIDesign written at 01:10:44

2015-06-29

BSD Unix developed over more time than I usually think

Left to myself, I tend to sloppily think of 4.2 BSD as where all of the major development of BSD Unix took place and the point in time where what we think of as 'BSD Unix' formed. Sure, there were BSDs before and after 4.2 BSD, but I think of the before releases as just the preliminaries and the releases after 4.2 BSD as just polishing and refining things a bit. As I was reminded today, this view is in fact wrong.

If you'd asked me what 4.x BSD release inetd first appeared in, I would have confidently told you that it had to have appeared in 4.2 BSD along with all of the other networking stuff. Inetd is such a pivotal bit of BSD networking (along with the services that it enables, like finger) that of course it would be there from the start in 4.2, right?

Wrong. It turns out that inetd only seems to have appeared in 4.3 BSD. In fact a number of related bits of 4.2 BSD are surprisingly under-developed and different from what I think of as 'the BSD way'. Obviously, finger in 4.2 BSD is not network enabled, but a more fundamental thing is that 4.2 BSD limits processes to only 20 open file descriptors at once (by default, and comments in the source suggest that this cannot be raised above 30 no matter what).

Instead it is 4.3 BSD that introduced not just inetd but a higher limit on the number of open file descriptors (normally 64). With that higher limit came the modern FD_* set of macros used to set, check, and clear bits in the select() file descriptor bitmaps; 4.2 BSD didn't need these since the file descriptor masks fit into a single 32-bit word.
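
As an illustration of what the FD_* macros buy you, here is a hedged sketch of a modern select() call waiting on two descriptors at once (say a network connection and a tty; both are assumed to already exist). On 4.2 BSD you would instead have built the read mask by hand as bits in a single int.

#include <sys/select.h>

/* Block until either the network or the tty is readable and return
   whichever descriptor it was (or -1 on error). */
int wait_readable(int netfd, int ttyfd)
{
    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(netfd, &rfds);
    FD_SET(ttyfd, &rfds);

    int maxfd = (netfd > ttyfd ? netfd : ttyfd) + 1;
    if (select(maxfd, &rfds, NULL, NULL, NULL) < 0)
        return -1;
    return FD_ISSET(netfd, &rfds) ? netfd : ttyfd;
}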

(I discovered this due to a Twitter conversation with Mark Jason Dominus. I now think my initial answer is almost certainly wrong, but that's going to be another entry.)

Sidebar: dup2() and BSD's low file descriptor limit

Given the existence of the dup2() system call, which in theory lets you create a file descriptor with any FD number, you might wonder how 4.2 BSD got away with a 32-bit word for the select() bitmask. The answer turns out to be that 4.2 BSD simply forbade you from dup2()'ing to a file descriptor number bigger than 19 (or in general the NOFILE constant).
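
A hedged little probe of this is easy to write (the target number 25 is an arbitrary choice above that limit):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Try to duplicate stdin onto fd 25. 4.2 BSD would have refused
       any target at or above NOFILE (20); modern systems normally allow it. */
    if (dup2(0, 25) < 0) {
        perror("dup2 to fd 25");
        return 1;
    }
    puts("dup2 to fd 25 succeeded");
    close(25);
    return 0;
}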

(You can see the code for this in the dup2() implementation. In general a lot of the early Unix kernel source code is quite simple and readable, which is handy at times like this.)

BSDExtendedDevelopment written at 01:53:39

2015-06-22

Modern *BSDs have a much better init system than I was expecting

For a long time, the *BSDs (FreeBSD, OpenBSD, and NetBSD) had what was essentially the classical BSD init system, with all of its weaknesses. They made things a little bit simpler by having things like a configuration file where you could set whether standard daemons were started or not (and what arguments they got), instead of having to hand edit your /etc/rc, but that was about the extent of their niceness. When I started being involved with OpenBSD on our firewalls here, that was the 'BSD init system' that I got used to (to the extent that I had anything to do with it at all).

Well, guess what. While I wasn't looking, the *BSDs have introduced a much better system called rc.d. The rc.d system is basically a lightweight version of System V init; it strips out all of the runlevels, rcN.d directories, SNN and KNN symlinks, and so on to wind up with just shell scripts in /etc/rc.d and some additional support stuff.

As far as I can tell from some quick online research, this system originated in NetBSD back in 2001 or so (see the bottom). FreeBSD then adopted it in FreeBSD 5.0, released in January 2003, although they may not have pushed it widely initially (their Practical rc.d scripting in BSD has an initial copyright date of 2005). OpenBSD waited for quite a while (in the OpenBSD way), adopting it only in OpenBSD 4.9 (cf), which came out in May of 2011.

Of course what this really means is that I haven't looked into the state of modern *BSDs for quite a while. Specifically, I haven't looked into FreeBSD (I'm not interested in OpenBSD for anything except its specialist roles). For various reasons I haven't historically been interested in FreeBSD, so my vague impressions of it basically froze a long time ago. Clearly this is somewhat of a mistake and FreeBSD has moved well forward from what I naively expected. Ideally I should explore modern FreeBSD at some point.

(The trick with doing this is finding something real to use FreeBSD for. It's not going to be my desktop and it's probably not going to be any of our regular servers, although it's always possible that FreeBSD would be ideal for something and we just don't know it because we don't know FreeBSD.)

ModernBSDInitSurprise written at 01:46:52

2015-06-21

Why System V init's split scripts approach is better than classical BSD

Originally, Unix had very simple startup and shutdown processes. The System V init system modernized them, resulting in important improvements over the classical BSD one. Although I've discussed those improvements in passing, today I want to talk about why the general idea behind the System V init system is so important and useful.

The classical BSD approach to system init is that there are /etc/rc and /etc/rc.local shell scripts that are run on boot. All daemon starting and other boot time processing is done from one or the other. There is no special shutdown processing; to shut the machine down you just kill all of the processes (and then make a system call to actually reboot). This has the positive virtue that it's really simple, but it's got some drawbacks.

This approach works fine for starting the system (orderly system shutdown was out of scope originally). It also works fine for restarting daemons, provided that your daemons are single process things that can easily be shut down with 'kill' and then restarted with more or less 'daemon &'. Initially this was the case in 4.xBSD, but as time went on and Unix vendors added complications like NFS, more and more things departed from this simple 'start a process; kill a process; start a process again' model of starting and restarting.

The moment people started to have more complicated startup and shutdown needs than 'kill' and 'daemon &', we started to have problems. Either you carefully memorized all of this stuff or you kept having to read /etc/rc to figure out what to do to restart or redo thing X. Does something need a multi-step startup? You're going to be entering those multiple steps yourself. Does something need you to kill four or five processes to shut it down properly? Get used to doing that, and don't forget one. All of this was a pain even in the best cases (which were single daemon processes that merely required the right magic command line arguments).

(In practice people not infrequently wrote their own scripts that did all of this work, then ran the scripts from /etc/rc or /etc/rc.local. But there was always a temptation to skip that step because after all your thing was so short, you could put it in directly.)

By contrast, the System V init approach of separate scripts puts that knowledge into reusable components. Need to stop or start or restart something? Just run '/etc/init.d/<whatever> <what>' and you're done. What the init.d scripts are called is small enough knowledge that you can probably keep it in your head, and if you forget it's usually easy enough to look it up with an ls.

(Separate scripts are also easier to manage than a single monolithic file.)

Of course you don't need the full complexity of System V init in order to realize these advantages. In fact, back in the long ago days when I dealt with a classical BSD init system I decided that the split scripts approach was such a big win that I was willing to manually split up /etc/rc into separate scripts just to get a rough approximation of it. The result was definitely worth the effort; it made my sysadmin life much easier.

(This manual split of much of /etc/rc is the partial init system I mentioned here.)

BSDInitSingleFileWeakness written at 02:05:13

2015-06-16

NFS writes and whether or not they're synchronous

In the original NFS v2, the situation with writes was relatively simple. The protocol specified that the server could only acknowledge write operations when it had committed them to disk, both for file data writes and for metadata operations such as creating files and directories, renaming files, and so on. Clients were free to buffer writes locally before sending them to the server and generally did, just as they buffered writes before sending them to local disks. As usual, when a client program did a sync() or a fsync(), this caused the client kernel to flush any locally buffered writes to the server, which would then commit them to disk and acknowledge them.

(You could sometimes tell clients not to do any local buffering and to immediately send all writes to the server, which theoretically resulted in no buffering anywhere.)

This worked and was simple (a big virtue in early NFS), but didn't really go very fast under a lot of circumstances. NFS server vendors did various things to speed writes up, from battery backed RAM on special cards to simply allowing the server to lie to clients about their data being on disk (which results in silent data loss if the server then loses that data, eg due to a power failure or abrupt reboot).

In NFS v3 the protocol was revised to add asynchronous writes and a new operation, COMMIT, to force the server to really flush your submitted asynchronous writes to disk. An NFS v3 server is permitted to lose submitted asynchronous writes up until you issue a successful COMMIT operation; this implies that the client must hang on to a copy of the written data so that it can resend it if needed. Of course, the server can start writing your data earlier if it wants to; it's up to the server. In addition clients can specify that their writes are synchronous, reverting NFS v3 back to the v2 behavior.

(See RFC 1813 for the gory details. It's actually surprisingly readable.)

In the simple case the client kernel will send a single COMMIT at the end of writing the file (for example, when your program closes it or fsync()s it). But if your program writes a large enough file, the client kernel won't want to buffer all of it in memory and so will start sending COMMIT operations to the server every so often so it can free up some of those write buffers. This can cause unexpected slowdowns under some circumstances, depending on a lot of factors.

(Note that just as with other forms of writeback disk IO, the client kernel may do these COMMITs asynchronously from your program's activity. Or it may opt to not try to be that clever and just force a synchronous COMMIT pause on your program every so often. There are arguments either way.)

If you write NFS v3 file data synchronously on the client, either by using O_SYNC or by appropriate NFS mount options, the client will not just immediately send it to the server without local buffering (the way it did in NFS v2), it will also insist that the server write it to disk synchronously. This means that forced synchronous client IO in NFS v3 causes a bigger change in performance than in NFS v2; basically you reduce NFS v3 down to NFS v2 end to end synchronous writes. You're not just eliminating client buffering, you're eliminating all buffering and increasing how many IOPs the server must do (well, compared to normal NFS v3 write IO).
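
As a client-side sketch of the difference (the path, flags, and buffer here are purely for illustration), these are the two ways a program can force data out; on an NFS v3 mount the second one is what typically ends with a COMMIT:

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Every write() is synchronous end to end; nothing is buffered anywhere. */
int write_osync(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, len);
    close(fd);
    return (n < 0 || (size_t)n != len) ? -1 : 0;
}

/* Write normally (letting the client buffer) and flush it all at the
   end with fsync(), which is when the data must really reach disk. */
int write_then_fsync(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) < 0 || fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}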

All of this is just for file data writes. NFS v3 metadata operations are still just as synchronous as they were in NFS v2, so things like 'rm -rf' on a big source tree are just as slow as they used to be.

(I don't know enough about NFS v4 to know how it handles synchronous and asynchronous writes.)

NFSWritesAndSync written at 00:44:42

2015-06-15

My view of NFS protocol versions

There are three major versions of the NFS protocol that you may encounter or hear about, NFS v2, v3, and v4. Today I feel like running down my understanding of the broad and general differences between them.

NFS v2 is the original version of NFS. It dates from 1985 and boy does it show in the protocol. NFS v2 is obsolete today and should not be used, partly because it's a 32-bit protocol that doesn't allow access to large files. You might wonder why we care about NFS v2 in the modern era, and the answer to that is that a great deal of practical system administration folklore about NFS is based on NFS v2 behavior. Knowing what NFS v2 did can let you understand why people still often believe various things about NFS in general (or the NFS implementations on specific Unixes). NFS v2 was originally UDP only, although I think you can use it over TCP these days if you really want to.

NFS v3 is the 'modern' version, specified in 1995 and adopted steadily since then. Besides being 64-bit and so being able to deal with large files, it added a bunch of important performance improvements. Support for NFS over TCP was generally added (and made to work well) with NFS v3, although systems made it available for NFS v2 as well. NFS v3 is fundamentally the same as NFS v2; it could be described as 'NFS v2 with obvious tweaks'. NFS v2 environments could generally be easily moved to NFS v3 when the client and server support materialized and they'd generally see better performance.

For most people, the biggest performance difference between NFS v2 and NFS v3 is that in NFS v2 all writes are synchronous and in NFS v3 they're not necessarily so. This is a sufficiently complicated subject that it needs its own entry.

NFS v4 dates from the early 2000s and is a major change from previous versions of NFS. The core NFS protocol got much more complex (partly because it swallowed a number of what had previously been side protocols for things like mounting and locking) and a bunch of core assumptions changed. Most important for many people running real NFS servers (us included) is that NFS v4 is (primarily) designed to be a real distributed filesystem and is often described as requiring this. However you can apparently run it with traditional NFS 'we trust clients' security if you want and things may even work decently that way these days.

(NFS v4 is apparently not supported on OpenBSD, although it is on Linux, OmniOS, Solaris, and FreeBSD.)

Initial NFS v4 server implementations put various restrictions on how you could arrange your NFS exports; for example, they might have to all be located under a single directory on the server. Current NFS v4 server implementations on at least Linux and OmniOS seem to have removed this requirement, although writeups on the Internet haven't necessarily caught up with this. As a result it's now common for such servers to export everything for both NFS v3 and NFS v4 if you don't do anything special.

My personal experience with NFS v4 is minimal. We very much don't want its security improvements and nothing else we've heard has sounded particularly compelling, so we run NFS v3. The few times I've wound up using NFS v4 it's been because a new out of the box server (still) allowed clients to do NFS v4 mounts, the clients defaulted to it, and the mounts had odd things going on with them that caused me to notice this. I suspect that we could make NFS v4 transparently equivalent to NFS v3 for us with more configuration work, but we haven't so far and I'm not sure we'd really get anything from it.

(Because I've primarily associated NFS v4 with its (undesired for us) security improvements (partly because that's what a lot of people talk about), I've historically had a bad view of it and of modern NFS protocol development. This is probably a mild mistake by now.)

(Note that going to NFS v4 with AUTH_SYS authentication wouldn't get us around the 16 groups limitation.)

NFSVersionsView written at 01:51:06

2015-05-31

Unix has been bad before

These days it's popular to complain about the terrible state of software on modern Linux machines, with their tangle of opaque DBus services, weird Gnome (or KDE) software, and the requirement for all sorts of undocumented daemons to do anything. I've written a fair amount of entries like this myself. But make no mistake, Linux is not uniquely bad here and is not some terrible descent from a previous state of Unix desktop grace.

As I've alluded to before, the reality is that all of the old time Unix workstation vendors did all sorts of similarly terrible things themselves, back in the days when they were ongoing forces. No Unix desktop has ever been a neat and beautiful thing under the hood; all of them have been ugly and generally opaque conglomerations of wacky ideas. Sometimes these ideas spilled over into broader 'server' software and caused the expected heartburn in sysadmins there.

To the extent that the Unixes of the past were less terrible than the present, my view is that this is largely because old time Unix vendors were constrained by more limited hardware and software environments. Given modern RAM, CPUs, and graphics hardware and current software capabilities, they probably would have done things that are at least as bad as Linux systems are doing today. Instead, having only limited RAM and CPU power necessarily limited their ability to do really bad things (at least usually).

(One of the reasons that modern Linux stuff is better than it could otherwise be is that at least some of the people creating it have learned from the past and are thereby avoiding at least some of the mistakes people have already made.)

Also, while most of the terrible things have been confined to desktop Unix, not all of them were. Server Unix has seen its own share of past bad mistakes from various Unix vendors. Fortunately they tended to be smaller mistakes, if only because a lot of vendor effort was poured into desktops (well, most of the time; let's not talk about how the initial SunOS 4 releases ran on servers).

The large scale lesson I take from all of this is that Unix (as a whole) can and will recover from things that turn out to be mistakes. Sometimes it's a rocky road that's no fun while we're on it, but we get there eventually.

UnixHasBeenBadBefore written at 21:53:41

2015-05-15

The pending delete problem for Unix filesystems

Unix has a number of somewhat annoying filesystem semantics that tend to irritate designers and implementors of filesystems. One of the famous ones is that you can delete a file without losing access to it. On at least some other OSes, if your program open()s a file and then tries to delete it, either the deletion fails with 'file is in use' or you immediately lose access to the file; further attempts to read or write it will fail with some error. On Unix your program retains access to the deleted file and can even pass this access to other processes in various ways. Only when the last process using the file closes it will the file actually get deleted.
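
A minimal sketch of this behaviour (the scratch file name is made up):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("scratchfile", O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Remove the file's only name; it now exists solely through fd. */
    if (unlink("scratchfile") < 0) {
        perror("unlink");
        return 1;
    }
    /* We can still write to it and read it back; the space is only
       actually freed when the last open descriptor is closed. */
    write(fd, "still here\n", 11);
    lseek(fd, 0, SEEK_SET);
    char buf[32];
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n > 0)
        write(1, buf, (size_t)n);
    close(fd);
    return 0;
}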

This 'use after deletion' presents Unix and filesystem designers with the problem of how you keep track of this in the kernel. The historical and generic kernel approach is to keep both a link count and a reference count for each active inode; an inode is only marked as unused and the filesystem told to free its space when both counts go to zero. Deleting a file via unlink() just lowers the link count (and removes a directory entry); closing open file descriptors is what lowers the reference count. This historical approach ignored the possibility of the system crashing while an inode had become unreachable through the filesystem and was only being kept alive by its reference count; if this happened the inode became a zombie, marked as active on disk but not referred to by anything. To fix it you had to run a filesystem checker, which would find such no-link inodes and actually deallocate them.

(When Sun introduced NFS they were forced to deviate slightly from this model, but that's an explanation for another time.)

Obviously this is not suitable for any sort of journaling or 'always consistent' filesystem that wants to avoid the need for a fsck after unclean shutdowns. All such filesystems must keep track of such 'deleted but not deallocated' files on disk using some mechanism (and the kernel has to support telling filesystems about such inodes). When the filesystem is unmounted in an orderly way, these deleted files will probably get deallocated. If the system crashes, part of bringing the filesystem up on boot will be to apply all of the pending deallocations.

Some filesystems will do this as part of their regular journal; you journal, say, 'file has gone to 0 reference count', and then you know to do the deallocation on journal replay. Some filesystems may record this information separately, especially if they have some sort of 'delayed asynchronous deallocation' support for file deletions in general.

(Asynchronous deallocation is popular because it means your process can unlink() a big file without having to stall while the kernel frantically runs around finding all of the file's data blocks and then marking them all as free. Given that finding out what a file's data blocks are often requires reading things from disk, such deallocations can be relatively slow under disk IO load (even if you don't have other issues there).)

PS: It follows that a failure to correctly record pending deallocations or properly replay them is one way to quietly lose disk space on such a journaling filesystem. Spotting and fixing this is one of the things that you need a filesystem consistency checker for (whether it's a separate program or embedded into the filesystem itself).

UnixPendingDeleteProblem written at 01:02:45

