2015-06-30
The probable and prosaic explanation for a socket() API choice
It started on Twitter:
@mjdominus: Annoyed today that the BSD people had socket(2) return a single FD instead of a pair the way pipe(2) does. That necessitated shutdown(2).
@thatcks: I suspect they might have felt forced to single-FD returns by per-process and total kernel-wide FD limits back then.
I came up with this idea off the cuff and it felt convincing at the
moment that I tweeted it; after all, if you have a socket server
or the like, such as inetd, moving to a two-FD model for sockets
means that you've just more or less doubled the number of file
descriptors your process needs. Today we're used to systems that
let processes have a lot of open file descriptors at once, but
historically Unix had much lower limits and it's not hard to imagine
inetd running into them.
It's a wonderful theory but it immediately runs aground on the
practical reality that socket() and accept() were introduced
no later than 4.1c BSD, while inetd only arrived in 4.3 BSD (which was years later). Thus it seems
very unlikely that the BSD developers were thinking ahead to processes
that would open a lot of sockets at the time that the socket()
API was designed. Instead I think that there are much simpler and
more likely explanations for why the API isn't the way Mark Jason
Dominus would like.
The first is that it seems clear that the BSD people were not
particularly concerned about minimizing new system calls; instead
BSD was already adding a ton of new system features and system
calls. Between 4.0 BSD and 4.1c BSD, they went from 64 syscall table
entries (not all of them real syscalls) to 149 entries. In this
atmosphere, avoiding adding one more system call is not likely to have
been a big motivator or in fact even very much on people's minds. Nor
was networking the only source of additions; 4.1c BSD added rename(),
mkdir(), and rmdir(), for example.
The second is that C makes multi-return APIs more awkward than
single-return APIs. Contrast the pipe() API, where you must construct
a memory area for the two file descriptors and pass a pointer to it,
with the socket() API, where you simply assign the return value. Given
a choice, I think a lot of people are going to design a socket()-style
API rather than a pipe()-style API.
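To make the contrast concrete, here is a small sketch of the two call shapes in modern POSIX C (not period 4.1c BSD code, and the error handling is minimal):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        /* pipe(): you must supply memory for two FDs and pass a pointer to it */
        int fds[2];
        if (pipe(fds) == -1) {
            perror("pipe");
            return 1;
        }

        /* socket(): the single FD is simply the return value */
        int sock = socket(AF_INET, SOCK_STREAM, 0);
        if (sock == -1) {
            perror("socket");
            return 1;
        }

        close(fds[0]);
        close(fds[1]);
        close(sock);
        return 0;
    }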
There's also the related issue that one reason the pipe() API
works well returning two file descriptors is that the file
descriptors involved almost immediately go in different 'directions'
(often one goes to a sub-process); there aren't very many situations
where you want to pass both file descriptors around to functions
in your program. This is very much not the case in network-related
programs, especially programs that use select(); if socket()
et al returned two file descriptors, one for read and one for write,
I think that you'd find they were often passed around together.
Often you'd prefer them to be one descriptor that you could use
either for reading or writing depending on what you were doing at
the time. Many classical network programs (and protocols) alternate
reading and writing from the network, after all.
(Without processes that open multiple sockets, you might wonder
what select() is there for. The answer is programs like telnet
and rlogin (and their servers), which talk to both the network
and the tty at the same time. These were already present in 4.1c
BSD, at the dawn of the socket() API.)
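As a concrete sketch of the telnet/rlogin style pattern, here is roughly what such a select() loop looks like; I'm using the later FD_* macros for clarity even though 4.2 BSD era code would have manipulated a single bitmask word by hand, and I'm assuming that file descriptor 0 is the tty:

    #include <sys/select.h>

    /* Wait until the tty (assumed to be fd 0) or the network socket is
       readable, or the socket is writable when we have output queued for
       it. Note how the one socket FD shows up in both sets. */
    int wait_for_io(int sock, int have_pending_output)
    {
        fd_set rfds, wfds;

        FD_ZERO(&rfds);
        FD_ZERO(&wfds);
        FD_SET(0, &rfds);              /* the tty */
        FD_SET(sock, &rfds);           /* the network, for reading */
        if (have_pending_output)
            FD_SET(sock, &wfds);       /* the same FD again, for writing */
        return select(sock + 1, &rfds, &wfds, NULL, NULL);
    }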
Sidebar: The pipe() user API versus the kernel API
Before I actually looked at the 4.1c BSD kernel source code, I was
also going to say that the kernel-to-user API makes returning more
than one value awkward because your kernel code has to explicitly
fish through the pointer that userland has supplied it in things
like the pipe() system call. It turns out that this is false.
Instead, as far back as V7 and
probably further, the kernel-to-user API could return multiple
values; specifically, it could return two values. pipe() used
this to return both file descriptors without having to fish around
in your user process memory, and it was up to the C library to write
these two return values to your pipefd array.
I really should have expected this; in a kernel, no one wants to have to look at user process memory if they can help it. Returning two values instead of one just needs an extra register in the general assembly-level syscall API and there you are.
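Expressed as a C sketch (the real wrapper was assembly, and both two_regs and raw_pipe_trap() here are invented stand-ins for the two return registers), the C library side of pipe() conceptually did something like this:

    /* Invented stand-in for the raw system call trap that comes back with
       two values in registers (r0 and r1 on the PDP-11). */
    struct two_regs { int r0, r1; };

    static struct two_regs raw_pipe_trap(void)
    {
        struct two_regs r = { 3, 4 };  /* pretend the kernel handed back FDs 3 and 4 */
        return r;
    }

    /* What the C library pipe() wrapper conceptually does: store the two
       register values into the caller's array. */
    int my_pipe(int pipefd[2])
    {
        struct two_regs r = raw_pipe_trap();
        pipefd[0] = r.r0;    /* read end */
        pipefd[1] = r.r1;    /* write end */
        return 0;
    }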
2015-06-29
BSD Unix developed over more time than I usually think
Left to myself, I tend to sloppily think of 4.2 BSD as where all of the major development of BSD Unix took place and the point in time where what we think of as 'BSD Unix' formed. Sure, there were BSDs before and after 4.2 BSD, but I think of the before releases as just the preliminaries and the releases after 4.2 BSD as just polishing and refining things a bit. As I was reminded today, this view is in fact wrong.
If you'd asked me what 4.x BSD release inetd first appeared in, I
would have confidently told you that it had to have appeared in 4.2 BSD
along with all of the other networking stuff. Inetd is such a pivotal
bit of BSD networking (along with the services that it enables,
like finger) that of course it would be there from the start in 4.2,
right?
Wrong. It turns out that inetd only seems to have appeared in 4.3
BSD. In fact a number of related bits of 4.2 BSD are surprisingly
under-developed and different from what I think of as 'the BSD way'.
Obviously, finger in 4.2 BSD is not network enabled, but a more
fundamental thing is that 4.2 BSD limits processes to only 20 open
file descriptors at once (by default, and comments in the source
suggest that this cannot be raised above 30 no matter what).
Instead it is 4.3 BSD that introduced not just inetd but a higher
limit on the number of open file descriptors (normally 64).
With that higher limit came the modern FD_* set of macros used
to set, check, and clear bits in the select() file descriptor
bitmaps; 4.2 BSD didn't need these since the file descriptor masks
fit into a single 32-bit word.
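Here is my reconstruction of the difference in C (illustrative only, not code from either BSD release):

    #include <sys/select.h>

    void mask_styles(int fd)
    {
        /* 4.2 BSD style: the select() mask is a single word and you set
           bits in it by hand, which only works up to 32 descriptors. */
        long oldmask = 0;
        oldmask |= 1L << fd;
        (void)oldmask;

        /* 4.3 BSD and later: the FD_* macros operate on an fd_set bitmap
           that can cover more than one word's worth of descriptors. */
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        if (FD_ISSET(fd, &readfds))
            FD_CLR(fd, &readfds);
    }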
(I discovered this due to a Twitter conversation with Mark Jason Dominus. I now think my initial answer is almost certainly wrong, but that's going to be another entry.)
Sidebar: dup2() and BSD's low file descriptor limit
Given the existence of the dup2() system call, which in theory
lets you create a file descriptor with any FD number, you might
wonder how 4.2 BSD got away with a 32-bit word for the select()
bitmask. The answer turns out to be that 4.2 BSD simply forbade you
from dup2()'ing to a file descriptor number bigger than 19 (or
in general the NOFILE constant).
(You can see the code for this in the dup2() implementation.
In general a lot of the early Unix kernel source code is quite simple
and readable, which is handy at times like this.)
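The check itself is tiny; here's a hedged sketch of the idea (not the literal 4.2 BSD kernel code):

    #define NOFILE 20    /* 4.2 BSD's default per-process open file limit */

    /* dup2() simply refuses any target descriptor at or above the limit,
       so the single-word select() masks never need more than NOFILE bits. */
    int dup2_target_ok(int newfd)
    {
        return newfd >= 0 && newfd < NOFILE;
    }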
2015-06-22
Modern *BSDs have a much better init system than I was expecting
For a long time, the *BSDs (FreeBSD, OpenBSD, and NetBSD) had what
was essentially the classical BSD init system, with all of its weaknesses. They made
things a little bit simpler by having things like a configuration
file where you could set whether standard daemons were started or
not (and what arguments they got), instead of having to hand edit
your /etc/rc, but that was about the extent of their niceness.
When I started being involved with OpenBSD on our firewalls here, that was the 'BSD init system' that
I got used to (to the extent that I had anything to do with it at
all).
Well, guess what. While I wasn't looking, the *BSDs have introduced
a much better system called rc.d. The rc.d system is basically
a lightweight version of System V init; it strips out all of the
runlevels, rcN.d directories, SNN and KNN symlinks, and so on to
wind up with just shell scripts in /etc/rc.d and some additional
support stuff.
As far as I can tell from some quick online research, this system originated in NetBSD back in 2001 or so (see the bottom). FreeBSD then adopted it in FreeBSD 5.0, released in January 2003, although they may not have pushed it widely initially (their Practical rc.d scripting in BSD has an initial copyright date of 2005). OpenBSD waited for quite a while (in the OpenBSD way), adopting it only in OpenBSD 4.9 (cf), which came out in May of 2011.
Of course what this really means is that I haven't looked into the state of modern *BSDs for quite a while. Specifically, I haven't looked into FreeBSD (I'm not interested in OpenBSD for anything except its specialist roles). For various reasons I haven't historically been interested in FreeBSD, so my vague impressions of it basically froze a long time ago. Clearly this is somewhat of a mistake and FreeBSD has moved well forward from what I naively expected. Ideally I should explore modern FreeBSD at some point.
(The trick with doing this is finding something real to use FreeBSD for. It's not going to be my desktop and it's probably not going to be any of our regular servers, although it's always possible that FreeBSD would be ideal for something and we just don't know it because we don't know FreeBSD.)
2015-06-21
Why System V init's split scripts approach is better than classical BSD
Originally, Unix had very simple startup and shutdown processes. The System V init system modernized them, resulting in important improvements over the classical BSD one. Although I've discussed those improvements in passing, today I want to talk about why the general idea behind the System V init system is so important and useful.
The classical BSD approach to system init is that there are /etc/rc
and /etc/rc.local shell scripts that are run on boot. All daemon
starting and other boot time processing is done from one or the
other. There is no special shutdown processing; to shut the machine
down you just kill all of the processes (and then make a system
call to actually reboot). This has the positive virtue that it's
really simple, but it's got some drawbacks.
This approach works fine starting the system (orderly system shutdown
was out of scope originally). It also works fine for restarting
daemons, provided that your daemons are single-process things that
can easily be shut down with 'kill' and then restarted with more
or less 'daemon &'. Initially this was the case in 4.x BSD, but
as time went on and Unix vendors added complications like NFS, more
and more things departed from this simple 'start a process; kill a
process; start a process again' model of starting and restarting.
The moment people started to have more complicated startup and
shutdown needs than 'kill' and 'daemon &', we started to have
problems. Either you carefully memorized all of this stuff or you
kept having to read /etc/rc to figure out what to do to restart
or redo thing X. Does something need a multi-step startup? You're
going to be entering those multiple steps yourself. Does something
need you to kill four or five processes to shut it down properly?
Get used to doing that, and don't forget one. All of this was a
pain even in the best cases (which were single daemon processes that
merely required the right magic command line arguments).
(In practice people not infrequently wrote their own scripts that
did all of this work, then ran the scripts from /etc/rc or
/etc/rc.local. But there was always a temptation to skip that
step because after all your thing was so short, you could put it
in directly.)
By contrast, the System V init approach of separate scripts puts
that knowledge into reusable components. Need to stop or start or
restart something? Just run '/etc/init.d/<whatever> <what>' and
you're done. What the init.d scripts are called is small enough
knowledge that you can probably keep it in your head, and if you
forget it's usually easy enough to look it up with an ls.
(Separate scripts are also easier to manage than a single monolithic file.)
Of course you don't need the full complexity of System V init in
order to realize these advantages. In fact, back in the long ago
days when I dealt with a classical BSD init system I decided that
the split scripts approach was such a big win that I was willing
to manually split up /etc/rc into separate scripts just to get a
rough approximation of it. The result was definitely worth the
effort; it made my sysadmin life much easier.
(This manual split of much of /etc/rc is the partial init system
I mentioned here.)
2015-06-16
NFS writes and whether or not they're synchronous
In the original NFS v2, the situation with
writes was relatively simple. The protocol specified that the server
could only acknowledge write operations when it had committed them
to disk, both for file data writes and for metadata operations such
as creating files and directories, renaming files, and so on.
Clients were free to buffer writes locally before sending them to
the server and generally did, just as they buffered writes before
sending them to local disks. As usual, when a client program did
a sync() or a fsync(), this caused the client kernel to flush
any locally buffered writes to the server, which would then commit
them to disk and acknowledge them.
(You could sometimes tell clients not to do any local buffering and to immediately send all writes to the server, which theoretically resulted in no buffering anywhere.)
This worked and was simple (a big virtue in early NFS), but didn't really go very fast under a lot of circumstances. NFS server vendors did various things to speed writes up, from battery backed RAM on special cards to simply allowing the server to lie to clients about their data being on disk (which results in silent data loss if the server then loses that data, eg due to a power failure or abrupt reboot).
In NFS v3 the protocol was revised to add asynchronous writes and
a new operation, COMMIT, to force the server to really flush your
submitted asynchronous writes to disk. An NFS v3 server is permitted
to lose submitted asynchronous writes up until you issue a successful
COMMIT operation; this implies that the client must hang on to a
copy of the written data so that it can resend it if needed. Of
course, the server can start writing your data earlier if it wants
to; it's up to the server. In addition clients can specify that
their writes are synchronous, reverting NFS v3 back to the v2
behavior.
(See RFC 1813 for the gory details. It's actually surprisingly readable.)
In the simple case the client kernel will send a single COMMIT
at the end of writing the file (for example, when your program
closes it or fsync()s it). But if your program writes a large
enough file, the client kernel won't want to buffer all of it in
memory and so will start sending COMMIT operations to the server
every so often so it can free up some of those write buffers. This
can cause unexpected slowdowns under some circumstances, depending on a lot of factors.
(Note that just as with other forms of writeback disk IO, the client
kernel may do these COMMITs asynchronously from your program's
activity. Or it may opt to not try to be that clever and just force
a synchronous COMMIT pause on your program every so often. There
are arguments either way.)
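For illustration, here's the client-side view of this pattern in ordinary POSIX C; the NFS v3 mechanics (unstable writes followed by a COMMIT) all happen inside the client kernel, and the path here is made up:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[8192];
        memset(buf, 'x', sizeof(buf));

        int fd = open("/nfs/scratch/example-file",
                      O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd == -1) {
            perror("open");
            return 1;
        }

        /* These become locally buffered, asynchronous writes to the server. */
        for (int i = 0; i < 1000; i++)
            write(fd, buf, sizeof(buf));

        /* Now the client kernel must flush outstanding writes and issue a
           COMMIT (resending anything the server admits to having lost). */
        if (fsync(fd) == -1)
            perror("fsync");

        close(fd);
        return 0;
    }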
If you write NFS v3 file data synchronously on the client, either
by using O_SYNC or by appropriate NFS mount options, the client
will not just immediately send it to the server without local
buffering (the way it did in NFS v2); it will also insist that the
server write it to disk synchronously. This means that forced
synchronous client IO in NFS v3 causes a bigger change in performance
than in NFS v2; basically you reduce NFS v3 down to NFS v2 end to
end synchronous writes. You're not just eliminating client buffering,
you're eliminating all buffering and increasing how many IOPs the
server must do (well, compared to normal NFS v3 write IO).
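As a small sketch of forcing this from a program (rather than via mount options), opening the file with O_SYNC is enough; again the path is made up and this is ordinary POSIX C, with the NFS-specific behavior happening in the client kernel:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char msg[] = "written synchronously\n";

        int fd = open("/nfs/scratch/sync-file",
                      O_WRONLY | O_CREAT | O_SYNC, 0644);
        if (fd == -1) {
            perror("open");
            return 1;
        }

        /* With O_SYNC, write() doesn't return until the data is stable,
           which on an NFS v3 client means an immediate, synchronous write
           to the server with no local buffering. */
        if (write(fd, msg, sizeof(msg) - 1) == -1)
            perror("write");

        close(fd);
        return 0;
    }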
All of this is just for file data writes. NFS v3 metadata operations
are still just as synchronous as they were in NFS v2, so things
like 'rm -rf' on a big source tree are just as slow as they used
to be.
(I don't know enough about NFS v4 to know how it handles synchronous and asynchronous writes.)
2015-06-15
My view of NFS protocol versions
There are three major versions of the NFS protocol that you may encounter or hear about, NFS v2, v3, and v4. Today I feel like running down my understanding of the broad and general differences between them.
NFS v2 is the original version of NFS. It dates from 1985 and boy does it show in the protocol. NFS v2 is obsolete today and should not be used, partly because it's a 32-bit protocol that doesn't allow access to large files. You might wonder why we care about NFS v2 in the modern era, and the answer to that is that a great deal of practical system administration folklore about NFS is based on NFS v2 behavior. Knowing what NFS v2 did can let you understand why people still often believe various things about NFS in general (or the NFS implementations on specific Unixes). NFS v2 was originally UDP only, although I think you can use it over TCP these days if you really want to.
NFS v3 is the 'modern' version, specified in 1995 and adopted steadily since then. Besides being 64-bit and so being able to deal with large files, it added a bunch of important performance improvements. Support for NFS over TCP was generally added (and made to work well) with NFS v3, although systems made it available for NFS v2 as well. NFS v3 is fundamentally the same as NFS v2; it could be described as 'NFS v2 with obvious tweaks'. NFS v2 environments could generally be easily moved to NFS v3 when the client and server support materialized and they'd generally see better performance.
For most people, the biggest performance difference between NFS v2 and NFS v3 is that in NFS v2 all writes are synchronous and in NFS v3 they're not necessarily so. This is a sufficiently complicated subject that it needs its own entry.
NFS v4 dates from the early 2000s and is a major change from previous versions of NFS. The core NFS protocol got much more complex (partly because it swallowed a number of what had previously been side protocols for things like mounting and locking) and a bunch of core assumptions changed. Most importantly for many people running real NFS servers (us included), NFS v4 is (primarily) designed to be a real distributed filesystem with real security, and is often described as requiring that. However you can apparently run it with traditional NFS 'we trust clients' security if you want to, and things may even work decently that way these days.
(NFS v4 is apparently not supported on OpenBSD, although it is on Linux, OmniOS, Solaris, and FreeBSD.)
Initial NFS v4 server implementations put various restrictions on how you could arrange your NFS exports; for example, they might have to all be located under a single directory on the server. Current NFS v4 server implementations on at least Linux and OmniOS seem to have removed this requirement, although writeups on the Internet haven't necessarily caught up with this. As a result it's now common for such servers to export everything for both NFS v3 and NFS v4 if you don't do anything special.
My personal experience with NFS v4 is minimal. We very much don't want its security improvements and nothing else we've heard has sounded particularly compelling, so we run NFS v3. The few times I've wound up using NFS v4 it's been because a new out of the box server (still) allowed clients to do NFS v4 mounts, the clients defaulted to it, and the mounts had odd things going on with them that caused me to notice this. I suspect that we could make NFS v4 transparently equivalent to NFS v3 for us with more configuration work, but we haven't so far and I'm not sure we'd really get anything from it.
(Because I've primarily associated NFS v4 with its (undesired for us) security improvements (partly because that's what a lot of people talk about), I've historically had a bad view of it and modern NFS protocol development. This is probably a mild mistake by now.)
(Note that going to NFS v4 with AUTH_SYS authentication wouldn't
get us around the 16 groups limitation.)