Wandering Thoughts archives

2015-06-30

The probable and prosaic explanation for a socket() API choice

It started on Twitter:

@mjdominus: Annoyed today that the BSD people had socket(2) return a single FD instead of a pair the way pipe(2) does. That necessitated shutdown(2).

@thatcks: I suspect they might have felt forced to single-FD returns by per-process and total kernel-wide FD limits back then.

I came up with this idea off the cuff and it felt convincing in the moment that I tweeted it; after all, if you have a socket server or the like, such as inetd, moving to a two-FD model for sockets means that you've just more or less doubled the number of file descriptors your process needs. Today we're used to systems that let processes have a lot of open file descriptors at once, but historically Unix had much lower limits and it's not hard to imagine inetd running into them.

It's a wonderful theory but it immediately runs aground on the practical reality that socket() and accept() were introduced no later than 4.1c BSD, while inetd only arrived in 4.3 BSD (which was years later). Thus it seems very unlikely that the BSD developers were thinking ahead to processes that would open a lot of sockets at the time that the socket() API was designed. Instead I think that there are much simpler and more likely explanations for why the API isn't the way Mark Jason Dominus would like.

The first is that it seems clear that the BSD people were not particularly concerned about minimizing new system calls; BSD was already adding a ton of new system features and system calls. Between 4.0 BSD and 4.1c BSD, they went from 64 syscall table entries (not all of them real syscalls) to 149 entries. In this atmosphere, avoiding one more system call is not likely to have been a big motivator or in fact even very much on people's minds. Nor was networking the only source of additions; 4.1c BSD added rename(), mkdir(), and rmdir(), for example.

The second is that C makes multi-return APIs more awkward than single-return APIs. Contrast the pipe() API, where you must construct a memory area for the two file descriptors and pass a pointer to it, with the socket() API, where you simply assign the return value. Given a choice, I think a lot of people are going to design a socket()-style API rather than a pipe()-style API.
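
To make the contrast concrete, here's a minimal sketch using the standard calls. Notice that pipe() requires you to supply memory and pass a pointer to it, while socket()'s new descriptor is just the return value:

    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        /* pipe(): the caller constructs a two-element array and passes
           a pointer; both descriptors come back through that pointer. */
        int pfds[2];
        if (pipe(pfds) < 0)
            return 1;

        /* socket(): the single new descriptor is the return value. */
        int sfd = socket(AF_INET, SOCK_STREAM, 0);
        if (sfd < 0)
            return 1;

        close(pfds[0]);
        close(pfds[1]);
        close(sfd);
        return 0;
    }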

There's also the related issue that one reason the pipe() API works well returning two file descriptors is that the file descriptors involved almost immediately go in different 'directions' (often one goes to a sub-process); there aren't very many situations where you want to pass both file descriptors around to functions in your program. This is very much not the case in network-related programs, especially programs that use select(); if socket() et al returned two file descriptors, one for read and one for write, I think that you'd find they were often passed around together. Often you'd prefer them to be one descriptor that you could use either for reading or writing depending on what you were doing at the time. Many classical network programs (and protocols) alternate reading and writing from the network, after all.

(Without processes that open multiple sockets, you might wonder what select() is there for. The answer is programs like telnet and rlogin (and their servers), which talk to both the network and the tty at the same time. These were already present in 4.1c BSD, at the dawn of the socket() API.)
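
As an illustration of that pattern, here's a sketch of the select() loop such a program runs, written with the modern FD_* macros (shuttle and netfd are names I've made up; netfd is assumed to be an already-connected socket):

    #include <sys/select.h>
    #include <unistd.h>

    /* Copy traffic between the tty (fds 0 and 1) and the network until
       either side reaches EOF. */
    void shuttle(int netfd) {
        char buf[1024];
        for (;;) {
            fd_set rfds;
            FD_ZERO(&rfds);
            FD_SET(0, &rfds);        /* the tty */
            FD_SET(netfd, &rfds);    /* the network */
            if (select(netfd + 1, &rfds, NULL, NULL, NULL) < 0)
                return;
            if (FD_ISSET(0, &rfds)) {
                ssize_t n = read(0, buf, sizeof buf);
                if (n <= 0) return;
                write(netfd, buf, n);
            }
            if (FD_ISSET(netfd, &rfds)) {
                ssize_t n = read(netfd, buf, sizeof buf);
                if (n <= 0) return;
                write(1, buf, n);
            }
        }
    }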

Sidebar: The pipe() user API versus the kernel API

Before I actually looked at the 4.1c BSD kernel source code, I was also going to say that the kernel-to-user API makes returning more than one value awkward because your kernel code has to explicitly fish through the pointer that userland has supplied it in things like the pipe() system call. It turns out that this is false. Instead, as far back as V7 and probably further, the kernel-to-user API could return multiple values; specifically, it could return two values. pipe() used this to return both file descriptors without having to fish around in your user process memory, and it was up to the C library to write these two return values to your pipefd array.

I really should have expected this; in a kernel, no one wants to have to look at user process memory if they can help it. Returning two values instead of one just needs an extra register in the general assembly-level syscall API and there you are.
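
To make the division of labor concrete, here's a hypothetical C-level sketch of the arrangement. The real V7 wrapper is a few lines of assembly; raw_pipe_syscall() and struct two_fds are stand-ins I've invented for the two return registers:

    #include <stdio.h>

    /* Stand-in for the two registers (r0/r1 on the PDP-11) that the
       kernel used to hand back both descriptors. */
    struct two_fds { int fd0; int fd1; };

    static struct two_fds raw_pipe_syscall(void) {
        /* In reality this is a trap into the kernel; here we just fake
           two plausible descriptor numbers. */
        struct two_fds r = { 3, 4 };
        return r;
    }

    /* What the C library's pipe() wrapper conceptually does: copy the
       two returned values into the caller's array. The kernel never
       touches user process memory. */
    static int my_pipe(int pipefd[2]) {
        struct two_fds r = raw_pipe_syscall();
        pipefd[0] = r.fd0;    /* the read end */
        pipefd[1] = r.fd1;    /* the write end */
        return 0;
    }

    int main(void) {
        int fds[2];
        my_pipe(fds);
        printf("pipe fds: %d %d\n", fds[0], fds[1]);
        return 0;
    }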

SocketReturnAPIDesign written at 01:10:44

2015-06-29

BSD Unix developed over more time than I usually think

Left to myself, I tend to sloppily think of 4.2 BSD as where all of the major development of BSD Unix took place and the point in time where what we think of as 'BSD Unix' formed. Sure, there were BSDs before and after 4.2 BSD, but I think of the before releases as just the preliminaries and the releases after 4.2 BSD as just polishing and refining things a bit. As I was reminded today, this view is in fact wrong.

If you'd asked me what 4.x BSD release inetd first appeared in, I would have confidently told you that it had to have appeared in 4.2 BSD along with all of the other networking stuff. Inetd is such a pivotal bit of BSD networking (along with the services that it enables, like finger) that of course it would be there from the start in 4.2, right?

Wrong. It turns out that inetd only seems to have appeared in 4.3 BSD. In fact a number of related bits of 4.2 BSD are surprisingly under-developed and different from what I think of as 'the BSD way'. Obviously, finger in 4.2 BSD is not network-enabled, but a more fundamental thing is that 4.2 BSD limits processes to only 20 open file descriptors at once (by default, and comments in the source suggest that this cannot be raised above 30 no matter what).

Instead it is 4.3 BSD that introduced not just inetd but a higher limit on the number of open file descriptors (normally 64). With that higher limit came the modern FD_* set of macros used to set, check, and clear bits in the select() file descriptor bitmaps; 4.2 BSD didn't need these since the file descriptor masks fit into a single 32-bit word.
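
As a sketch of the difference (not the literal 4.2 BSD spelling), watching a descriptor went from setting a bit in a plain int yourself to going through the FD_* macros:

    #include <sys/select.h>

    void example(void) {
        /* 4.2 BSD style: the mask was a single 32-bit int, which could
           only cover file descriptors 0 through 31. */
        int readfds_42 = 0;
        readfds_42 |= 1 << 5;         /* watch fd 5 */

        /* 4.3 BSD and later: an fd_set may span multiple words, so you
           manipulate it through the FD_* macros instead. */
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(5, &readfds);          /* watch fd 5 */

        (void)readfds_42;
    }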

(I discovered this due to a Twitter conversation with Mark Jason Dominus. I now think my initial answer is almost certainly wrong, but that's going to be another entry.)

Sidebar: dup2() and BSD's low file descriptor limit

Given the existence of the dup2() system call, which in theory lets you create a file descriptor with any FD number, you might wonder how 4.2 BSD got away with a 32-bit word for the select() bitmask. The answer turns out to be that 4.2 BSD simply forbade you from dup2()'ing to a file descriptor number bigger than 19 (in general, anything at or above the NOFILE constant).

(You can see the code for this in the dup2() implementation. In general a lot of the early Unix kernel source code is quite simple and readable, which is handy at times like this.)
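
The check itself is tiny. Here's a sketch modeled on the description above, not a quote of the actual 4.2 BSD source:

    #include <errno.h>

    #define NOFILE 20    /* 4.2 BSD's per-process open file limit */

    /* The guard 4.2 BSD's dup2() applied to the target descriptor
       number: anything at or past NOFILE is rejected outright. */
    static int dup2_target_ok(int newfd) {
        if (newfd < 0 || newfd >= NOFILE) {
            errno = EBADF;
            return 0;
        }
        return 1;
    }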

BSDExtendedDevelopment written at 01:53:39

2015-06-22

Modern *BSDs have a much better init system than I was expecting

For a long time, the *BSDs (FreeBSD, OpenBSD, and NetBSD) had what was essentially the classical BSD init system, with all of its weaknesses. They made things a little bit simpler by having, for example, a configuration file where you could set whether standard daemons were started or not (and what arguments they got), instead of having to hand-edit your /etc/rc, but that was about the extent of their niceness. When I started being involved with OpenBSD on our firewalls here, that was the 'BSD init system' that I got used to (to the extent that I had anything to do with it at all).

Well, guess what. While I wasn't looking, the *BSDs have introduced a much better system called rc.d. The rc.d system is basically a lightweight version of System V init; it strips out all of the runlevels, rcN.d directories, SNN and KNN symlinks, and so on to wind up with just shell scripts in /etc/rc.d and some additional support stuff.

As far as I can tell from some quick online research, this system originated in NetBSD back in 2001 or so (see the bottom). FreeBSD then adopted it in FreeBSD 5.0, released in January 2003, although they may not have pushed it widely initially (their Practical rc.d scripting in BSD has an initial copyright date of 2005). OpenBSD waited for quite a while (in the OpenBSD way), adopting it only in OpenBSD 4.9 (cf), which came out in May of 2011.

Of course what this really means is that I haven't looked into the state of modern *BSDs for quite a while. Specifically, I haven't looked into FreeBSD (I'm not interested in OpenBSD for anything except its specialist roles). For various reasons I haven't historically been interested in FreeBSD, so my vague impressions of it basically froze a long time ago. Clearly this is somewhat of a mistake and FreeBSD has moved well forward from what I naively expected. Ideally I should explore modern FreeBSD at some point.

(The trick with doing this is finding something real to use FreeBSD for. It's not going to be my desktop and it's probably not going to be any of our regular servers, although it's always possible that FreeBSD would be ideal for something and we just don't know it because we don't know FreeBSD.)

ModernBSDInitSurprise written at 01:46:52

2015-06-21

Why System V init's split scripts approach is better than classical BSD

Originally, Unix had very simple startup and shutdown processes. The System V init system modernized them, resulting in important improvements over the classical BSD one. Although I've discussed those improvements in passing, today I want to talk about why the general idea behind the System V init system is so important and useful.

The classical BSD approach to system init is that there are /etc/rc and /etc/rc.local shell scripts that are run on boot. All daemon starting and other boot time processing is done from one or the other. There is no special shutdown processing; to shut the machine down you just kill all of the processes (and then make a system call to actually reboot). This has the positive virtue that it's really simple, but it's got some drawbacks.

This approach works fine for starting the system (orderly system shutdown was out of scope originally). It also works fine for restarting daemons, provided that your daemons are single-process things that can easily be shut down with 'kill' and then restarted with more or less 'daemon &'. Initially this was the case in 4.x BSD, but as time went on and Unix vendors added complications like NFS, more and more things departed from this simple 'start a process; kill a process; start a process again' model of starting and restarting.

The moment people started to have more complicated startup and shutdown needs than 'kill' and 'daemon &', we started to have problems. Either you carefully memorized all of this stuff or you kept having to read /etc/rc to figure out what to do to restart or redo thing X. Does something need a multi-step startup? You're going to be entering those multiple steps yourself. Does something need you to kill four or five processes to shut it down properly? Get used to doing that, and don't forget one. All of this was a pain even in the best cases (which were single daemon processes that merely required the right magic command-line arguments).

(In practice people not infrequently wrote their own scripts that did all of this work, then ran the scripts from /etc/rc or /etc/rc.local. But there was always a temptation to skip that step because after all your thing was so short, you could put it in directly.)

By contrast, the System V init approach of separate scripts puts that knowledge into reusable components. Need to stop or start or restart something? Just run '/etc/init.d/<whatever> <what>' and you're done. What the init.d scripts are called is small enough knowledge that you can probably keep it in your head, and if you forget it's usually easy enough to look it up with an ls.

(Separate scripts are also easier to manage than a single monolithic file.)

Of course you don't need the full complexity of System V init in order to realize these advantages. In fact, back in the long-ago days when I dealt with a classical BSD init system I decided that the split scripts approach was such a big win that I was willing to manually split up /etc/rc into separate scripts just to get a rough approximation of it. The result was definitely worth the effort; it made my sysadmin life much easier.

(This manual split of much of /etc/rc is the partial init system I mentioned here.)

BSDInitSingleFileWeakness written at 02:05:13

2015-06-16

NFS writes and whether or not they're synchronous

In the original NFS v2, the situation with writes was relatively simple. The protocol specified that the server could only acknowledge write operations when it had committed them to disk, both for file data writes and for metadata operations such as creating files and directories, renaming files, and so on. Clients were free to buffer writes locally before sending them to the server and generally did, just as they buffered writes before sending them to local disks. As usual, when a client program did a sync() or a fsync(), this caused the client kernel to flush any locally buffered writes to the server, which would then commit them to disk and acknowledge them.

(You could sometimes tell clients not to do any local buffering and to immediately send all writes to the server, which theoretically resulted in no buffering anywhere.)

This worked and was simple (a big virtue in early NFS), but didn't really go very fast in a lot of circumstances. NFS server vendors did various things to speed writes up, from battery-backed RAM on special cards to simply allowing the server to lie to clients about their data being on disk (which results in silent data loss if the server then loses that data, eg due to a power failure or abrupt reboot).

In NFS v3 the protocol was revised to add asynchronous writes and a new operation, COMMIT, to force the server to really flush your submitted asynchronous writes to disk. An NFS v3 server is permitted to lose submitted asynchronous writes up until you issue a successful COMMIT operation; this implies that the client must hang on to a copy of the written data so that it can resend it if needed. Of course, the server can start writing your data earlier if it wants to; it's up to the server. In addition clients can specify that their writes are synchronous, reverting NFS v3 back to the v2 behavior.

(See RFC 1813 for the gory details. It's actually surprisingly readable.)

In the simple case the client kernel will send a single COMMIT at the end of writing the file (for example, when your program closes it or fsync()s it). But if your program writes a large enough file, the client kernel won't want to buffer all of it in memory and so will start sending COMMIT operations to the server every so often so it can free up some of those write buffers. This can cause unexpected slowdowns under some circumstances, depending on a lot of factors.
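
From the client's point of view it looks something like this minimal sketch (the path is made up); the fsync() is what turns into a COMMIT on the wire:

    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        /* Assume /nfs/data is an NFS v3 mount; the path is hypothetical. */
        int fd = open("/nfs/data/results", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return 1;

        /* These writes may go out as asynchronous WRITEs, which the
           server is allowed to lose until we successfully COMMIT. */
        char buf[8192] = { 0 };
        for (int i = 0; i < 128; i++)
            write(fd, buf, sizeof buf);

        /* fsync() makes the client kernel send a COMMIT and wait for
           it; only then is the data guaranteed to be on the server's
           disk. */
        if (fsync(fd) < 0)
            return 1;
        close(fd);
        return 0;
    }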

(Note that just as with other forms of writeback disk IO, the client kernel may do these COMMITs asynchronously from your program's activity. Or it may opt to not try to be that clever and just force a synchronous COMMIT pause on your program every so often. There are arguments either way.)

If you write NFS v3 file data synchronously on the client, either by using O_SYNC or by appropriate NFS mount options, the client will not just immediately send it to the server without local buffering (the way it did in NFS v2), it will also insist that the server write it to disk synchronously. This means that forced synchronous client IO in NFS v3 causes a bigger change in performance than in NFS v2; basically you reduce NFS v3 down to NFS v2 end-to-end synchronous writes. You're not just eliminating client buffering, you're eliminating all buffering and increasing how many IOPS the server must do (well, compared to normal NFS v3 write IO).
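
In code the only visible difference is the open flag (again a sketch with a made-up path), but the effect on the wire is large; every write() becomes a synchronous WRITE (FILE_SYNC, in RFC 1813's terms) that the server must put on disk before replying:

    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        /* O_SYNC on an NFS v3 mount: no client buffering, and the
           server commits each write before acknowledging it. */
        int fd = open("/nfs/data/journal", O_WRONLY | O_CREAT | O_SYNC, 0644);
        if (fd < 0)
            return 1;
        const char rec[] = "one synchronous record\n";
        write(fd, rec, sizeof rec - 1);   /* returns only once it's on disk */
        close(fd);
        return 0;
    }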

All of this is just for file data writes. NFS v3 metadata operations are still just as synchronous as they were in NFS v2, so things like 'rm -rf' on a big source tree are just as slow as they used to be.

(I don't know enough about NFS v4 to know how it handles synchronous and asynchronous writes.)

NFSWritesAndSync written at 00:44:42

2015-06-15

My view of NFS protocol versions

There are three major versions of the NFS protocol that you may encounter or hear about, NFS v2, v3, and v4. Today I feel like running down my understanding of the broad and general differences between them.

NFS v2 is the original version of NFS. It dates from 1985 and boy does it show in the protocol. NFS v2 is obsolete today and should not be used, partly because it's a 32-bit protocol that doesn't allow access to large files. You might wonder why we care about NFS v2 in the modern era, and the answer to that is that a great deal of practical system administration folklore about NFS is based on NFS v2 behavior. Knowing what NFS v2 did can let you understand why people still often believe various things about NFS in general (or the NFS implementations on specific Unixes). NFS v2 was originally UDP only, although I think you can use it over TCP these days if you really want to.

NFS v3 is the 'modern' version, specified in 1995 and adopted steadily since then. Besides being 64-bit and so being able to deal with large files, it added a bunch of important performance improvements. Support for NFS over TCP was generally added (and made to work well) with NFS v3, although systems made it available for NFS v2 as well. NFS v3 is fundamentally the same as NFS v2; it could be described as 'NFS v2 with obvious tweaks'. NFS v2 environments could generally be easily moved to NFS v3 when the client and server support materialized and they'd generally see better performance.

For most people, the biggest performance difference between NFS v2 and NFS v3 is that in NFS v2 all writes are synchronous and in NFS v3 they're not necessarily so. This is a sufficiently complicated subject that it needs its own entry.

NFS v4 dates from the early 2000s and is a major change from previous versions of NFS. The core NFS protocol got much more complex (partly because it swallowed a number of what had previously been side protocols for things like mounting and locking) and a bunch of core assumptions changed. Most importantly for many people running real NFS servers (us included), NFS v4 is (primarily) designed to be used with real security and is often described as requiring it. However, you can apparently run it with traditional NFS 'we trust clients' security if you want and things may even work decently that way these days.

(NFS v4 is apparently not supported on OpenBSD, although it is on Linux, OmniOS, Solaris, and FreeBSD.)

Initial NFS v4 server implementations put various restrictions on how you could arrange your NFS exports; for example, they might have to all be located under a single directory on the server. Current NFS v4 server implementations on at least Linux and OmniOS seem to have removed this requirement, although writeups on the Internet haven't necessarily caught up with this. As a result it's now common for such servers to export everything for both NFS v3 and NFS v4 if you don't do anything special.

My personal experience with NFS v4 is minimal. We very much don't want its security improvements and nothing else we've heard has sounded particularly compelling, so we run NFS v3. The few times I've wound up using NFS v4 it's been because a new out-of-the-box server (still) allowed clients to do NFS v4 mounts, the clients defaulted to it, and the mounts had odd things going on with them that caused me to notice this. I suspect that we could make NFS v4 transparently equivalent to NFS v3 for us with more configuration work, but we haven't so far and I'm not sure we'd really get anything from it.

(Because I've primarily associated NFS v4 with its (undesired for us) security improvements (partly because that's what a lot of people talk about), I've historically had a bad view of it and of modern NFS protocol development. This is probably a mild mistake by now.)

(Note that going to NFS v4 with AUTH_SYS authentication wouldn't get us around the 16 groups limitation.)

NFSVersionsView written at 01:51:06

