Wandering Thoughts archives


fork() and closing file descriptors

As I noted in Why fork() is a good API, back in the bad old days Unix had a problem of stray file descriptors leaking from processes into commands that they ran (for example, rsh used to gift your shell process with any number of strays). In theory the obvious way to solve this is to have code explicitly close all file descriptors before it exec()s something. In practice Unix has chosen to solve this with a special flag on file descriptors, FD_CLOEXEC, which causes them to be automatically closed when the process exec()s.
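
As a minimal sketch of what setting the flag involves, here is roughly how it looks in Python with the standard fcntl module. The helper names set_cloexec and has_cloexec are mine, purely for illustration; the underlying fcntl() operations are F_GETFD and F_SETFD.

```python
import fcntl
import os


def set_cloexec(fd):
    """Mark fd so that the kernel closes it automatically on exec()."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


def has_cloexec(fd):
    """Report whether FD_CLOEXEC is currently set on fd."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


if __name__ == "__main__":
    r, w = os.pipe()
    # Python 3.4+ creates descriptors with FD_CLOEXEC already set, so clear
    # it first to mimic what a plain C pipe() call would hand back.
    os.set_inheritable(r, True)
    print("before:", has_cloexec(r))
    set_cloexec(r)
    print("after:", has_cloexec(r))
    os.close(r)
    os.close(w)
```

Note that this is two separate system calls on a descriptor that already exists, which matters for the thread race discussed later in this entry.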

In that entry I mentioned that there was a good reason for this alternate solution in practice. At the start of planning this followup entry I had a nice story all put together in my head about why this was so, involving thread-based concurrency races. Unfortunately that story is wrong (although a closely related concurrency race story is the reason for things like O_CLOEXEC in Linux's open()).

FD_CLOEXEC is not necessary to deal with a concurrency race between thread A creating a new file descriptor and thread B fork()ing and then exec()ing in the child process, because the child's file descriptors are frozen at the moment that it's created by fork() (with a standard fork()). It's perfectly safe for the child process to manually close all stray open file descriptors in user-level code, because no matter what it does thread A can never make new file descriptors appear in the child process partway through this. Either they're there at the start (and will get closed by the user-level code), or they'll never be there at all.
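
The 'frozen at fork()' behavior can be checked directly. The following is only a sketch, and it uses a parent process rather than a second thread to stand in for thread A: the parent opens a new descriptor after the fork() and tells the child its number over a pipe, and the child probes that number in its own descriptor table. The function name is hypothetical.

```python
import os


def child_sees_later_fd():
    """Return True if an fd opened by the parent after fork() exists in the child."""
    sync_r, sync_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: learn the number of the parent's new descriptor, then probe
        # that number in our own private copy of the descriptor table.
        code = 2
        try:
            os.close(sync_w)
            data = b""
            while True:
                chunk = os.read(sync_r, 64)
                if not chunk:
                    break
                data += chunk
            try:
                os.fstat(int(data))
                code = 1  # the number is open here
            except OSError:
                code = 0  # nothing is open at that number
        finally:
            os._exit(code)
    # Parent: open the new descriptor before closing anything, so that its
    # number was unused in both tables at the moment of the fork().
    new_fd = os.open(os.devnull, os.O_RDONLY)
    os.write(sync_w, str(new_fd).encode())
    os.close(sync_w)
    os.close(sync_r)
    _, status = os.waitpid(pid, 0)
    os.close(new_fd)
    if not os.WIFEXITED(status) or os.WEXITSTATUS(status) not in (0, 1):
        raise RuntimeError("child did not finish the probe")
    return os.WEXITSTATUS(status) == 1
```

With a standard fork() this should return False, since nothing the parent does afterward can add entries to the child's table.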

There are, however, several practical reasons that FD_CLOEXEC exists. First and foremost, it proved pragmatically easier to get code (often library code) to set FD_CLOEXEC than to get every bit of code that did a fork() and exec() sequence to always clean up file descriptors properly. It also means that you don't have to worry about file descriptors being created in the child process in various ways, especially by library code (which might be threaded code, for extra fun). Finally, it deals with the problem that Unix has no portable API for finding out what file descriptors your process has open, so your only portable way of closing all stray file descriptors in user code is the brute force approach of looping trying to close each one in turn (and on modern Unixes, that can be a lot of potential file descriptors).
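
To make the brute force approach concrete, here is a rough sketch of the loop, meant to be run in the child between the fork() and the exec(). The function name, the starting point of 3 (leaving standard input, output, and error alone), and the fallback bound are all assumptions of this sketch, not anything standard.

```python
import os
import resource


def close_stray_fds(lowest=3, limit=None):
    """Try to close every descriptor number in the range [lowest, limit)."""
    if limit is None:
        soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        # Assumption for this sketch: fall back to an arbitrary bound when
        # the limit is reported as unlimited.
        limit = 65536 if soft == resource.RLIM_INFINITY else soft
    for fd in range(lowest, limit):
        try:
            os.close(fd)
        except OSError:
            pass  # EBADF: nothing was open at this number
```

Python's os.closerange(lowest, limit) does the same job in one call. Either way the cost scales with the descriptor limit, not with the number of descriptors that are actually open, which is why a large limit makes this approach unattractive.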

Once you have FD_CLOEXEC and programs that assume they can use it to just fork() and exec(), you have the thread races that lead you to needing things like O_CLOEXEC. Any time a file descriptor can come into existence without FD_CLOEXEC being set on it, you have a race between thread A creating the file descriptor and then setting FD_CLOEXEC and thread B doing a fork() and exec(). If thread B 'wins' this race, the child process it creates will inherit a new file descriptor that does not have FD_CLOEXEC set, and this file descriptor will leak through the exec().
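
The fix is to have the flag set as part of creating the descriptor, so there is no window between the two steps. A minimal sketch in Python, with a helper name of my own invention (Python's os.open() has set close-on-exec by itself since Python 3.4, so the explicit flag here is mostly to show what a C caller would pass to open()):

```python
import fcntl
import os


def open_cloexec(path, flags=os.O_RDONLY):
    """Open path with close-on-exec set atomically, as part of open() itself."""
    return os.open(path, flags | os.O_CLOEXEC)


if __name__ == "__main__":
    fd = open_cloexec(os.devnull)
    # The flag is already visible on the descriptor; no fcntl(F_SETFD) call
    # was needed, so another thread's fork() cannot slip in ahead of it.
    print(bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC))
    os.close(fd)
```

Every other way of creating a descriptor needs its own equivalent of this for the race to be fully closed, which is why this is a matter of 'things like' O_CLOEXEC and not one flag.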

(All of this is well known in the Unix programming community that pays attention to this stuff. I'm writing it down here so that I can get it straight and firmly fixed into my head, since I almost made an embarrassing mistake about it.)

unix/ForkFDsAndRaces written at 23:03:50

One good use for default function arguments

When I wrote about a danger of default function arguments I mentioned that there are cases where they make sense and are useful. Today I'm going to present what I feel is one of them.

To put it simply, one use for default arguments is when you effectively have a bunch of slightly different APIs but it would be awkward to have different functions for them. Not uncommonly you would have too many function variants, the code would be too entwined, or both. Reasonably chosen default arguments effectively give you multiple APIs: one with all the defaults, another with one set of defaults overridden, and so on. You can in fact discover how many different APIs you actually need more or less on the fly, as you write code that uses different combinations of default and non-default arguments.

All of that sounds really abstract, so I'll use an actual example from DWiki. DWiki has a concept of 'views', which are both different ways to present the same underlying bit of the filesystem and different ways of processing URLs. Views have names and handler functions, and there is a registration function for them:

def register(name, factory, canGET = True, canPOST = False,
             onDir = False, onFile = True, pubDir = False,
             getParams = [], postParams = []):

This is effectively several APIs in one. Fully expanded, I think it'd be one API that's used to register forms (that's what canPOST, a False value for canGET, getParams, and postParams are for), one API for views of directories only, one API for views of files only, and one API for views that work on both files and directories. As separate functions, each would have a subset of the full arguments for register(). But equally, as separate functions they would all do the same thing and they'd have basically the same name (there is no natural strong name difference between 'register a form view' and 'register a directory view' and so on).

I dislike small variations of things (I'm driven to generalize), so when I was writing DWiki I didn't make separate functions for each API; instead I slammed them together into one function with a bunch of default arguments. The simplest case (with all arguments as defaults) corresponds to what I thought at the time was the most common case, or at least the base case.
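
To illustrate how one function serves as several APIs, here is a sketch built around that signature. Everything apart from the argument names is made up for the example: the views registry, the make_view factory, the view names, and the parameter names are hypothetical, and I've used tuples in place of the list defaults.

```python
# Hypothetical registry; DWiki's real bookkeeping is not shown in this entry.
views = {}


def register(name, factory, canGET=True, canPOST=False,
             onDir=False, onFile=True, pubDir=False,
             getParams=(), postParams=()):
    """Record a view. Tuples stand in for the list defaults in this sketch."""
    views[name] = {
        "factory": factory,
        "canGET": canGET, "canPOST": canPOST,
        "onDir": onDir, "onFile": onFile, "pubDir": pubDir,
        "getParams": tuple(getParams), "postParams": tuple(postParams),
    }


def make_view(context):
    """Placeholder factory; a real one would build a view object."""
    return ("view", context)


# 'API' 1: all defaults, a GET-only view of files.
register("plain", make_view)
# 'API' 2: a view of directories only.
register("listing", make_view, onDir=True, onFile=False)
# 'API' 3: a view that works on both files and directories.
register("history", make_view, onDir=True)
# 'API' 4: a form, which takes POSTs with parameters instead of GETs.
register("login", make_view, canGET=False, canPOST=True,
         postParams=["user", "password"])
```

Each call only mentions the arguments that differ from the base case, so the call itself tells you which of the 'APIs' is in use.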

(This view of default arguments creating multiple APIs comes from a bit in this talk on Go; reading it crystallized several things in my head.)

python/DefaultArgumentsAsAPIs written at 00:46:15


This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.