2012-12-12
fork() and closing file descriptors
As I noted in Why fork() is a good API, back in the bad old days Unix had a problem of stray file descriptors leaking from processes into commands that they ran (for example, rsh used to gift your shell process with any number of strays). In theory the obvious way to solve this is to have code explicitly close all file descriptors before it exec()s something. In practice Unix has chosen to solve this with a special flag on file descriptors, FD_CLOEXEC, which causes them to be automatically closed when the process exec()s.
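This behavior can be demonstrated from Python (a minimal sketch; note that modern Python already makes descriptors non-inheritable by default under PEP 446, which is why one fd below has to be explicitly marked inheritable):

```python
import os
import subprocess
import sys

# Create two pipes.  Mark one write end explicitly inheritable; leave
# the other with close-on-exec set (the default in Python 3.4+, per
# PEP 446).
r1, w1 = os.pipe()
os.set_inheritable(w1, True)
r2, w2 = os.pipe()

# In the exec()ed child, fstat() succeeds on the inherited fd and
# fails with EBADF on the close-on-exec one.
code = (
    "import os\n"
    f"os.fstat({w1})\n"
    "try:\n"
    f"    os.fstat({w2})\n"
    "    print('leaked')\n"
    "except OSError:\n"
    "    print('closed on exec')\n"
)
out = subprocess.run([sys.executable, "-c", code],
                     close_fds=False, capture_output=True, text=True)
```

On a POSIX system this reports that the close-on-exec descriptor did not survive the exec() while the inheritable one did.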
In that entry I mentioned that there was a good reason for this alternate solution in practice. At the start of planning this followup entry I had a nice story all put together in my head about why this was so, involving thread-based concurrency races. Unfortunately that story is wrong (although a closely related concurrency race story is the reason for things like O_CLOEXEC in Linux's open()).
FD_CLOEXEC is not necessary to deal with a concurrency race between thread A creating a new file descriptor and thread B fork()ing and then exec()ing in the child process, because the child's file descriptors are frozen at the moment that it's created by fork() (with a standard fork()). It's perfectly safe for the child process to manually close all stray open file descriptors in user-level code, because no matter what it does thread A can never make new file descriptors appear in the child process partway through this. Either they're there at the start (and will get closed by the user-level code), or they'll never be there at all.
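The frozen-at-fork() behavior can be shown directly (a sketch, using a pipe only to tell the child which fd number to check):

```python
import os

# A child's fd table is a snapshot taken at fork(): a descriptor the
# parent opens *after* forking never appears in the child.
r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child: wait for the parent to open its new descriptor.
    os.close(w)
    fd = int(os.read(r, 16))
    os.close(r)
    # fstat() on that fd number must fail: the fd was created in the
    # parent after this process's table was frozen.
    try:
        os.fstat(fd)
        os._exit(1)   # unexpectedly visible
    except OSError:
        os._exit(0)   # not visible, as expected
else:
    os.close(r)
    new_fd = os.open(os.devnull, os.O_RDONLY)  # opened after fork()
    os.write(w, str(new_fd).encode())
    os.close(w)
    _, status = os.waitpid(pid, 0)
    exitcode = os.WEXITSTATUS(status)
```

The child exits 0 because the parent's post-fork() descriptor is never visible to it.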
There are, however, several practical reasons that FD_CLOEXEC exists. First and foremost, it proved pragmatically easier to get code (often library code) to set FD_CLOEXEC than to get every bit of code that did a fork() and exec() sequence to always clean up file descriptors properly. It also means that you don't have to worry about file descriptors being created in the child process in various ways, especially by library code (which might be threaded code, for extra fun). Finally, it deals with the problem that Unix has no API for finding out what file descriptors your process has open, so your only way of closing all stray file descriptors in user code is the brute force approach of looping and trying to close each one in turn (and on modern Unixes, that can be a lot of potential file descriptors).
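The brute force approach looks something like this (a hypothetical helper, not code from any real program):

```python
import os
import resource

def close_stray_fds(first=3, last=None):
    # Brute-force sweep: Unix traditionally offers no way to ask which
    # descriptors are open, so we simply try to close every possible
    # fd above stderr.
    if last is None:
        # The soft RLIMIT_NOFILE limit bounds how high an fd number
        # can go, and on modern systems it can be very large.
        last = resource.getrlimit(resource.RLIMIT_NOFILE)[0]
    for fd in range(first, last):
        try:
            os.close(fd)
        except OSError:
            pass  # fd wasn't open (EBADF); ignore
```

(Python itself has os.closerange() for this, and modern Linux has a close_range(2) system call, both of which exist precisely because this loop is so common and so wasteful.)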
Once you have FD_CLOEXEC and programs that assume they can use it to just fork() and exec(), you have the thread races that lead you to needing things like O_CLOEXEC. Any time a file descriptor can come into existence without FD_CLOEXEC being set on it, you have a race between thread A creating the file descriptor and then setting FD_CLOEXEC and thread B doing a fork() and exec(). If thread B 'wins' this race, the child it creates will inherit a new file descriptor that does not have FD_CLOEXEC set, and this file descriptor will leak through the exec().
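The two patterns look like this in Python (a sketch; the racy window is real in C, while CPython 3.4+ already opens with O_CLOEXEC under the hood per PEP 446, so here it is purely illustrative):

```python
import fcntl
import os

# Two-step pattern (what C code must do without O_CLOEXEC): between
# open() and the F_SETFD fcntl() the descriptor lacks FD_CLOEXEC, and
# a fork()+exec() in another thread during that window leaks it.
fd = os.open(os.devnull, os.O_RDONLY)
flags = fcntl.fcntl(fd, fcntl.F_GETFD)
fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
os.close(fd)

# One-step pattern: O_CLOEXEC makes the kernel set the flag atomically
# as part of open() itself, so there is no window at all.
fd = os.open(os.devnull, os.O_RDONLY | os.O_CLOEXEC)
cloexec_set = bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
os.close(fd)
```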
(All of this is well known in the Unix programming community that pays attention to this stuff. I'm writing it down here so that I can get it straight and firmly fixed into my head, since I almost made an embarrassing mistake about it.)
One good use for default function arguments
When I wrote about a danger of default function arguments I mentioned that there are cases where they make sense and are useful. Today I'm going to present what I feel is one of them.
To put it simply, one use for default arguments is when you effectively have a bunch of slightly different APIs but it would be awkward to have different functions for them. Not uncommonly you might have too many function variants, the code would be too entwined, or both. Reasonably chosen default arguments effectively give you multiple APIs; one with all the defaults, another with one set of defaults overridden, and so on. You can in fact discover how many different APIs you actually need more or less on the fly, as you write code that uses different combinations of default and non-default arguments.
All of that sounds really abstract, so I'll use an actual example from DWiki. DWiki has a concept of 'views', which are both different ways to present the same underlying bit of the filesystem and different ways of processing URLs. Views have names and handler functions, and there is a registration function for them:
def register(name, factory, canGET = True, canPOST = False,
             onDir = False, onFile = True, pubDir = False,
             getParams = [], postParams = []):
    [....]
This is effectively several APIs in one. Fully expanded, I think it'd be one API that's used to register forms (that's what canPOST, a False value for canGET, getParams, and postParams are for), one API for views of directories only, one API for views of files only, and one API for views that work on both files and directories. As separate functions, each would have a subset of the full arguments for register(). But equally, as separate functions they would all do the same thing and they'd have basically the same name (there is no natural strong name difference between 'register a form view' and 'register a directory view' and so on).
I dislike small variations of things (I'm driven to generalize), so when I was writing DWiki I didn't make separate functions for each API; instead I slammed them together into one function with a bunch of default arguments. The simplest case (with all arguments as defaults) corresponds to what I thought at the time was the most common case, or at least the base case.
(This view of default arguments creating multiple APIs comes from a bit in this talk on Go; reading it crystallized several things in my head.)