The constraints shaping kernel APIs
I was recently reading API Design Matters, which uses the .NET
socket select() function as an example of a sub-par API. While
thinking about how it ended up with that API, I wound up mulling
over the constraints that generally shape kernel APIs (this .NET
API is ultimately descended from the Unix select() system call).
The first big constraint on system calls is that the actual call itself cannot allocate memory in your process under essentially any circumstances. The direct result is that all data needs to be returned in preallocated buffers that are passed (directly or indirectly) to the kernel as part of the call. The indirect result is that kernel APIs are biased very strongly towards needing buffers that have predictable, constant sizes. A kernel API that needs a highly variable-sized output buffer is very awkward to work with; generally either you over-allocate for most cases in order to have room for the worst case or you iterate the system call at least twice in order to determine and then provide a right-sized buffer.
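As a minimal sketch of the 'call it twice' pattern, here is how it looks with POSIX getgroups(), which reports how many entries it needs when you give it a size of zero. The helper name fetch_groups() is made up for this illustration, and the code assumes the group list does not change between the two calls.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns a malloc()'d array holding the process's
 * supplementary group IDs and stores how many there are in *count.
 * Returns NULL on failure. The caller frees the result. */
gid_t *fetch_groups(int *count)
{
    /* First call: a size of 0 asks only "how big a buffer do I need?".
     * The kernel writes nothing to user space here. */
    int needed = getgroups(0, NULL);
    if (needed < 0)
        return NULL;

    /* The allocation happens in user space, never inside the call.
     * Allocate at least one element so a zero count is not an error. */
    gid_t *list = malloc((needed > 0 ? (size_t)needed : 1) * sizeof *list);
    if (list == NULL)
        return NULL;

    /* Second call: the kernel fills the buffer we preallocated. */
    int got = getgroups(needed, list);
    if (got < 0) {
        free(list);
        return NULL;
    }
    *count = got;
    return list;
}
```

The alternative is to skip the first call and pass a buffer sized for the worst case, which trades a system call for wasted memory.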
The other big constraint is that historically, kernel implementors prefer to do as little writing to user space as they can get away with. From their perspective the best system call API is one that simply returns its result in a register, the next best is one that puts a result or two in a memory location or two, and the worst is one that requires them to splat various things all over your user memory space. It's not hard to see why this is; unlike in a library, writing things to user space requires distinct and separate work for everything the code writes (even if this is usually wrapped up in function calls and macros). This has driven kernel APIs to return the minimal information required and leave it up to user space to work out everything from there (either in a library or in your code).
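To make the ranking concrete, here is a small sketch using two standard POSIX calls. The wrapper function names are made up for this illustration; getpid() and pipe() are the real calls.

```c
#include <sys/types.h>
#include <unistd.h>

/* The friendliest case for the kernel: the whole result is the return
 * value, and nothing is written into the caller's memory. */
pid_t result_in_return_value(void)
{
    return getpid();
}

/* One step down: the kernel has to write two ints into an array that
 * the caller allocated and passed in. */
int result_in_user_memory(int fds[2])
{
    return pipe(fds);
}
```

Note that pipe() still keeps the output tiny and fixed in size, which fits the preference described above.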
(Another constraint worth mentioning is that the general system call API often makes it much easier to do calls with a small number of arguments than calls with lots of arguments. Small numbers of arguments can often be passed directly in registers, while lots of arguments can require quite involved conventions and extra work for the kernel to dig out of user space.)
We can see these constraints at work in the select() API.
select() overwrites its inputs because that's clearly the simplest
place to put the output data, never mind that this is massively
inconvenient for the common case of repeated select() calls on the
same set of file descriptors. Anything else would require extra
buffers and extra arguments to the system call. If the system call
returned convenient extra information (such as how many of each sort
of file descriptor were active), that would require extra writes to
user space (and probably extra arguments).
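Because select() overwrites its inputs, the usual workaround is to keep a master copy of the descriptor set and hand select() a fresh copy on every call. A minimal sketch, assuming we only care about readability and never want to block (poll_ready() is a hypothetical helper name):

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: check which descriptors in *master are readable
 * right now, without blocking. *master is left untouched; the answer
 * is written to *ready. Returns what select() returns: the number of
 * ready descriptors, 0 if none, or -1 on error. */
int poll_ready(const fd_set *master, int maxfd, fd_set *ready)
{
    struct timeval no_wait = { 0, 0 };  /* zero timeout: return at once */

    /* select() will overwrite whatever set we pass it, so give it a
     * copy and keep the original for the next call. */
    *ready = *master;
    return select(maxfd + 1, ready, NULL, NULL, &no_wait);
}
```

The copy on every call is exactly the extra work the API pushes onto user space. Also note that the only summary select() hands back is a single count in its return value.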
Sidebar: argument counts and structures
One way to reduce the argument count to system calls is to pass some form of structure that aggregates things together instead of separate arguments. There are at least three strikes against this:
- on an abstract level you haven't actually reduced the argument count,
you've just hidden some of the arguments behind indirection.
- kernel implementors traditionally prefer to do as little chasing of
user space pointers as they can get away with. Every user space
pointer that has to be dereferenced is more hassle (and one more
thing to check carefully).
- using structures (especially C structs) has historically been a land mine over the long term. The simpler the arguments are the easier it is to deal with things like 32-bit to 64-bit transitions, and you totally avoid compiler alignment and structure padding issues and so on.
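The last point can be illustrated with a sketch of two made-up argument structures. Neither comes from a real system call, and the sizes in the comments are what typical ILP32 and LP64 Unix ABIs produce rather than anything the C standard promises.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical argument block using 'natural' C types. Its layout
 * depends on the ABI: with a 4-byte long it is typically 8 bytes with
 * value at offset 4, and with an 8-byte long it is typically 16 bytes
 * with value at offset 8. A 64-bit kernel serving 32-bit programs has
 * to cope with both. */
struct loose_args {
    char tag;
    long value;
};

/* The same information with fixed-width fields and the padding spelled
 * out. On common ABIs this is 16 bytes with value at offset 8 whether
 * the program is built as 32-bit or 64-bit. */
struct fixed_args {
    uint32_t tag;
    uint32_t reserved;  /* explicit padding instead of compiler padding */
    uint64_t value;
};
```

Checking sizeof and offsetof for both structures in 32-bit and 64-bit builds shows the difference directly. Plain integer arguments sidestep the whole question.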