What should it mean for a system call to time out?
I was just reading Evan Klitzke's Unix System Call Timeouts (via) and among a number of thoughts about it, one of the things that struck me is a simple question. Namely, what should it mean for a Unix system call to time out?
This question may sound pointlessly philosophical, but it's actually
very important because what we expect a system call timeout to mean
will make a significant difference in how easy it would be to add
system calls with timeouts. So let's sketch out two extreme versions.
The first extreme version is that if a timeout occurs, the operation
done by the system call is entirely abandoned and undone. For
example, if you call
rename("a", "b") and the operation times
out, the kernel guarantees that the file
a has not been renamed
b. This is obviously going to be pretty hard, since the kernel
may have to reverse partially complete operations. It's also not
always possible, because some operations are genuinely irreversible.
write() data to a pipe and time out partway through doing
so (with some but not all data written), you cannot reach into the
pipe and 'unwrite' all of the already sent data; after all, some
of it may already have been read by a process on the other side of
The second extreme version is that having a system call time out merely causes your process to stop waiting for it to complete, with no effects on the kernel side of things. Effectively, the system call is shunted to a separate thread of control and continues to run; it may complete some time, or it may error out, but you never have to wait for it to do either. If the system call would normally return a new file descriptor or the like, the new file descriptor will be closed immediately when the system call completes. In practice implementing a strict version of this would also be relatively hard; you'd need an entire infrastructure for transferring system calls to another kernel context (or more likely, transplanting your user-level process to another kernel context, although that has its own issues). This is also at odds with the existing system calls that take timeouts, which generally result in the operation being abandoned part way through with no guarantees either way about its completion.
(For example, if you make a non-blocking
connect() call and then
select() to wait for it with a timeout, the kernel does not
guarantee that if the timeout fires the
connect() will not be
completed. You are in fact in a race between your likely
of the socket and the connection attempt actually completing.)
The easiest thing to implement would probably be a middle version.
If a timeout happens, control returns to your user level with a
timeout indication, but the operation may be partially complete and
it may be either abandoned in the middle of things or completed for
you behind your back. This satisfies a desire to be able to bound
the time you wait for system calls to complete, but it does leave
you with a messy situation where you don't know either what has
happened or what will happen when a timeout occurs. If your
times out, the directory may or may not exist when you look for it,
and it may or may not come into existence later on.
(Implementing timeouts in the kernel is difficult for the same reason that asynchronous IO is hard; there is a lot of kernel code that is much simpler if it's written in straight line form, where it doesn't have to worry about abandoning things part way through at essentially any point where it may have to wait for the outside world.)
Modern X Windows can be a very complicated environment
I mentioned Corebird, the GTK+ Twitter client the other day, and generally positive. That was on a logical weekend. The next day I went in to the office, set up Corebird there, and promptly ran into a problem: I couldn't click on links in Tweets, or rather I could but it didn't activate the link (it would often do other things). Corebird wasn't ignoring left mouse clicks in general, it's just that they wouldn't activate links. I had not had this problem at home (or my views would not have been so positive. I use basically the same fvwm-based window manager environment at home and at work, but since Corebird is a GTK+ application and GTK+ applications can be influenced by all sorts of magic settings and (Gnome) setting daemons, I assumed that it was something subtle that was different in my work GTK+/Gnome environment and filed a Fedora bug in vague hopes. To my surprise, it turned out to be not merely specific to fvwm, but specific to one aspect of my particular fvwm mouse configuration.
The full version is in the thread in the fvwm mailing list, but normally when you click and release a button, the X server generates two events, a ButtonPress and then a ButtonRelease. However, if fvwm was configured in a way such that it might need to do something with a left button press, a different set of events was generated:
- a LeaveNotify with mode NotifyGrab, to tell Corebird that the mouse pointer had been grabbed away from it (by fvwm).
- an EnterNotify with mode NotifyUngrab, to tell Corebird 'here is your mouse pointer back because the grab has been released' (because fvwm was passing the button press through to Corebird).
- the ButtonPress for the mouse button.
The root issue appears to be that something in the depths of GTK+ takes the LeaveNotify to mean that the link has lost focus. Since GTK+ doesn't think the link is focused, when it receives the mouse click it doesn't activate the link, but it does take other action, since it apparently still understands that the mouse is being clinked in the text of the GtkLabel involved.
If you think that this all sounds very complex, yes, exactly. It is. X has a complicated event model to start with, and then interactions with the window manager add extra peculiarities on top. The GTK+ libraries are probably strictly speaking in the wrong here, but I also rather suspect that this is a corner case that the GTK+ programmers never imagined, much less encountered. In a complex environment, some possibilities will drop through the cracks.
(If you want to read a high level overview of passive and active (mouse button) grabs, see eg this 2010 writeup by Peter Hutter. Having read it, I feel like I understand a bit more about what fvwm is doing here.)
By the way, some of this complexity is an artifact of the state of computing when X was created, specifically that both computers and networking were slow. Life would be simpler for everyone if all X events were routed through the window manager and then the window manager passed them on to client programs as appropriate. However, this would require all events to pass through an extra process (and possibly an extra one or two network hops), and in the days when X was young this could have had a real impact on overall responsiveness. So X goes to a great deal of effort to deliver events directly to programs whenever possible while still allowing the window manager to step in.
(My understanding is that in Wayland, the compositor handles all events and passes them to clients as it decides. The Wayland compositor is a lot more than just the equivalent of an X window manager, but it fills that role, and so in Wayland this issue wouldn't come up.)
Cheap concurrency is an illusion (at least on Unix)
[...] this memo assumes that there already exists an efficient concurrency implementation where forking a new lightweight process takes at most hundreds of nanoseconds and context switch takes tens of nanoseconds. Note that there are already such concurrency systems deployed in the wild. One well-known example are Golang's goroutines but there are others available as well.
When designing APIs and similar things, it is quite important to understand that extremely lightweight processes are an illusion. I don't mean that in the sense that they aren't actually lightweight in practice (although you probably want to pay attention to CPU cache effects here). I mean that in the sense that they don't actually exist and their nonexistence has real consequences.
All 'lightweight processes' on Unix are some form of what is known as 'green threads', which is to say that they exist purely at the user level. They are extremely cheap to create and switch to because all that the user level has to do here is shuffle some registers around or mark some memory as allocated for the initial stack. But the moment that you have to materialize a kernel entity to back a lightweight process (perhaps because it is about to do a blocking operation), things become much more expensive.
The reality is that there is no such thing as a genuinely lightweight kernel process, at least not in Unixes. Kernel processes have associated data structures and not insignificant kernel stacks, and they take involved locks to maintain reference counts on virtual memory areas and so on. Kernel processes are certainly a lot more lightweight these days than they used to be ten or twenty years ago, sufficiently so that POSIX threads are mostly 1:1 user to kernel threads because it's simpler, but they don't even start approaching the kind of lightweight you need to have as many of them as you can have, say, goroutines in a Go program. Systems like Go engage in a very careful behind the scenes dance in order to multiplex all of those goroutines onto many fewer kernel processes.
(One reason kernel processes need not insignificant kernel stacks is that Unix kernels are written in C, and C does not like having its stack relocated around in order to resize it. Languages like Go go to a great deal of effort in their compiler and runtime to make this work (and this can be alarmingly complicated).)
The consequence of this is that if you want to do a lot of concurrent things at once, at the level of the API to the kernel you can't do these things in a 1 to 1 relationship between a single thing and a kernel thread. If you try to have a 1:1 relationship, your system will explode under load with too many relatively expensive kernel threads. Instead you really must have some form of aggregation in the userspace-to-kernel API, regardless of whether or not you expose it to people.
This doesn't mean that the illusion of lightweight processes and cheap concurrency doesn't work or isn't useful. Usually it does work and is useful, partly because people find it much easier to reason about the behavior of straight-line code than they do about the behavior of state machines or callbacks. But it's an abstraction, with all that that implies, and if you are designing APIs you need to think about how they will be implemented at the level where this illusion is not true. Otherwise you may wind up designing APIs that promise things that you can't actually deliver.
(This holds true for more than just user to kernel APIs. If you design a Go based system where you appear to promise that a ton of goroutines can all do file IO all at once, you are probably going to have pain because in most environments each separate bit of file IO will require its own kernel thread. The Go runtime may do its best to deliver this, but you probably won't like what happens to your machine when a few thousand goroutines are busy trying to do their theoretically lightweight file IO.)