Operating system threads are always going to be (more) expensive

September 7, 2024

Recently I read Asynchronous IO: the next billion-dollar mistake? (via). Among other things, it asks:

Now imagine a parallel universe where instead of focusing on making asynchronous IO work, we focused on improving the performance of OS threads [...]

I don't think this would have worked as well as you'd like, at least not with any conventional operating system. One of the core problems with making operating system threads really fast is the 'operating system' part.

A characteristic of all mainstream operating systems is that the operating system kernel operates in a separate hardware security domain from regular user (program) code. This means that any time the operating system becomes involved, the CPU must do at least two transitions between these security domains (into kernel mode and then back out). Doing these transitions is always more costly than not doing them, and on top of that the CPU's ISA often requires that the operating system go through non-trivial work in order to be safe from user-level attacks.
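To make the cost concrete, here is a minimal sketch (assuming Linux, glibc, and a gcc or clang compiler; the exact numbers vary with the CPU and its mitigation settings) that compares a deliberately trivial system call against an equally trivial ordinary function call:

    /* Time a trivial syscall against a trivial function call (Linux assumed). */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #define N 1000000

    /* 'noinline' keeps the compiler from optimizing the call away. */
    __attribute__((noinline)) static long plain_call(void) { return 0; }

    static long long elapsed_ns(struct timespec a, struct timespec b) {
        return (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
    }

    int main(void) {
        struct timespec t0, t1;
        volatile long sink = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++)
            sink += plain_call();              /* stays in user space */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("function call: %lld ns/iter\n", elapsed_ns(t0, t1) / N);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++)
            sink += syscall(SYS_getpid);       /* forces a kernel round trip */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("getpid syscall: %lld ns/iter\n", elapsed_ns(t0, t1) / N);
        return 0;
    }

On typical hardware the syscall iterations come out an order of magnitude or more slower than the function calls, and essentially all of that difference is the cost of crossing between the security domains.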

(The whole speculative execution set of attacks has only made this worse.)

A great deal of the low-level work of modern asynchronous IO is about not crossing between these security domains, or doing so as little as possible. This is summarized as 'reducing system calls because they're expensive', which is true as far as it goes, but even the cheapest possible system call still has to cross between the domains (if it is an actual system call; some operating systems have 'system calls' that manage to execute entirely in user space).
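The Linux vDSO is the classic example of such a user-space 'system call': glibc routes clock_gettime() through a page of kernel-provided code that is mapped into every process and runs entirely in user space. A small sketch to illustrate, assuming Linux; run it under strace and the first call produces no clock_gettime syscall trace line, while the forced version does:

    /* A vDSO-backed 'system call' versus a genuine kernel entry (Linux). */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void) {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);              /* vDSO: user space only */
        printf("via vDSO:    %ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);

        syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts); /* genuine kernel entry */
        printf("via syscall: %ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
        return 0;
    }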

The less that doing things with threads crosses the CPU's security boundary into (and out of) the kernel, the faster the threads go, but the less we can really describe them as 'OS threads' and the harder it is to get things like forced thread preemption. And this applies not just to the 'OS threads' themselves but also to their activities. If you want 'OS threads' that perform 'synchronous IO through simple system calls', those IO operations are also transitioning into and out of the kernel. If you work around this purely through software, I suspect that what you wind up with is something that looks a lot like 'green' (user-space) threads with asynchronous IO, once you peer behind the scenes of the abstractions that programs are seeing.

(You can do this today, as Go's runtime demonstrates. And you still benefit significantly from the operating system's high efficiency asynchronous IO, even if you're opting to use a simpler programming model.)
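As a toy illustration of thread switching that never involves the kernel scheduler, here is a minimal sketch of cooperative green threads using POSIX ucontext. One honest caveat, noted in the code: glibc's swapcontext() still makes one signal-mask syscall per switch, which production runtimes such as Go's avoid by doing the context switch in hand-written assembly.

    /* Two cooperative 'green' threads handing control back and forth in
     * user space, with the kernel scheduler uninvolved in the switching.
     * (Caveat: glibc's swapcontext() makes one rt_sigprocmask syscall per
     * switch to save the signal mask; real runtimes avoid even that.) */
    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, green_ctx;
    static char green_stack[64 * 1024];

    static void green_thread(void) {
        for (int i = 0; i < 3; i++) {
            printf("green thread, iteration %d\n", i);
            swapcontext(&green_ctx, &main_ctx);  /* yield back to 'main' */
        }
    }

    int main(void) {
        getcontext(&green_ctx);
        green_ctx.uc_stack.ss_sp = green_stack;
        green_ctx.uc_stack.ss_size = sizeof green_stack;
        green_ctx.uc_link = &main_ctx;           /* where to go if it returns */
        makecontext(&green_ctx, green_thread, 0);

        for (int i = 0; i < 3; i++) {
            swapcontext(&main_ctx, &green_ctx);  /* 'schedule' the green thread */
            printf("back in main, iteration %d\n", i);
        }
        return 0;
    }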

(See also thinking about event loops versus threads.)


Comments on this page:

By cannon at 2024-09-07 01:30:24:

A characteristic of all mainstream operating systems is that the operating system kernel operates in a separate hardware security domain from regular user (program) code. This means that any time the operating system becomes involved, the CPU must do at least two transitions between these security domains (into kernel mode and then back out).

No, it doesn't mean that. While there's a certain amount of equivocation regarding "conventional operating system[s]" and "if it is an actual system call", it's not clear how much was meant to apply to this seemingly-absolute statement, which I don't believe is true.

The concept of "the CPU" transitioning died about 15 years ago; now, it's a CPU core that transitions (possibly a virtual core, if there's SMT). In conventional operating systems, anyway. But with multiple cores available, a programmer could arrange to already have a core in the target state, in which case no transition would be needed. The article mentions io_uring as a possible communication channel.

On current CPUs, dedicating a core this way is likely to decrease overall performance. But when the author writes "imagine a parallel universe", I think it's implied that conventional software and hardware designs are to be questioned, and that "OS" and "OS thread" might end up looking unfamiliar. There could be an interesting research project in figuring out what this parallel universe might look like.
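For the curious, this is roughly what 'already have a core in the target state' looks like today with io_uring's SQPOLL mode (a sketch assuming liburing and a recent kernel; older kernels required elevated privileges for SQPOLL):

    /* io_uring with kernel-side submission polling: while the kernel's
     * poller thread is awake, submitting work is just writes to the
     * shared ring, with no security-domain transition. */
    #include <stdio.h>
    #include <string.h>
    #include <liburing.h>

    int main(void) {
        struct io_uring ring;
        struct io_uring_params params;

        memset(&params, 0, sizeof params);
        params.flags = IORING_SETUP_SQPOLL;  /* kernel-side submission polling */
        params.sq_thread_idle = 2000;        /* poller sleeps after 2s of idle */

        int ret = io_uring_queue_init_params(8, &ring, &params);
        if (ret < 0) {
            fprintf(stderr, "queue init: %s\n", strerror(-ret));
            return 1;
        }

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_nop(sqe);              /* stand-in for a real IO request */
        io_uring_submit(&ring);              /* no syscall while the poller is awake */

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);      /* waiting may still enter the kernel */
        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);
        return 0;
    }

With the poller awake, liburing skips the io_uring_enter() syscall entirely and only makes it to wake the kernel thread back up after it has gone idle.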

The syscall boundary could always be removed: switch to asynchronous shared-memory ring buffers for communication, like user-space network stacks and so on. If you didn't want to complicate the end-user API, this would then require a spin-wait (maybe upgrading to a park or yield after a certain number of cycles) to await a core next door doing the privileged operations and returning a response; and of course it may be more efficient to make state transitions into and out of kernel mode to take advantage of data locality. It makes a good security boundary, though: message passing to another core makes it harder to create side-channel leaks of core state.
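A minimal single-producer/single-consumer sketch of that ring buffer idea, with an ordinary thread standing in for the 'core next door' and C11 atomics assumed:

    /* SPSC shared-memory ring buffer: the fast path is plain memory
     * writes plus a spin-then-yield wait, with no syscall per message. */
    #include <stdio.h>
    #include <stdatomic.h>
    #include <sched.h>
    #include <pthread.h>

    #define RING_SIZE 256  /* power of two so the indices wrap cleanly */

    static long slots[RING_SIZE];
    static _Atomic unsigned head, tail;  /* head: next write, tail: next read */

    static void ring_push(long v) {
        unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
        while (h - atomic_load_explicit(&tail, memory_order_acquire) == RING_SIZE)
            sched_yield();               /* ring full: back off */
        slots[h % RING_SIZE] = v;
        atomic_store_explicit(&head, h + 1, memory_order_release);
    }

    static long ring_pop(void) {
        unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
        int spins = 0;
        while (t == atomic_load_explicit(&head, memory_order_acquire))
            if (++spins > 1000)
                sched_yield();           /* spin a while, then get out of the way */
        long v = slots[t % RING_SIZE];
        atomic_store_explicit(&tail, t + 1, memory_order_release);
        return v;
    }

    static void *consumer(void *arg) {
        (void)arg;                       /* stand-in for the privileged core */
        for (int i = 0; i < 5; i++)
            printf("got %ld\n", ring_pop());
        return NULL;
    }

    int main(void) {
        pthread_t c;
        pthread_create(&c, NULL, consumer, NULL);
        for (long i = 0; i < 5; i++)
            ring_push(i);                /* no syscall on the fast path */
        pthread_join(c, NULL);
        return 0;
    }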

By Alex Shpilkin at 2024-09-07 13:52:38:

From historical documents on NPTL (the Glibc pthreads implementation since 2003), it looks to me like there is an alternate world where Glibc threads are green by default, aided by an implementation of scheduler activations in the Linux kernel.

In this one, people did try, but they gave up. I remember a characteristically bombastic footnote by Drepper saying that anyone who thought scheduler-activation-based threads were possible was a fool; I can't find the exact quote, though. The important caveat is that he was talking about an implementation using the Linux syscall interface ca. 2002, where a good one would indeed probably use an async interface along the lines of io_uring instead; I agree with you about that. So instead of a (known!) proper solution, we make do by YOLOing thread pool sizes and hoping the kernel will schedule them in some reasonable way.

Just a little bit of systems-software-research-is-irrelevant blues, I guess.

I also read that article, and agreed with its desire. Efficiency shouldn't come at the expense of comprehensibility, which is exactly what's lost when manually splitting a simple operation into its smallest parts to cope with blocking and other nonsense. It's far easier to be able to express an algorithm directly, and this is part of why I wish to learn Erlang someday.

One of the core problems with making operating system threads really fast is the 'operating system' part.

I agree. Here's a relevant quote:

An operating system is a collection of things that don't fit into a language. There shouldn't be one.

-Dan Ingalls

A characteristic of all mainstream operating systems is that the operating system kernel operates in a separate hardware security domain from regular user (program) code.

Yes, mainstream is the key word there, considering that plenty of operating systems use a single address space and other mechanisms to enforce security. Consider the Lisp Machines, which used language-level security to keep code from running willy-nilly, instead of letting each program pretend it had total control over the machine.

I recommend that the commentator named cannon look into data-flow machines.

By cannon at 2024-09-08 13:05:07:

I recommend that the commentator named cannon look into data-flow machines.

Thanks. It looks like there could be something interesting there, though I'm having trouble figuring out what exactly "data-flow machines" are. The Wikipedia article isn't very clear. The Venn paper it references is on Sci-hub, and I'll have to give that a read. The also-referenced Manchester page lists about 80 things, but with no links or DOI numbers whatsoever (apparently, I could get some magnetic tapes with Pascal software, for a charge of "=A350", whatever that means). So, that's probably a dead-end unless someone's sufficiently interested to head to a university library for a day of research.

I was also thinking of the (decade-old vaporware) Mill architecture when writing that. In particular, "Spawning a thread, dispatching one for execution, idling it, killing it, […] are all user-mode hardware operations on a Mill CPU, with performance comparable to a normal function call. The talk will describe in detail how this works, with examples from micro-kernel operating systems and concurrent languages like Go." It's a single-address-space design (excepting some compatibility mode to support the Unix "fork" call).

It looks like there could be something interesting there, though I'm having trouble figuring out what exactly "data-flow machines" are.

I can help with that. I too struggled to imagine the realization of data-flow machines. I reviewed a short document about one here.

My original theory, "The Math-based Grand Unified Programming Theory: The Pure Function Pipeline Data Flow with Principle-based Warehouse/Workshop Model", discussed and solved this long ago. See also:

- The unification of single-threaded, multi-threaded, asynchronous and distributed

 - async/await, Project Loom fiber, Gantt Chart, and Scientific Management

https://github.com/linpengcheng/PurefunctionPipelineDataflow?tab=readme-ov-file#async
