Cheap concurrency is an illusion (at least on Unix)
[...] this memo assumes that there already exists an efficient concurrency implementation where forking a new lightweight process takes at most hundreds of nanoseconds and context switch takes tens of nanoseconds. Note that there are already such concurrency systems deployed in the wild. One well-known example are Golang's goroutines but there are others available as well.
When designing APIs and similar things, it is quite important to understand that extremely lightweight processes are an illusion. I don't mean that in the sense that they aren't actually lightweight in practice (although you probably want to pay attention to CPU cache effects here). I mean that in the sense that they don't actually exist and their nonexistence has real consequences.
All 'lightweight processes' on Unix are some form of what is known as 'green threads', which is to say that they exist purely at the user level. They are extremely cheap to create and switch to because all that the user level has to do here is shuffle some registers around or mark some memory as allocated for the initial stack. But the moment that you have to materialize a kernel entity to back a lightweight process (perhaps because it is about to do a blocking operation), things become much more expensive.
The reality is that there is no such thing as a genuinely lightweight kernel process, at least not in Unixes. Kernel processes have associated data structures and not insignificant kernel stacks, and they take involved locks to maintain reference counts on virtual memory areas and so on. Kernel processes are certainly a lot more lightweight these days than they used to be ten or twenty years ago, sufficiently so that POSIX threads are mostly 1:1 user to kernel threads because it's simpler, but they don't even start approaching the kind of lightweight you need to have as many of them as you can have, say, goroutines in a Go program. Systems like Go engage in a very careful behind the scenes dance in order to multiplex all of those goroutines onto many fewer kernel processes.
(One reason kernel processes need not insignificant kernel stacks is that Unix kernels are written in C, and C does not like having its stack relocated around in order to resize it. Languages like Go go to a great deal of effort in their compiler and runtime to make this work (and this can be alarmingly complicated).)
The consequence of this is that if you want to do a lot of concurrent things at once, at the level of the API to the kernel you can't do these things in a 1 to 1 relationship between a single thing and a kernel thread. If you try to have a 1:1 relationship, your system will explode under load with too many relatively expensive kernel threads. Instead you really must have some form of aggregation in the userspace-to-kernel API, regardless of whether or not you expose it to people.
This doesn't mean that the illusion of lightweight processes and cheap concurrency doesn't work or isn't useful. Usually it does work and is useful, partly because people find it much easier to reason about the behavior of straight-line code than they do about the behavior of state machines or callbacks. But it's an abstraction, with all that that implies, and if you are designing APIs you need to think about how they will be implemented at the level where this illusion is not true. Otherwise you may wind up designing APIs that promise things that you can't actually deliver.
(This holds true for more than just user to kernel APIs. If you design a Go based system where you appear to promise that a ton of goroutines can all do file IO all at once, you are probably going to have pain because in most environments each separate bit of file IO will require its own kernel thread. The Go runtime may do its best to deliver this, but you probably won't like what happens to your machine when a few thousand goroutines are busy trying to do their theoretically lightweight file IO.)