Go's sync.Pool has (undocumented) 'thread' locality
I was recently reading Andrei Pechkurov's Thread-Local State in Go, Huh? (via), which told me something surprising:
I'm talking of sync.Pool. If you're familiar with its source code, you already know that it uses thread-local pools under the hood. If we allocate a struct and place it in the pool, the next time we request it on the same thread (but not necessarily same goroutine) we should get the same struct.
First, let's say the obvious thing: this
is undocumented and so may change at any time, if the Go developers
feel that it should be done a different way or just if they get
annoyed at people building code around it. The second thing to say
is that this doesn't mean what you want and it's not necessarily
predictable, although it's more predictable than I initially thought.
Go (currently) uses an M:N work stealing scheduler to multiplex goroutines on to OS threads. The scheduler has three important sorts of entities: a G is a goroutine, an M is an OS thread (a 'machine'), and a P is a 'processor', which at its core is a limited resource that must be claimed by an M in order to run user-level Go code. As far as I can tell, the 'local pools' that sync.Pool (currently) maintains are really P-local pools, not thread-local ones.
There are always N Ps (where N is the number of 'CPUs' Go is allowed to use). In a steady state of computation, there are more or less N active Ms, each of which claims a particular P, that are scheduling and running goroutines (Gs), and generally a G will stay with an M and thus a P. However, this can get perturbed if, for example, you're making synchronous system calls. There are also no guarantees that your OS will keep running the OS level thread (an M, holding a P) on the same actual CPU as before; it may get bumped off by other things that want the CPU and then re-scheduled on to a different idle CPU. The association between Ps and system CPUs is only a loose one, which means that you may not get as much CPU cache locality from these 'local pools' in sync.Pool as you could hope for.
What the P-local pools are good at is reducing contention. Only one goroutine can be associated with a P at any one time, so that goroutine (generally) isn't contending with anything else when it adds something to a P-local part of the pool or gets an available object from it. And in fact sync.Pool has a second level system to avoid as much locking as possible (in poolqueue.go), where one P can take an item from another P's chunk of the pool if its own chunk is empty.
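To make the contention point concrete, here is a sketch of the usual way sync.Pool gets used from several goroutines at once. Nothing in it depends on the P-local behavior; that only affects how cheap the Get and Put calls are. The bufPool variable and the greet function are hypothetical names for this example.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable *bytes.Buffer values. New is only called
// when the pool has nothing available to return.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// greet is a hypothetical helper that builds a string in a pooled buffer.
func greet(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold old contents
	defer bufPool.Put(buf)

	buf.WriteString("hello, ")
	buf.WriteString(name)
	// String() copies the bytes, so the result stays valid after Put.
	return buf.String()
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			fmt.Println(greet(fmt.Sprintf("goroutine %d", i)))
		}(i)
	}
	wg.Wait()
}
```

Whether a particular goroutine gets back a buffer that it (or anything else on the same P) put in earlier is up to the runtime, so the code has to treat every buffer from Get as being in an unknown state.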
What this light exploration of
sync.Pool has taught me is
that sync.Pool has a much more sophisticated and optimized
implementation than I would have expected. You could implement a
version of sync.Pool with relatively simple mutexes (and maybe
atomics), but the actual Go standard library goes to some effort
to make it efficient in the face of significant concurrency. Perhaps
this shouldn't be surprising, since sync.Pool is used in some hot
spots in the rest of the standard library (the sync.Pool documentation
mentions fmt as an example).
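For contrast, here is roughly what I mean by a version with relatively simple mutexes. This is my own naive sketch, not anything from the standard library, and mutexPool is a made-up type. Every Get and Put takes the same lock, which is exactly the contention that the P-local design avoids, and unlike sync.Pool it never releases idle items to the garbage collector.

```go
package main

import (
	"fmt"
	"sync"
)

// mutexPool is a deliberately naive pool: one shared slice, one lock.
type mutexPool struct {
	mu    sync.Mutex
	items []any
	New   func() any // optional; called when the pool is empty
}

// Get returns the most recently stored item, or a fresh one from New.
func (p *mutexPool) Get() any {
	p.mu.Lock()
	if n := len(p.items); n > 0 {
		x := p.items[n-1]
		p.items[n-1] = nil // drop our reference to the item
		p.items = p.items[:n-1]
		p.mu.Unlock()
		return x
	}
	p.mu.Unlock()
	if p.New != nil {
		return p.New()
	}
	return nil
}

// Put stores x for later reuse.
func (p *mutexPool) Put(x any) {
	p.mu.Lock()
	p.items = append(p.items, x)
	p.mu.Unlock()
}

func main() {
	p := &mutexPool{New: func() any { return new(int) }}
	a := p.Get().(*int)
	p.Put(a)
	b := p.Get().(*int)
	fmt.Println("same object reused:", a == b)
}
```

With a single goroutine this behaves fine; the cost only shows up when many goroutines hammer on the one mutex at the same time.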