Where (and how) you limit your concurrency in Go can matter

September 29, 2020

At the start of September, I wrote about how concurrency is still not easy even in Go, using a section of real code with a deadlock as the example. In that entry, I proposed three fixes to remove the deadlock. Since Hillel Wayne's Finding Goroutine Bugs with TLA+ has now formally demonstrated that all three of my proposed fixes work, I can talk about the practical differences between them.

For convenience, here's the original code from the first entry:

func FindAll() []P {
   pss, err := ps.Processes()
   [...]
   found := make(chan P)
   limitCh := make(chan struct{}, concurrencyProcesses)

   for _, pr := range pss {
      // deadlocks here:
      limitCh <- struct{}{}
      pr := pr
      go func() {
         defer func() { <-limitCh }()
         [... get a P (call it p) with some error checking ...]
         // and deadlocks here:
         found <- p
      }()
   }
   [...]

   var results []P
   for p := range found {
      results = append(results, p)
   }
   return results
}

The buffered limitCh channel is used to implement a limited supply of tokens, holding down the number of goroutines that are getting P's at once. The bug in this code is that the goroutines only receive from limitCh to release their token after sending their result to the unbuffered found channel, while the main code only starts receiving from found after it has run through the entire for loop, and the for loop sends to limitCh to take a token on every iteration, blocking when none are available. With enough processes in pss, the for loop stalls waiting for a token at the same time as every running goroutine stalls waiting for someone to receive from found, and we deadlock. (For more, see the original entry.)

There are at least three fixes possible: the goroutines can send to limitCh instead of the main function doing it, the goroutines can receive from limitCh before sending to found, or the entire for loop can be in an additional goroutine so that it doesn't block the main function from starting to receive from found. All three of these fixes work, as proven by Hillel Wayne, but they have different effects on the number of goroutines that this code will run if pss is large and what the state of those goroutines is.

If our goal is to minimize resource usage, the worst fix is for goroutines to receive from limitCh before sending to found. This fix will cause almost all of the goroutines to stall in their send to found, because all but a few of them must be started and run almost to completion before the main code can finish the for loop and start receiving from found to unblock those sends and let the goroutines exit. Goroutines waiting to send are holding on to their fully expanded stacks, and possibly to other resources that won't be released until they exit and what they were using becomes garbage that the collector can reclaim (or until additional defer statements release things).
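
To make this concrete, here's a minimal runnable sketch of this fix. The P type, the getP function, the []int argument, and the sync.WaitGroup arrangement for closing found are all stand-ins of my own invention for the parts the original code elides:

package main

import (
   "fmt"
   "sync"
)

// P and getP stand in for the real result type and the elided
// 'get a P with some error checking' work.
type P int

const concurrencyProcesses = 10

func getP(i int) P { return P(i) }

func FindAll(pss []int) []P {
   found := make(chan P)
   limitCh := make(chan struct{}, concurrencyProcesses)
   var wg sync.WaitGroup

   for _, pr := range pss {
      // the main loop still takes the token:
      limitCh <- struct{}{}
      wg.Add(1)
      pr := pr
      go func() {
         defer wg.Done()
         p := getP(pr)
         // release the token *before* sending:
         <-limitCh
         // almost every goroutine stalls here:
         found <- p
      }()
   }
   // close found once every goroutine has delivered its result.
   go func() { wg.Wait(); close(found) }()

   var results []P
   for p := range found {
      results = append(results, p)
   }
   return results
}

func main() {
   fmt.Println(len(FindAll(make([]int, 1000))))
}

Run with a thousand entries, this version completes, but at its peak nearly all thousand goroutines exist at once, parked in their sends to found.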

The middling fix is for the goroutines to send to limitCh themselves instead of the for loop doing it. We will probably create and start almost all of the goroutines for the full pss right away, which could be bad if pss is very large, but at least they all block almost immediately, with almost no resources used and very small goroutine stacks. Still, this is a bunch of memory and a bunch of (Go) scheduler churn to start all of those goroutines only to have most of them immediately block sending to limitCh. There's also going to be a lot of contention on the internal runtime locks associated with limitCh, since a lot of goroutines are queueing up on it.
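
A sketch of the middling fix, dropping into the same hypothetical scaffolding (P, getP, and the WaitGroup close) as the sketch above; only the token handling moves:

func FindAll(pss []int) []P {
   found := make(chan P)
   limitCh := make(chan struct{}, concurrencyProcesses)
   var wg sync.WaitGroup

   for _, pr := range pss {
      wg.Add(1)
      pr := pr
      go func() {
         defer wg.Done()
         // the goroutine takes its own token, and releases
         // it only after sending, as in the original:
         limitCh <- struct{}{}
         defer func() { <-limitCh }()
         found <- getP(pr)
      }()
   }
   go func() { wg.Wait(); close(found) }()

   var results []P
   for p := range found {
      results = append(results, p)
   }
   return results
}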

The best fix for resource usage is to push the for loop into its own goroutine but to otherwise keep things the same. Because that loop still sends to limitCh to take a token before it creates each new goroutine, the number of simultaneous goroutines we ever have will generally be limited to around our desired concurrency level (plus a few extra that have released their token but not yet finished exiting).
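
And a sketch of the best fix under the same assumptions; only the placement of the for loop changes:

func FindAll(pss []int) []P {
   found := make(chan P)
   limitCh := make(chan struct{}, concurrencyProcesses)

   // the entire spawning loop moves into its own goroutine,
   // so the main flow starts receiving from found right away.
   go func() {
      var wg sync.WaitGroup
      for _, pr := range pss {
         limitCh <- struct{}{}
         wg.Add(1)
         pr := pr
         go func() {
            defer wg.Done()
            defer func() { <-limitCh }()
            found <- getP(pr)
         }()
      }
      wg.Wait()
      close(found)
   }()

   var results []P
   for p := range found {
      results = append(results, p)
   }
   return results
}

Here the spawning loop itself blocks on limitCh while the main flow keeps found drained, so at any moment only around concurrencyProcesses goroutines are alive (plus the spawner).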

It's likely that none of this matters if the for loop only has to deal with a few hundred entries, which is probably the case for this code (at least most of the time). But it makes for a useful illustration. When you're writing code with enforced limited concurrency, it's probably worth thinking about where you want to limit the concurrency and what effects that choice has overall. As we can see here, small implementation choices can have potentially large impacts.

(Also, sometimes I think too much about this sort of thing.)
