
2020-09-01

Even in Go, concurrency is still not easy (with an example)

Go is famous for making concurrency easy, through good language support for goroutines. Except what Go makes easy is only one level of concurrency, the nuts and bolts level of making your code do things concurrently and communicating back and forth through channels. Making it do the right things concurrently is still up to you, and unfortunately Go doesn't currently provide a lot of standard library support for correctly implemented standard concurrency patterns.

For example, one common need is for a limited amount of concurrency; you want to do several things at once, but only so many of them. At the moment this is up to you to implement on top of goroutines, channels, and things like the sync package. This is not as easy as it looks, and quite competent people can make mistakes here. As it happens, I have an example ready to hand today.

Gops is a convenient command to list (and diagnose) Go processes that are currently running on your system. Among other things, it'll tell you which version of Go they were compiled with, which is handy if you want to see if you have out of date binaries that should be rebuilt and redeployed. One of the things gops needs to do is look at all of the Go processes on your system, which it does concurrently. However, it doesn't want to look at too many processes at once, because that can cause problems with file descriptor limits. This is a classic case of limited concurrency.

Gops implements this at the moment with code in goprocess.FindAll() that looks like this, in somewhat sketched and reduced form:

func FindAll() []P {
   pss, err := ps.Processes()
   [...]
   found := make(chan P)
   limitCh := make(chan struct{}, concurrencyProcesses)

   for _, pr := range pss {
      limitCh <- struct{}{} // take a token; blocks if all tokens are in use
      pr := pr
      go func() {
         defer func() { <-limitCh }() // release the token when this goroutine finishes
         [... get a P with some error checking ...]
         found <- P // send the result on the unbuffered found channel
      }()
   }
   [...]

   var results []P
   for p := range found {
      results = append(results, p)
   }
   return results
}

(In the real code there's a WaitGroup for coordination, and the found channel gets closed appropriately.)

How this works is clear, and is a standard pattern (covered in eg Go 101's Channel Use Cases). We use a buffered channel to provide a limited number of tokens; sending a value into the channel implicitly takes a token (and blocks if the token supply is exhausted), while receiving a value from it puts a token back in. We take a token before we start a new goroutine, and the goroutine releases the token when it's done.
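
On its own, the token mechanism is easy to get right. Here is a minimal, self-contained sketch of just that part (my own illustration with made-up names and a made-up amount of 'work', not the gops code; note that there is no result channel involved):

package main

import (
   "fmt"
   "sync"
   "time"
)

func main() {
   const limit = 3
   limitCh := make(chan struct{}, limit) // a buffered channel of 'limit' tokens
   var wg sync.WaitGroup

   for i := 0; i < 10; i++ {
      limitCh <- struct{}{} // take a token; blocks once 'limit' goroutines are running
      wg.Add(1)
      go func(n int) {
         defer wg.Done()
         defer func() { <-limitCh }()      // put the token back when done
         time.Sleep(10 * time.Millisecond) // a stand-in for the real work
         fmt.Println("finished item", n)
      }(i)
   }
   wg.Wait()
}

This version is safe precisely because the goroutines have nothing else to do before they release their token.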

Except that the FindAll() code above has a bug if there are too many processes to examine. Even when you know the bug is there, it may not be obvious what it is.

The bug is that the goroutines only receive from limitCh to release their token after they've sent their result to the unbuffered found channel, while the main code only starts receiving from found after it has run through the entire loop, and it's the main loop that takes tokens and blocks when none are available. So if there are too many processes to go through, the loop starts as many goroutines as there are tokens, those goroutines all block trying to send to found (and so never receive from limitCh), and the main for loop blocks trying to send to limitCh, never reaching the point where it starts receiving from found.
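
You can watch this happen in a stripped-down, self-contained version of the same structure (my own sketch with made-up names, not the real gops code). With more items than tokens, the Go runtime kills it with 'fatal error: all goroutines are asleep - deadlock!':

package main

import (
   "fmt"
   "sync"
)

func main() {
   const limit = 2
   items := []int{1, 2, 3, 4, 5} // more items than tokens
   found := make(chan int)       // unbuffered, like the found channel in gops
   limitCh := make(chan struct{}, limit)
   var wg sync.WaitGroup

   for _, it := range items {
      limitCh <- struct{}{} // the main loop takes the token...
      wg.Add(1)
      it := it
      go func() {
         defer wg.Done()
         defer func() { <-limitCh }() // ...and only releases it after the send below
         found <- it                  // blocks: nothing is receiving from found yet
      }()
   }
   go func() {
      wg.Wait()
      close(found)
   }()

   // Never reached when limit < len(items): once 'limit' goroutines are blocked
   // sending to found, the for loop above blocks forever sending the next token.
   for v := range found {
      fmt.Println(v)
   }
}

If you shrink items (or raise limit) so that everything fits within the token supply, the same program runs fine, which is part of what makes the bug easy to miss.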

At one level, this is a very fragile bug; it only exists because of a combination of circumstances. If the goroutines took the token themselves by sending to limitCh, instead of the main for loop doing it, the bug would not exist; the main loop would start them all, many of them would immediately block waiting for a token, and the main code would go on to receive from found, letting the goroutines that do hold tokens finish, release them, and unblock the rest. If the goroutines released their token by receiving from limitCh before sending to found, the bug wouldn't exist either (but because of error handling, it's simpler and more reliable to do the receive in a defer). And if the entire for loop was in an additional goroutine, the main code would go on to receive from found and unblock completed goroutines so they could release their tokens, so it wouldn't matter that the for loop was blocked waiting to send to limitCh.
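
To make the first of those fixes concrete, here is the same stripped-down sketch rearranged so that each goroutine takes its own token (again my illustration, not the actual gops code); this version runs to completion no matter how many items there are:

package main

import (
   "fmt"
   "sync"
)

func main() {
   const limit = 2
   items := []int{1, 2, 3, 4, 5}
   found := make(chan int)
   limitCh := make(chan struct{}, limit)
   var wg sync.WaitGroup

   for _, it := range items {
      wg.Add(1)
      it := it
      go func() {
         defer wg.Done()
         limitCh <- struct{}{}        // each goroutine takes its own token,
         defer func() { <-limitCh }() // so the main loop never blocks
         found <- it
      }()
   }
   go func() {
      wg.Wait()
      close(found)
   }()

   // The loop above finishes immediately, so we get here and start receiving,
   // which lets goroutines holding tokens send their results and release their
   // tokens, unblocking the goroutines that are still waiting for one.
   for v := range found {
      fmt.Println(v)
   }
}

The tradeoff of this arrangement is that you start a goroutine per item up front; only the real work is limited.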

At another level, this shows how concurrency in Go is not as easy as it looks. All it takes is one mistake for things to skid to a halt, and all of the code involved can look good to a casual examination. Getting concurrency correct is simply hard for people (we can debate about why, but I think it's very clear that it is).

(I'm sure that the people who wrote and approved the change that added this concurrency limiting code to gops were good programmers. A tricky case still tripped them up, passing all of their scrutiny. Even when I knew that there was a concurrency problem in the code and where it was (because my gops was hanging all of a sudden, and Delve told me where everything was stuck), it still took me some time to see what the exact problem was.)

programming/GoConcurrencyStillNotEasy written at 23:57:46

