A Go lesson learned: sometimes I don't want to use goroutines if possible

July 5, 2020

We have a heavily NFS-based server environment here, with multiple NFS servers and an IMAP server that accesses all mailboxes over NFS. That IMAP server has had ongoing issues with elevated load averages and what at least seems to be IMAP slowness. However, our current metrics leave a lot of uncertainty about the effects of all of this, because we basically have only a little performance data for a few IMAP operations. One thing I'd like to do is gather some very basic Unix-level NFS performance data from our IMAP server and from some other machines, to see if I can spot anything.

One very simple metric is how long it takes to read a little file from every NFS filesystem we have mounted on a machine. As it happens, we already have the little files (they're used for another system management purpose), so all I need is a program to open and read each one while timing how long it takes. There's an obvious issue with doing this sequentially, which is that if there's a single slow filesystem, it could delay everything else.

The obvious answer here was Go, goroutines, and some form of goroutine pool. Because the goroutines just do IO (and they're only being used to avoid one bit of IO delaying another separate bit), the natural size of the goroutine pool is fairly large, say 50 to 100 goroutines (we have a lot of NFS filesystems). This is quite easy and obvious to implement in Go, so I put together a little Go program for it and watched the numbers it generated as they jumped around.
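(As a concrete illustration of the shape of such a program, here's a minimal sketch of a fixed-size goroutine pool. The probe file paths, the pool size, and the timeRead helper are my inventions for illustration; the per-file timing is the same open-and-read sequence shown below.)

package main

import (
	"fmt"
	"os"
	"sync"
	"time"
)

// timeRead opens and reads one small probe file, returning how long
// the open plus the read took.
func timeRead(target string) (time.Duration, error) {
	buffer := make([]byte, 4096)
	start := time.Now()
	file, err := os.Open(target)
	if err != nil {
		return 0, err
	}
	defer file.Close()
	_, err = file.Read(buffer)
	return time.Since(start), err
}

func main() {
	// Hypothetical per-filesystem probe files.
	targets := []string{"/nfs/fs1/.probe", "/nfs/fs2/.probe", "/nfs/fs3/.probe"}
	const poolSize = 50

	work := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < poolSize; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for target := range work {
				dur, err := timeRead(target)
				if err != nil {
					fmt.Printf("%s: error: %v\n", target, err)
					continue
				}
				fmt.Printf("%s: %v\n", target, dur)
			}
		}()
	}
	for _, t := range targets {
		work <- t
	}
	close(work)
	wg.Wait()
}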

Then, out of reflexive caution, I tried running the same program with a goroutine pool size of 1, which more or less forced serial execution (the pool goroutine infrastructure was still active but there was only one worker goroutine doing all the reading). To my surprise the 'time to read a file' number for all filesystems was visibly and decidedly lower. I could run the program side by side with the two different goroutine pool sizes and see this clearly.

Some thinking gave me a possible reason why this is so. My core code does essentially the following (minus error checking):

start := time.Now()
file, err := os.Open(target) // system call one: open(2), a preemption point
n, err := file.Read(buffer)  // system call two: read(2), another preemption point
duration := time.Since(start)

This sequence makes two system calls and each system call is a potential goroutine preemption point. If a goroutine gets preempted during either system call, it can only record the finishing time once it's scheduled again (and finishes the read, if it was preempted in the open). If there are 50 or more goroutines all doing this, some of them could well be preempted and then not scheduled for some time, and that scheduling delay will show up in the final duration. When there aren't multiple goroutines active, there should be very little scheduling delay and the recorded durations (especially the longest durations) will be lower. And the ideal situation for essentially no goroutine contention is of course one goroutine.

(Technically this makes two more system calls to get the time at the start and the end of the sequence, but on modern systems, especially Linux, these don't take long enough to trigger Go's system call preemption and probably don't even enter the kernel itself.)

Because I still worry about individual slow filesystems slowing everything down (or stalls on some filesystems), my solution was a more complicated work pool approach that starts additional worker goroutines only when all of the current ones seem to have stalled for too long. If all goes well (and it generally does in my testing), this runs with only one goroutine.
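(The general shape of this is something like the following sketch. This is my reconstruction of the idea, not the actual program; stallTimeout, the channel plumbing, and the reuse of timeRead from the earlier sketch are all invented details.)

// runAdaptive starts with a single worker goroutine and adds another
// one whenever no result has arrived for stallTimeout, on the theory
// that all of the current workers are stuck on slow filesystems.
func runAdaptive(targets []string, stallTimeout time.Duration) {
	work := make(chan string, len(targets))
	for _, t := range targets {
		work <- t
	}
	close(work)

	results := make(chan string)
	worker := func() {
		for t := range work {
			dur, err := timeRead(t)
			results <- fmt.Sprintf("%s: %v (err: %v)", t, dur, err)
		}
	}
	go worker()

	for done := 0; done < len(targets); {
		select {
		case r := <-results:
			fmt.Println(r)
			done++
		case <-time.After(stallTimeout):
			// Every current worker appears stalled; add one more.
			// Once added, workers stay until the work runs out,
			// which is the drawback mentioned below.
			go worker()
		}
	}
}

If all filesystems respond promptly, the timeout case never fires and the single initial worker handles everything serially.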

(My current code has the drawback that once the goroutine worker pool expands, all of them stay active, which means that enough slow filesystems early on in the checks could get me back to the thundering herd problem. I'm still thinking about that issue.)


Comments on this page:

Do you have any numbers for the difference in timings (or rather the order of magnitude of the difference) you saw between max goroutines and a single goroutine?

Also, if you are really worried about goroutine preemption, would keeping the pool size somewhere just short of GOMAXPROCS be good enough (leaving a couple of processors for GC etc)? So about 15-20 on a 24-core server?

By Chris at 2020-07-06 13:38:16:

Why not decrease the worker pool size when the long-running goroutine completes, using channels to signal completion, maybe?

Another thought I had was to use a buffered channel, so only $x workers can signal to increase the pool size.
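(A small self-contained illustration of that buffered-channel idea; the capacity of three is arbitrary. The channel's capacity caps how many 'grow the pool' signals can ever succeed:)

package main

import "fmt"

func main() {
	// A buffered channel used as a counting semaphore: only cap(grow)
	// sends can succeed without a matching receive, so at most three
	// "grow the pool" signals are ever accepted.
	grow := make(chan struct{}, 3)
	for i := 0; i < 5; i++ {
		select {
		case grow <- struct{}{}:
			fmt.Println("grew the pool")
		default:
			fmt.Println("at the cap, not growing")
		}
	}
}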

By cks at 2020-07-11 15:31:43:

In testing now between 'maximum goroutines at once' and 'one goroutine', it looks like the slowest filesystem checks are normally between two and three times slower, with some blips up to ten times slower.

Decreasing the worker pool size on the fly requires more coordination, which means more code (and probably more channels). You need some way for worker goroutines to decide when to exit, but you don't want all of them to exit on a fixed schedule (the ideal is for one worker goroutine to run all through). I'm sure I could implement it but the simpler version works well enough for now.

(There are relatively simple implementations, such as making additional worker goroutines only process a few filesystems, but I think they involve more goroutine churn and overall delays.)
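(As an illustration of that limited-worker variant, reusing the work and results channels and the timeRead helper from the adaptive sketch above; the limit is arbitrary:)

// limitedWorker handles at most limit filesystems and then exits,
// so extra workers drain away on their own instead of staying
// active for the rest of the run.
limitedWorker := func(limit int) {
	for i := 0; i < limit; i++ {
		t, ok := <-work
		if !ok {
			return
		}
		dur, err := timeRead(t)
		results <- fmt.Sprintf("%s: %v (err: %v)", t, dur, err)
	}
}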

By Laurent Verdoïa at 2020-10-31 03:44:07:

Hi,

There is a big focus on individual goroutine latency, but there are no numbers about overall throughput. With a lot of goroutines and few CPUs you expect higher latency, but since several goroutines are running, overall throughput should increase. It could be interesting to test again with a few more stats.

By cks at 2020-10-31 15:05:34:

In this specific situation, what matters most is accurately timing the system level operations. Overall throughput is not a useful metric if it creates timing results that are distorted by goroutine scheduling delays, which seems to be the case; we would much rather have less throughput and more accurate measurements (up to a point, because of other effects).
