2020-07-05
A Go lesson learned: sometimes I want to avoid goroutines if possible
We have a heavily NFS-based server environment here, with multiple NFS servers and an IMAP server that accesses all mailboxes over NFS. That IMAP server has had ongoing issues with elevated load averages and what at least seems to be IMAP slowness. However, our current metrics leave a lot of uncertainty about all of this, because we basically only have a little performance data for a few IMAP operations. One thing I'd like to do is gather some very basic Unix-level NFS performance data from our IMAP server and from some other machines, to see if I can spot anything.
One very simple metric is how long it takes to read a little file from every NFS filesystem we have mounted on a machine. As it happens, we already have the little files (they're used for another system management purpose), so all I need is a program to open and read each one while timing how long it takes. There's an obvious issue with doing this sequentially, which is that if there's a single slow filesystem, it could delay everything else.
The obvious answer here was Go, goroutines, and some form of goroutine pool. Because the goroutines just do IO (and they're only being used to avoid one bit of IO delaying another separate bit), the natural size of the goroutine pool is fairly large, say 50 to 100 goroutines (we have a lot of NFS filesystems). This is quite easy and obvious to implement in Go, so I put together a little Go program for it and watched the numbers it generated as they jumped around.
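In rough outline, the goroutine pool version looked something like the following sketch (this isn't my actual program; the names, the fixed pool size, and taking the list of little files from the command line are just for illustration):

    package main

    import (
        "fmt"
        "os"
        "sync"
        "time"
    )

    // timeRead opens and reads one little file, timing how long the
    // whole open plus read takes.
    func timeRead(target string) (time.Duration, error) {
        buffer := make([]byte, 4096)
        start := time.Now()
        file, err := os.Open(target)
        if err != nil {
            return 0, err
        }
        defer file.Close()
        _, err = file.Read(buffer)
        return time.Now().Sub(start), err
    }

    func main() {
        // One little file per NFS filesystem.
        targets := os.Args[1:]
        const workers = 50 // the 'natural' fairly large pool size

        work := make(chan string)
        var wg sync.WaitGroup
        for i := 0; i < workers; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for target := range work {
                    dur, err := timeRead(target)
                    if err != nil {
                        fmt.Fprintf(os.Stderr, "%s: %v\n", target, err)
                        continue
                    }
                    fmt.Printf("%s: %v\n", target, dur)
                }
            }()
        }
        for _, t := range targets {
            work <- t
        }
        close(work)
        wg.Wait()
    }

(Setting workers to 1 in a sketch like this is what gives the essentially serial version I talk about next.)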
Then, out of reflexive caution, I tried running the same program with a goroutine pool size of 1, which more or less forced serial execution (the pool goroutine infrastructure was still active, but there was only one worker goroutine doing all the reading). To my surprise, the 'time to read a file' numbers for all filesystems were visibly and decidedly lower. I could run the program side by side with the two different goroutine pool sizes and see this clearly.
Some thinking gave me a possible reason why this is so. My core code does essentially the following (minus error checking):
    start := time.Now()
    file, err := os.Open(target)
    n, err := file.Read(buffer)
    duration := time.Now().Sub(start)
This sequence makes two system calls and each system call is a potential goroutine preemption point. If a goroutine gets preempted during either system call, it can only record the finishing time once it's scheduled again (and finishes the read, if it was preempted in the open). If there are 50 or more goroutines all doing this, some of them could well be preempted and then not scheduled for some time, and that scheduling delay will show up in the final duration. When there aren't multiple goroutines active, there should be very little scheduling delay and the recorded durations (especially the longest durations) will be lower. And the ideal situation for essentially no goroutine contention is of course one goroutine.
(Technically this sequence also asks for the time at the start and the end, which would be two more system calls, but on modern systems, especially Linux, getting the time doesn't take long enough to trigger Go's system call preemption and probably doesn't even enter the kernel at all.)
Because I still worry about individual slow filesystems slowing everything down (or outright stalls on some filesystems), my solution was a more complicated worker pool approach that starts additional worker goroutines only when all of the current ones seem to have stalled for too long. If all goes well (and it generally does in my testing), this runs with only one worker goroutine.
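A rough sketch of the idea (not my actual code; the names are invented, and I'm using 'no read has finished recently' as a simple stand-in for however you want to detect that all current workers have stalled) looks something like this:

    package main

    import (
        "fmt"
        "os"
        "sync"
        "sync/atomic"
        "time"
    )

    // pool starts with one worker goroutine and adds more only when
    // progress appears to have stalled.
    type pool struct {
        work     chan string
        wg       sync.WaitGroup
        progress atomic.Int64 // UnixNano of the most recently finished read
        workers  atomic.Int64
    }

    func (p *pool) worker(read func(string)) {
        defer p.wg.Done()
        for target := range p.work {
            read(target)
            p.progress.Store(time.Now().UnixNano())
        }
    }

    func (p *pool) startWorker(read func(string)) {
        p.workers.Add(1)
        p.wg.Add(1)
        go p.worker(read)
    }

    // run feeds targets to the pool. If nothing has finished for 'stall'
    // and there is still work outstanding, another worker is started,
    // up to maxWorkers.
    func run(targets []string, read func(string), stall time.Duration, maxWorkers int64) {
        p := &pool{work: make(chan string)}
        p.progress.Store(time.Now().UnixNano())
        p.startWorker(read)

        fed := make(chan struct{})
        go func() {
            defer close(fed)
            for _, t := range targets {
                p.work <- t
            }
            close(p.work)
        }()

        ticker := time.NewTicker(stall / 2)
        defer ticker.Stop()
        for {
            select {
            case <-fed:
                // All work has been handed out; just wait for the workers.
                p.wg.Wait()
                return
            case <-ticker.C:
                last := time.Unix(0, p.progress.Load())
                if time.Since(last) > stall && p.workers.Load() < maxWorkers {
                    p.startWorker(read)
                }
            }
        }
    }

    func main() {
        read := func(target string) {
            buffer := make([]byte, 4096)
            start := time.Now()
            file, err := os.Open(target)
            if err != nil {
                fmt.Fprintf(os.Stderr, "%s: %v\n", target, err)
                return
            }
            defer file.Close()
            file.Read(buffer)
            fmt.Printf("%s: %v\n", target, time.Now().Sub(start))
        }
        run(os.Args[1:], read, 2*time.Second, 50)
    }

(The 2 second stall threshold and the cap of 50 workers here are arbitrary numbers for the sketch, not what the real program uses.)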
(My current code has the drawback that once the goroutine worker pool expands, all of the workers stay active, which means that enough slow filesystems early on in the checks could land me back in the thundering herd problem. I'm still thinking about that issue.)