An interesting Go concurrency bug that I inflicted on myself

July 4, 2014

While working on my Go sinkhole SMTP server, I managed to stick myself with an interesting little concurrency bug that I feel like writing up today. The server takes a file of control rules, and to be simple it reloads and re-parses the control rules on every new connection. We do want to print a message if there's an error in the rules, but we don't want to print it on every connection; that could be a lot of duplicate messages, even concurrent duplicate messages (since each connection is handled in a separate goroutine). So I adopted the simple Go way of deduplicating messages in the face of concurrency: warning messages are sent down a channel to a single goroutine that receives them all, checks for repeated messages, and prints the message if it's not a repeat. I tested all of this and it worked fine; warning messages were printed, but only once.

Then I decided to be friendly and have the program immediately check the control rules during startup, so it could error out right away if there were problems. The code to do this looked like:

// start warn-once backend
go warnbackend()

_, isgood := setupRules(baserules)
if !isgood {
  die("will not continue with rules problems.")

When I tested this with a rules file with a deliberate error, the program printed the 'will not continue' message and exited but did not print the actual parsing error message. I spent rather a while scratching my head and trying things before I realized what was going on: I had a scheduling race. While setupRules() had dispatched its warning message down the channel and the warnbackend() goroutine had picked it up (at least conceptually), the goroutine hadn't gotten as far as actually printing out the message by the time the main flow of code called die() and the whole program exited.

(The Go runtime doesn't currently print any warning messages if your program exits with active goroutines.)

This is actually a slightly subtle Go scheduling race. Go guarantees that sending something into an unbuffered channel will block until there is a ready receiver, but as I discovered this is not the same thing as guaranteeing that the receiver will do anything before you continue from the send. If you need the receiver of a message to do anything definite before you do something yourself, you need to do more than just send one message into the channel.

The cure for this bug was to force a synchronization point by sending a null warning message just before calling die():

_, isgood := setupRules(baserules)
if !isgood {

This forces us to wait until warnbackend() has processed and printed any message (or messages) from setupRules() and returned to the point where it's waiting to receive something from the channel again. warnbackend() may or may not process our null message before the program exits but we don't care about that.

(We know that warnbackend() will process all messages from setupRules() before processing our null message because Go guarantees that channel messages are delivered in order.)

Comments on this page:

By at 2015-12-17 00:05:49:

I guess the warning("") is taking the extra processor time which takes to recieve the error message (please correct me if I am wrong). Would it not have been possible to do the same by using select{case <-recieve chan string:}? this would have ensured that the execution of the main function will wait until a message is recieved or go to the default case, which can be to print"exited without getting error message".

By cks at 2015-12-17 00:20:56:

The warnonce("") isn't there to use up some processor time, which would be unreliable in any case. It's there to force a wait until the warning goroutine has fully processed a previous warning message. Since this is an unbuffered channel, we won't return from warnonce() until the goroutine has done a channel receive, and it won't do a channel receive for our new message until it's actually printed out the old one.

I can't think of any other easy way to do this, although there are all sorts of more heavyweight methods. At a guess, the semi-simplest would be an additional pair of channels used to signal shutdown to the goroutine and then get its acknowledgement of having shut down (the latter being the important thing here).

Written on 04 July 2014.
« Bash is letting locales destroy shell scripting (at least on Linux)
Another reason to use frameworks like Django »

Page tools: View Source, View Normal, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Fri Jul 4 00:54:59 2014
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.