2014-07-04
An interesting Go concurrency bug that I inflicted on myself
While working on my Go sinkhole SMTP server, I managed to stick myself with an interesting little concurrency bug that I feel like writing up today. The server takes a file of control rules, and to be simple it reloads and re-parses the control rules on every new connection. We do want to print a message if there's an error in the rules, but we don't want to print it on every connection; that could be a lot of duplicate messages, even concurrent duplicate messages (since each connection is handled in a separate goroutine). So I adopted the simple Go way of deduplicating messages in the face of concurrency: warning messages are sent down a channel to a single goroutine that receives them all, checks for repeated messages, and prints the message if it's not a repeat. I tested all of this and it worked fine; warning messages were printed, but only once.
Then I decided to be friendly and have the program immediately check the control rules during startup, so it could error out right away if there were problems. The code to do this looked like:
// start warn-once backend go warnbackend() _, isgood := setupRules(baserules) if !isgood { die("will not continue with rules problems.") }
When I tested this with a rules file with a deliberate error, the
program printed the 'will not continue' message and exited but did
not print the actual parsing error message. I spent rather a while
scratching my head and trying things before I realized what was
going on: I had a scheduling race. While setupRules()
had
dispatched its warning message down the channel and the warnbackend()
goroutine had picked it up (at least conceptually), the goroutine
hadn't gotten as far as actually printing out the message by the
time the main flow of code called die()
and the whole program
exited.
(The Go runtime doesn't currently print any warning messages if your program exits with active goroutines.)
This is actually a slightly subtle Go scheduling race. Go guarantees that sending something into an unbuffered channel will block until there is a ready receiver, but as I discovered this is not the same thing as guaranteeing that the receiver will do anything before you continue from the send. If you need the receiver of a message to do anything definite before you do something yourself, you need to do more than just send one message into the channel.
The cure for this bug was to force a synchronization point by
sending a null warning message just before calling die()
:
_, isgood := setupRules(baserules) if !isgood { warnonce("") die(....) }
This forces us to wait until warnbackend()
has processed and
printed any message (or messages) from setupRules()
and returned
to the point where it's waiting to receive something from the channel
again. warnbackend()
may or may not process our null message before
the program exits but we don't care about that.
(We know that warnbackend()
will process all messages from
setupRules()
before processing our null message because Go
guarantees that channel messages are delivered in order.)