2014-07-06
Goroutines versus other concurrency handling options in Go
Go makes using goroutines and channels very attractive; they're consciously put forward as the language's primary way of doing concurrency and thus the default solution to any concurrency-related issue you may have. However, I'm not sure that they're the right approach for everything I've run into, although I'm still mulling over where the balance lies.
The sort of problem that channels and goroutines don't seem an entirely smooth fit for is querying shared state (or otherwise getting something from it). Suppose that you're keeping track of the set of SMTP client IPs that have tried to start TLS with you but have failed; if a client has failed TLS setup, you don't want to offer it TLS again (or at least not within a given time). Most of the channel-based solution is straightforward; you have a master goroutine that maintains the set of IPs privately and you add IPs to it by sending a message down the channel to the master. But how do you ask the master goroutine if an IP is in the set? The problem is that you can't get a reply from the master on a common shared channel because there is no way for the master to reply specifically to you.
The channel based solution for this that I've seen is to send a reply channel as part of your query to the master (which is sent over a shared query channel). The downside of this approach is the churn in channels; every request allocates, initializes, uses once, and then destroys a channel (and I think they have to be garbage collected, instead of being stack allocated and quietly cleaned up). The other option is to have a shared data structure that is explicitly protected by locks or other facilities from the sync package. This is more low level and requires more bookkeeping but you avoid bouncing channels around.
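To make that concrete, here's a minimal sketch of the reply-channel version. The names here (ipQuery, adds, queries, failedTLS, and so on) are mine for illustration, not from any real program:

// a query carries its own reply channel so the
// master goroutine can answer this caller alone.
type ipQuery struct {
	ip    string
	reply chan bool
}

var adds = make(chan string)
var queries = make(chan ipQuery)

// the master goroutine owns the set; nothing else
// touches it directly.
func master() {
	seen := make(map[string]bool)
	for {
		select {
		case ip := <-adds:
			seen[ip] = true
		case q := <-queries:
			q.reply <- seen[q.ip]
		}
	}
}

// every lookup allocates a fresh reply channel,
// uses it once, and throws it away.
func failedTLS(ip string) bool {
	q := ipQuery{ip: ip, reply: make(chan bool)}
	queries <- q
	return <-q.reply
}

You start master() in a goroutine once at program startup; after that callers just send to adds or call failedTLS().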
But efficiency is probably not the right concern for most Go programs I'll ever write. The real question is which is easier to write and results in clearer code. I don't have a full conclusion but I do have a tentative one, and it's not entirely the one I expected: locks are easier if I'm dealing with more than one sort of query against the same shared state.
The problem with the channel approach in the face of multiple sorts of queries is that it requires a lot of what I'll call type bureaucracy. Because channels are typed, each different sort of reply needs a type (explicit or implicit) to define what is sent down the reply channel. Then basically each different query also needs its own type, because queries must contain their (typed) reply channel. A lock based implementation doesn't make these types disappear but it makes them less of a pain because they are just function arguments and return values and thus they don't have to be formally defined out as Go types and/or structs. In practice this winds up feeling more lightweight to me, even with the need to do explicit manual locking.
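For example, if the master also has to answer both 'is this IP present' and 'how many failures has it had', the channel version hypothetically winds up needing something like:

// one struct type per kind of question, because
// each has a differently-typed reply channel.
type presentQuery struct {
	ip    string
	reply chan bool
}

type countQuery struct {
	ip    string
	reply chan int
}

// ... and the master's select loop grows a case
// (and usually a channel) for each of these.

The lock-based version just has two methods with plain arguments and return values.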
(You can reduce the number of types needed in the channel case by merging them together in various ways but then you start losing type safety, especially compile time type safety. I like compile time type safety in Go because it's a reliable way of telling me if I got something obvious wrong and it helps speed up refactoring.)
In a way I think that channels and goroutines can be a form of Turing tarpit, in that they can be used to solve all of your problems if you're sufficiently clever and it's very tempting to work out how to be that clever.
(On the other hand, sometimes channels are a brilliant solution to a problem that might look like it had nothing to do with them. Before I saw it demonstrated in a presentation I would never have thought of using goroutines and channels in a lexer.)
Sidebar: the Go locking pattern I've adopted
This isn't original to me; I believe I got it from the Go blog entry on Go maps in action. Presented in illustrated form:
// actual entries in our shared data structure
type ipEnt struct {
	when  time.Time
	count int
}

// the shared data structure and the lock
// protecting it, all wrapped up in one thing.
type ipMap struct {
	sync.RWMutex
	ips map[string]*ipEnt
}

var notls = &ipMap{ips: make(map[string]*ipEnt)}

// only method functions manipulate the shared
// data structure and they always take and release
// the lock. outside callers are oblivious to the
// actual implementation.
func (i *ipMap) Add(ip string) {
	i.Lock()
	... manipulate i.ips ...
	i.Unlock()
}
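A read-side query fits the same pattern; a hypothetical lookup method (not part of the original illustration) might look like:

func (i *ipMap) Present(ip string) bool {
	// readers take the read lock, so lookups
	// don't block each other.
	i.RLock()
	_, ok := i.ips[ip]
	i.RUnlock()
	return ok
}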
Using method functions feels like the most natural way to manipulate the data structure, partly because how you manipulate it is very tightly bound to what it is due to the locking requirements. And I just plain like the syntax for doing things with it:
if res == TLSERROR {
	notls.Add(remoteip)
	....
}
The last bit is a personal thing, of course. Some people will prefer standalone functions that are passed the ipMap as an explicit argument.
The problem with filenames in IO exceptions and errors
These days a common pattern in many languages is to have errors or exceptions be basically strings. They may not literally be strings, but often the only thing people really do with them is print or otherwise report their string form. Python and Go are both examples of this pattern. In such languages it's relatively common for the standard library to helpfully embed the name of the file that you're operating on in the error message for operating system IO errors. For example, here is the literal text of the errors you get for trying to open a file that you don't have access to in Go and then in Python:
open /etc/shadow: permission denied
[Errno 13] Permission denied: '/etc/shadow'
This sounds like an attractive feature, but there is a problem with it: unless the standard library does it all the time and documents it, people can't count on it, and when they can't count on it you wind up with ugly error messages in practice unless people go quite out of their way.
This stems from one of the fundamental rules of good (Unix) error messages for programs, which is thou shalt always include the name of the file you had problems with. If you're writing a program and you need to produce an error message, it is ultimately your job to make sure that the filename is always there. If the standard library gives you errors that sometimes but not always include the filename, or that are not officially documented as including the filename, you have no real choice but to include the filename yourself. Then when the standard library's error or exception does include the filename, the whole error message emitted by your program winds up mentioning the filename twice:
sinksmtp: cannot open rules file /not/there: open /not/there: no such file or directory
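In Go terms the doubling comes from perfectly reasonable-looking code; here is a minimal sketch (loadRules is a hypothetical function, not sinksmtp's actual code):

import (
	"fmt"
	"os"
)

func loadRules(rulesfile string) error {
	fh, err := os.Open(rulesfile)
	if err != nil {
		// os.Open's error already reads 'open <file>: ...',
		// so naming the file ourselves repeats it.
		return fmt.Errorf("cannot open rules file %s: %s", rulesfile, err)
	}
	defer fh.Close()
	// ... actually read and parse the rules ...
	return nil
}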
It's tempting to say that the standard library should always include the filename in error messages (and explicitly guarantee this). Unfortunately this is very hard to do in general, at least on Unix and with a truly capable standard library. The problem is that you can be handed file descriptors from the outside world and required to turn them into standard file objects that you can do ordinary file operations on, and of course there is no (portable) way to find out the file name (if any) of these file descriptors.
(Many Unixes provide non-portable ways of doing this, sometimes brute force ones; on Linux, for example, one approach is to look at /proc/self/fd/<N>.)
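As a hedged sketch of both halves, assuming we inherited descriptor 3 from our parent process: os.NewFile takes whatever name we hand it on faith, and the Linux-specific trick is just a readlink:

import (
	"fmt"
	"os"
)

// there is no portable way to recover the real name,
// so we can only label the descriptor ourselves.
func inheritedFile() *os.File {
	return os.NewFile(3, "inherited-fd-3")
}

// Linux only: ask /proc what the descriptor points at.
func fdName(fd int) (string, error) {
	return os.Readlink(fmt.Sprintf("/proc/self/fd/%d", fd))
}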