2010-12-30
Why you need select() even with communication channels
Go has re-popularized the idea of handling
all of your blocking waiting-for-things operations by using CSP-like
communication channels instead of select() (in Go, using goroutines
and channels). However, it's my firm belief that this isn't good enough;
despite what some people think, you cannot replace select() in most
common CSP-like implementations.
The crucial thing that select() gives you is the ability to stop
waiting for something in response to some external event or change in
program state. In a select()-based environment you stop trying to read
or write a file descriptor simply by omitting it from the set of file
descriptors you give to select(), and you get a chance to do this
every time IO happens (and you can make IO happen in response to other
events, using long-standing tricks).
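(To make this concrete, here is a minimal Python sketch of such a loop; the connection objects and their want_read/want_write flags are invented for illustration:)

import select

def event_loop(conns):
    # conns: a dict mapping file descriptors to per-connection state
    # objects (a made-up structure for this sketch).
    while conns:
        # Rebuild the fd sets from current program state on every pass;
        # a connection we no longer want to read from is simply left
        # out of the read set this time around.
        rfds = [fd for fd, c in conns.items() if c.want_read]
        wfds = [fd for fd, c in conns.items() if c.want_write]
        readable, writable, _ = select.select(rfds, wfds, [])
        for fd in readable:
            conns[fd].handle_read()
        for fd in writable:
            conns[fd].handle_write()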
In a CSP-like environment, the traditional way to handle outside blocking operations is to perform them in a separate goroutine (in Go's terminology), forwarding the results to the rest of the program over a channel. The goroutine alternates between doing the blocking operation and talking to the channel (sending results to it, getting new requests from it, or both); the rest of the program can then wait for all of its IO, or continue processing, or whatever it wants.
It's relatively easy to interrupt such a goroutine if it's currently
trying to talk to the channel; you send it a 'poison pill' message that
tells it to shut down. However, sending a poison pill message does
nothing until the goroutine can pick it up; if the goroutine is blocked
in an outside operation such as read() or write(), it's not looking
for messages over its channel. Unless you can either forcefully kill the
goroutine or interrupt the blocking operation somehow, you're out of
luck. Most of the time you can't interrupt the blocking operation itself
(at least not without additional consequences that you don't want) and
most CSP-like implementations don't give you a way of killing goroutines
(because not allowing that simplifies the runtime environment).
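Here is a sketch of both the pattern and the problem, using Python threads and Queues as rough stand-ins for goroutines and channels (all of the names here are my own):

import os
import queue
import threading

def reader_worker(fd, results, control):
    # Alternate between a blocking operation and the 'channels'.
    # While we are parked inside os.read(), a poison pill sitting in
    # 'control' goes unnoticed until some data happens to arrive on fd.
    while True:
        data = os.read(fd, 4096)          # may block indefinitely
        try:
            if control.get_nowait() == "quit":
                return                    # pill finally seen
        except queue.Empty:
            pass
        results.put(data)
        if not data:                      # EOF
            return

def start_reader(fd):
    results, control = queue.Queue(), queue.Queue()
    threading.Thread(target=reader_worker, args=(fd, results, control),
                     daemon=True).start()
    return results, control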
Even without an explicit need to interrupt blocking operations, the result can be more complex simply because you need to communicate decisions about what to do back and forth between multiple pieces, some of which sometimes block and don't generate status messages when you'd like them to. For instance, consider the buffering logic for a network copying program, where you want to have a maximum size internal buffer that can be fed and drained asynchronously, with the reader side stopping reading from the network when the buffer is too full. I think that you wind up with an extra 'buffer' goroutine in the middle just to keep track of the buffer space remaining; you can't delegate the work to the write-out side, because the write-out side might be blocked when the reader needs to know if it should keep reading or stall.
(Disclaimer: I could be missing some well-known way around this here since I don't have much experience with CSP-like environments.)
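For contrast, here is a sketch of that bounded-buffer network copy done with select() in a single thread (the function and its details are my own invention). There is no extra party doing buffer accounting; stalling the reader is just leaving the read fd out of the set while the buffer is full:

import os
import select

def copy_bounded(rfd, wfd, maxbuf=64 * 1024):
    buf = b""
    eof = False
    while not eof or buf:
        # Only ask to read while there is buffer space left; only ask
        # to write while there is something buffered.
        rset = [rfd] if not eof and len(buf) < maxbuf else []
        wset = [wfd] if buf else []
        readable, writable, _ = select.select(rset, wset, [])
        if rfd in readable:
            data = os.read(rfd, maxbuf - len(buf))
            if data:
                buf += data
            else:
                eof = True
        if wfd in writable:
            buf = buf[os.write(wfd, buf):]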
Sidebar: the two uses of select() here
There are two uses of select() in this situation: waiting for multiple
IO sources at once, and allowing you to efficiently and accurately
report how much data is still waiting to be written (which only requires
waiting on a single IO source). What I'm writing about is the first
use. In the network copying example, I'm sort of handwaving the second
case by assuming that there is some way of doing it, possibly with support
from the runtime.
2010-12-26
A lesson for myself: write tests. Really.
As I've noted before, I am sort of a half-hearted user of code testing; I like it, but I haven't internalized it to the degree that I do it all the time, even if it's painful or annoying. Sometimes this bites me on the rear as a learning experience, as happened recently.
I have a Python module for dealing with ranges of IP addresses (often in the form of CIDR netblocks); I wrote it at least six years ago and have used it in various programs ever since. One of its interfaces is a method to add an improper CIDR to an IP address range object, for which the code goes roughly:
def addimproper(self, cidr):
    (low, high) = convert(cidr, improper=True)
    self.addrange(low, high)
Recently I wrote a program that needed to work with improper CIDRs and so used this interface. Except that it didn't work, and inspection of the IP address range objects showed that they were actually empty.
This was because my code had a bug. Actually reading it showed that addimproper() was missing the last line, the one that actually added the CIDR's IP address range to the object. And this bug was not newly introduced; it had been there since the beginning, six years ago, when I wrote this method (and then never used it).
Some people would say that there are two failures here, but there's certainly at least one: I added this interface and then never tested it. Never mind formal tests; I never even loaded the module in the Python interpreter and tested that this method worked and did what I expected. (Given that there are various interpretations of improper CIDRs, both parts are important.)
This will hopefully serve as a salutary and humbling lesson to myself on the merits of writing tests. (Certainly it's been an embarrassing and humbling experience by itself; I just hope that I learn from it in practice, as opposed to just theory. I suppose that a good first step would be to augment the module's existing tests to cover this case.)
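(For illustration, even a test as minimal as this sketch would have caught the bug; the module and class names are stand-ins for the real ones, and I'm assuming the range object supports len():)

import unittest
from ipranges import IPRanges     # hypothetical module and class

class TestAddImproper(unittest.TestCase):
    def test_addimproper_adds_something(self):
        r = IPRanges()
        r.addimproper("127.0.0.10/24")
        # Whatever interpretation of improper CIDRs is used, the
        # result must at least not be an empty range object.
        self.assertTrue(len(r) > 0)

if __name__ == "__main__":
    unittest.main()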
2010-12-25
Garbage-collected languages and memory allocation failures
Here is a thesis, in light of something I've seen recently: one of the weaknesses of today's high level dynamic languages is that they have basically no good way of handling being unable to allocate more memory. In fact I think this applies to pretty much any language that assumes garbage collection and automatic memory management.
There are two problems. First, dynamic languages only rarely allocate memory in more or less explicit ways, in operations which can already fail for other reasons (and thus your code already needs to be prepared for them to error out). Second, dynamic languages need to allocate memory in order to do very much of anything within the language.
The first problem effectively breaks the language semantics when memory allocation failures start; operations that should not fail start failing. Adding a key to a hash, concatenating two strings together, or even subtracting one number from another are now things that can make your code die. Almost every line and every statement can be a failure point. How do you possibly cope with that gracefully?
(This is where functional programming and other approaches where you never mutate data structures in place might have an advantage, because with care you can know that your core data structures are always intact. (Assuming that the language runtime itself properly copes with memory allocation failures halfway through operations like rebalancing the innards of a hash.))
The second problem means that there is often nothing that you can do to recover. If your process flat out cannot allocate more memory, your recovery code needs to operate in such a way that it doesn't need more memory, ie it cannot allocate new objects or even mutate existing objects in such a way that requires more memory for their internal implementation. Often this is simply impossible, because the language allocates new objects (such as call frames) as part of running your code. When it's not outright impossible, it requires either extreme care (and a very deep understanding of your runtime environment) or strong care and support from the language implementation itself (eg to provide a reserved last-chance memory allocation arena that is only used when the out-of-memory recovery code is running). Either is at least challenging to implement.
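(A Python sketch of both problems together; log_error is an imaginary stand-in for whatever last-ditch reporting you would attempt:)

def remember(cache, key, prefix, value):
    try:
        # Both the string concatenation and the dict insert may need
        # fresh memory, so either one can raise MemoryError.
        cache[key] = prefix + value
    except MemoryError:
        # The 'recovery' is on thin ice too: the % formatting and the
        # call itself both allocate, so this handler can die of
        # exactly the error it is trying to handle.
        log_error("out of memory storing %s" % key)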
Languages with explicit memory management avoid both issues (and the more explicit the better). Potential failure points are hopefully both relatively infrequent and visible, and at least in theory recovery code can run without needing memory allocations.
(There are practical difficulties, partly because many libraries assume that they can do memory allocations for internal purposes. You're probably not going to be throwing up GUI notifiers no matter what language your program is written in, for example.)
2010-12-21
Why I want tests to be easy to write
One possible answer to my testing problem from the last entry is that I should simply get to work creating the large test data needed. Sometimes programming involves doing tedious work that we don't like, and this is one of those times.
I disagree, because one of my firm opinions is that it should be easy to write (good) tests. What I want is easy automated tests.
Everyone understands why you want the tests to be automated. Having the tests automated means that you will actually run the tests, all the tests, frequently or all the time. Similar logic is why people push for fast tests (the slower your automated tests run, the less often you run them).
As for the easy part, well, I know myself. If it is not easy to write tests, I will write what feels like the minimum number of tests and stop there. The easier that tests are to write, the more likely I am to write tests, in advance, for various exceptional or odd conditions. If writing a test for something odd is a 30 second exercise, I will throw it in just to make sure that my code doesn't have a corner case; I will essentially do exploratory testing. If writing a test is a ten minute slog of tedium, well, I don't go exploring.
This is especially important in something like our ZFS spares system, because a system for deploying spare disks is (pretty much by definition) all about handling exceptional conditions where something has gone wrong. So I actively want it to be easy to check that our spares system handles crazy situations, things that I think are totally implausible but you never know.
The TDD counter argument is probably that what I am doing here is an inefficient attempt to get to 100% code coverage and 100% branch coverage, and what I should really be doing is collecting metrics on that and working towards it. I'm not sure I have an informed opinion on this, but I definitely know that it's more reassuring to be able to write a test for some weird system state that I can think of and actually see the program handling it correctly than it is to just know that the program should handle it right for whatever reason.
(I'm a sysadmin. Trust is not in my nature; verification is.)
2010-12-20
I don't understand how to test complex data structures
One of my weaknesses as a modern programmer is that I don't really understand how to do test driven development. I understand the basic ideas of automated testing and unit testing and so on, and have used them to reasonable effect every so often; where I fall down is understanding how to test things in more advanced and challenging circumstances.
My current example of this is our ZFS spares handling system. Simplified slightly, the core program works by reading state information on all of the ZFS pools and their components (some of which may be incomplete), making multiple passes over this state information to refine it and generate a collection of higher level information, and then using all of this to detect what problems exist and decide what should be done about them. Because of how ZFS organizes its configuration information, the ZFS pool data structures wind up being multi-level and relatively large and complex (ZFS is in love with things nested inside things nested inside things). Because spares replacement is a global thing, the decisions the spares system makes are based on the entire system state, not on small local attributes of one particular bit of these data structures.
Doing proper test-based development of this code certainly seems to require somehow manufacturing an entire artificially damaged set of pool configurations, ideally ones that accurately reflect our production fileservers. I don't know how I am supposed to do this in a TDD world, and I don't see any particularly good way to do it.
There are two vaguely plausible approaches. First, I could try to write the base state information from scratch. The problem is that state information is very large; even for a relatively small production fileserver it's over 500 attributes (some nested), and a full scale production fileserver that's experiencing problems will have well over a thousand attributes. Hand writing configurations of this size is sufficiently time consuming and tedious (and likely error-prone) that I am simply not going to get good situation coverage.
Alternately I could start with real state information for a working system and then selectively and automatically break it in various ways, so that it looked like disks had failed, other disks were in the process of being replaced with spares, and so on. The problem is that such modifications to the state information are themselves relatively complex once you get beyond simple situations. I would have to write an entire chunk of code to carefully mutate these data structures, including adding entirely new synthetic nesting elements that were created from scratch. This has much the same problems as complex mock objects; how do I know that my mutation code is correct?
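(As a sketch of what such mutation code looks like, using a made-up and drastically flattened state layout; real ZFS configuration data is nested much more deeply:)

import copy

def with_failed_disk(state, pool, disk):
    # Return a copy of the captured state with one disk in one pool
    # marked as faulted.  Every key here is invented for illustration.
    new = copy.deepcopy(state)
    dev = new["pools"][pool]["vdevs"][disk]
    dev["state"] = "FAULTED"
    dev["read_errors"] = 1
    return new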
(One plausible answer from testing people is that I should not have passive attribute-based data structures but instead hierarchies of objects with complex behaviors, and then I should substitute mock objects to represent broken objects. One of the many issues with this is that it proceeds straight to the complex mock objects issue.)
Presumably test driven development has an answer to this problem. I just don't know enough about how to do this to know what it is.
Sidebar: what I do right now
Right now, I test by hand by resorting to ad hoc manual techniques such as temporarily adding code to deliberately make specific bits go wrong. This has the advantage that I can make bits of the program lie to itself, but also all sorts of disadvantages and limitations; I can't automate it, I can't test truly complex things, my testing is necessarily somewhat indirect and artificial, and so on.
Oh, and I have to take the test code out before pushing updates to production (then put it back in the next time I have something to add or debug).
2010-12-09
A small request for C programmers: no static locals
I have a small suggestion for C programmers: please don't use static function local variables. Yes, I know, they've been part of C for a very long time, they are kind of nifty, and they're sometimes nice. But still, please use a file level static variable instead, which has just the same effects.
Well, almost. There is one important difference; in practice, many more tools can find and inspect file level static variables than can find and inspect function static variables. The result is that file static variables are actually accessible, while using static locals is a great way of making bits of state more or less invisible unless people try hard. This is an especially important (or bad) thing if you are building a system that is supposed to be observable and debuggable, so yes, Solaris kernel, I am looking at you.
(In theory tools could do just as good a job with function static locals, since both file statics and function statics ultimately map to storage locations. All you need is symbol table entries, a suitable naming convention, and some elbow grease and programming. In practice, well, at least one of these ingredients is usually skipped.)
The reverse corollary is also true. If you are building something like SystemTap, DTrace, or a debugger or disassembler on a system where C's static locals make it into the symbol table, please invent some syntax for talking about them (if necessary) and then let people use it in appropriate places.
(If static locals do not make it into the symbol table and you have the ear of appropriate people, you can start working to change that.)
2010-12-06
What performance anomalies mean
I'm currently engaged in a slow-moving effort to improve the performance
of Python's os.walk() filesystem walking function (I have a
long-standing interest in it and recently
wrote about optimizing filesystem
walking on Unix). As part of the work I've been benchmarking several
variants of my code, and I stumbled over a performance anomaly where
the normal os.walk() is often faster than what should be a slightly
optimized version of it.
It's awfully easy to dismiss minor performance anomalies, especially if they happen in the early stages of optimizing something (in what you already expect to be only a minor improvement at best). But to do so is to miss what they mean:
Performance anomalies are a sign that you don't understand something important about what's really going on in your system.
You might ask why this matters, especially if your full optimizations are still faster than the original code.
We don't optimize code by making random changes and seeing if they speed the code up. Instead, we have a mental model of what is making the code slow and thus what can be done to make it faster. A performance anomaly means that some part of this mental model is wrong; either we don't actually understand why the existing code is slow, or we don't understand something about the runtime environment that makes our new code slower than we think it ought to be.
(I am going to assume that you have a clear explanation for why your new
code ought to be faster, such as 'it does only one stat() instead of
two'.)
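(As a sketch of the sort of change involved, and emphatically not my actual code: a walker that classifies each name from a single os.lstat() apiece, instead of paying for a separate isdir() check, which is another stat(). Note that unlike os.walk(), this version never follows symbolic links:)

import os
import stat

def walk_onestat(top):
    dirs, nondirs = [], []
    for name in os.listdir(top):
        # One lstat() per name; the saved result could also be reused
        # for anything else the caller wants to know.
        st = os.lstat(os.path.join(top, name))
        if stat.S_ISDIR(st.st_mode):
            dirs.append(name)
        else:
            nondirs.append(name)
    yield top, dirs, nondirs
    for name in dirs:
        for result in walk_onestat(os.path.join(top, name)):
            yield result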
Understanding performance anomalies is especially important in modern high level languages because those languages do so much under the surface, hidden behind their high levels of abstraction. But you can't do deep optimizations without understanding what's happening down in the depths of your runtime environment; performance is one of the things that always leaks through abstractions. So a mysterious performance anomaly is a sign that you don't understand what's behind the abstractions as well as you thought you did.
(PS: this doesn't mean that I understand why my little os.walk()
optimization turned out to be often slower. I haven't had the time to
look into it, especially because I expect it to be difficult to track
down.)