Van Jacobson illustrates the importance of cache effects
One of the most exciting presentations at the recent linux.conf.au 2006 conference (ironically held in New Zealand) was Van Jacobson's talk on speeding up the Linux networking stack. On a TCP stack that he says is already one of the fastest going, he managed to well over double the performance, while removing and simplifying code, even in driver hot paths.
One of the things Van Jacobson did was to convert many of the queues used in the networking layers from linked lists (with locks) to what he calls a 'cache aware, cache-friendly queue' that is also lock-free; these queues are his 'channels'. Converting just the driver level queues to channels resulted in CPU usage dropping from 78% to 58% on a dual-CPU test machine.
This is not black magic; instead, this is a vivid illustration of just how much locks and cache contention cost you in practice. Multiple writers and atomic operations are now hugely expensive, so as Van Jacobson says, 'to go fast you want to have a single writer per [cache] line and no locks' (page 21 of his slides).
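To make the 'single writer per cache line and no locks' idea concrete, here is a minimal sketch of a single-producer, single-consumer ring in C11. It is not Van Jacobson's code. The names (spsc_ring, ring_put, ring_get), the 64-byte cache line size, and the 256-slot ring are all assumptions made for illustration. The producer is the only writer of head, the consumer is the only writer of tail, and each index sits on its own cache line. The only atomics are plain loads and stores with acquire/release ordering; there are no locks and no compare-and-swap.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed cache line size; varies by CPU */
#define RING_SLOTS 256  /* assumed ring size; must be a power of two */

static_assert((RING_SLOTS & (RING_SLOTS - 1)) == 0,
              "RING_SLOTS must be a power of two");

/* Hypothetical single-producer, single-consumer ring. */
struct spsc_ring {
    /* Written only by the producer, read by the consumer. */
    _Alignas(CACHE_LINE) atomic_size_t head;
    /* Written only by the consumer, read by the producer. */
    _Alignas(CACHE_LINE) atomic_size_t tail;
    /* Each slot is written by the producer, then read by the consumer. */
    _Alignas(CACHE_LINE) void *slot[RING_SLOTS];
};

static void ring_init(struct spsc_ring *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side only. Returns false if the ring is full. */
static bool ring_put(struct spsc_ring *r, void *item)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SLOTS)
        return false;

    r->slot[head & (RING_SLOTS - 1)] = item;
    /* Release: the slot write above becomes visible before the new head. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side only. Returns NULL if the ring is empty. */
static void *ring_get(struct spsc_ring *r)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (head == tail)
        return NULL;

    void *item = r->slot[tail & (RING_SLOTS - 1)];
    /* Release: the slot read above completes before the slot is freed. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return item;
}
```

This only works with exactly one producer thread and one consumer thread per ring, which is the point: once you need several writers to the same index you are back to locks or atomic read-modify-write operations, and to cache lines bouncing between CPUs.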
I can't help but note that Van Jacobson's channels and his overall TCP processing architecture built around them don't look very much like the conventional threaded way to do parallel programming. Instead they smell a lot more like Hoare's CSP to me.