What I learned from Google Mail's recent outage
Suppose you have a system with nodes and work items. You have to assign work items to nodes somehow; one way is to distribute work items randomly across the nodes, but another is to assign them based on some sort of fixed-outcome affinity function (such as 'is topologically nearest'). Now consider what happens when a node overloads or fails (or is simply taken out of service) and its work has to be reassigned to other nodes.
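As a toy sketch of the two strategies (the node names and positions here are my own invention): put nodes and work items at points on a line, standing in for 'topological nearness'.

```python
import random

# Hypothetical layout: node -> position on a line.
NODE_POS = {"a": 0.1, "b": 0.4, "c": 0.7, "d": 0.9}

def assign_random(item_pos, live_nodes, rng=random):
    # Random assignment: every live node is an equally likely target.
    return rng.choice(sorted(live_nodes))

def assign_nearest(item_pos, live_nodes):
    # Fixed-affinity assignment: deterministically the closest live node.
    return min(sorted(live_nodes), key=lambda nd: abs(NODE_POS[nd] - item_pos))
```

Note that if node "a" fails, every item whose nearest node was "a" now lands on the same next-nearest node, rather than being spread around.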
In a random-assignment system, the failed node's work is smeared broadly over all of the remaining nodes; each node only has to absorb a little extra work. But in a fixed-affinity system, all of the failed node's work goes to only a few nodes, the ones 'closest' to it. This adds significant load to those nodes and may push one of them into an overload failure; if that happens, it adds yet more load to the remaining nearby nodes, and suddenly you have a cascade failure marching through your system.
(The more neighbors each node has, the better here; conversely, the fewer it has, the more likely an overload cascade is to happen.)
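The cascade can be seen in a toy simulation (my own sketch, not a model of Google's actual system): n nodes sit on a ring, each carrying `base` units of load and failing above `capacity`. A failed node's load is either smeared evenly over all survivors (random assignment) or dumped on its two nearest ring survivors (fixed affinity).

```python
def cascade(n=10, base=70.0, capacity=100.0, affinity=True):
    """Fail node 0, reassign load, and return how many nodes end up failed."""
    load = {i: base for i in range(n)}
    failed = {0}
    shed = [(0, load.pop(0))]            # (failed node, load to reassign)
    while shed:
        node, extra = shed.pop()
        alive = list(load)
        if not alive:                    # everyone has failed; work is lost
            break
        if affinity:
            # fixed affinity: dump the load on the two nearest ring survivors
            targets = sorted(alive,
                             key=lambda i: min((i - node) % n, (node - i) % n))[:2]
        else:
            # random assignment: smear the load evenly over all survivors
            targets = alive
        for t in targets:
            load[t] += extra / len(targets)
        for t in list(load):
            if load[t] > capacity:       # overload: this node fails in turn
                failed.add(t)
                shed.append((t, load.pop(t)))
    return len(failed)
```

With these (arbitrary) numbers, random assignment loses one node while fixed affinity takes down the whole ring; lower the base load to leave more headroom and the affinity cascade stops at one node too.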
This possibility is probably not something that would have occurred to me before I read Google Mail's description of their recent outage (although I'm sure it's well known to experienced people in this field). Thus, the title of this entry is not sarcastic; Google's willingness to describe the outage led to me learning something potentially quite useful (or perhaps 'becoming conscious of it' is a better description).
(Hence my quite generic description of the problem, since I think it can happen in any system with these characteristics. Distributed systems without fast work reassignment and some sort of load cutoff may be especially at risk, but I suspect that this also comes up in situations like scheduling processes and allocating memory on NUMA machines where CPU modules can be unplugged.)