What I learned from Google Mail's recent outage

February 26, 2009

Suppose you have a system with nodes and work items. You have to assign work items to nodes somehow; one way to do it is to randomly distribute work items around the nodes, but another is to assign them based on some sort of fixed-outcome affinity function (like 'is topologically nearest'). Now consider what happens when a node overloads or fails (or is just taken out of service) and its work has to be reassigned to new nodes.

In a random-assignment system, the failed node's work is smeared broadly over all of the remaining nodes; each node only has to absorb a little bit of extra work. But in a fixed-affinity system, you are going to assign all of the work from the failed node to only a few nodes, the nodes that are 'closest' to the failed node. This will add significant load to them and may push one of those nodes into an overload failure; if this happens it adds yet more load to the remaining nearby nodes, and suddenly you have a cascade failure marching through your system.

(The more neighbors each node has the better, here, and conversely the fewer it has the more likely an overload is to happen.)
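To make the difference concrete, here is a minimal toy simulation. It is a sketch under my own assumptions, not a description of how Google's systems work: the nodes sit on a ring, 'closest' means ring distance, every node starts with the same load and capacity, and a node fails the moment its load exceeds its capacity. The `simulate` function and all of its parameters are hypothetical names invented for this illustration. Spreading work evenly over all survivors stands in for the average outcome of random assignment.

```python
def simulate(num_nodes, load, capacity, neighbors=None):
    """Fail node 0 and return how many nodes have failed once things settle.

    neighbors=None: a failed node's work is spread evenly over all survivors
    (an idealized stand-in for random assignment).
    neighbors=k: the work goes to the k nearest surviving nodes on a ring
    (an assumed stand-in for a fixed affinity function).
    """
    loads = [float(load)] * num_nodes
    alive = [True] * num_nodes
    pending = [0]  # nodes that have failed but not yet shed their work

    while pending:
        dead = pending.pop()
        if not alive[dead]:
            continue  # already handled; a node can be queued more than once
        alive[dead] = False
        work, loads[dead] = loads[dead], 0.0

        survivors = [i for i in range(num_nodes) if alive[i]]
        if not survivors:
            break  # nobody is left to take the work

        if neighbors is None:
            targets = survivors
        else:
            # Pick the k survivors with the smallest ring distance.
            def ring_distance(i):
                return min((i - dead) % num_nodes, (dead - i) % num_nodes)
            targets = sorted(survivors, key=ring_distance)[:neighbors]

        share = work / len(targets)
        for t in targets:
            loads[t] += share
            if loads[t] > capacity:
                pending.append(t)  # overloaded: this node fails next

    return alive.count(False)


if __name__ == "__main__":
    # 100 nodes, each at 70% of capacity, then node 0 fails.
    print("spread over everyone:", simulate(100, 70, 100))
    print("2 nearest neighbors: ", simulate(100, 70, 100, neighbors=2))
    print("10 nearest neighbors:", simulate(100, 70, 100, neighbors=10))
```

With these made-up numbers, spreading the work over everyone loses only the original node, while handing it to the two nearest neighbors pushes each of them over capacity and the failure marches around the whole ring. Raising the neighbor count to ten is enough to stop the cascade at this load level, which is the point of the parenthetical above.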

This possibility probably would not have occurred to me if I had not read Google Mail's description of their recent outage (although I'm sure it's well known to experienced people in this field). Thus, the title of this entry is not sarcastic; Google's willingness to describe what happened led to my learning something potentially quite useful (or perhaps 'becoming conscious of it' is a better description).

(Hence my quite generic description of the problem, since I think it can happen in any system with these characteristics. Distributed systems without fast work reassignment and some sort of load cutoff may be especially at risk, but I suspect that this also comes up in situations like scheduling processes and allocating memory on NUMA machines where CPU modules can be unplugged.)



Last modified: Thu Feb 26 01:14:18 2009