What I learned from Google Mail's recent outage

February 26, 2009

Suppose you have a system with nodes and work items. You have to assign work items to nodes somehow; one way to do it is to randomly distribute work items around the nodes, but another is to assign them based on some sort of fixed-outcome affinity function (like 'is topologically nearest'). Now consider what happens when a node overloads or fails (or is just taken out of service) and its work has to be reassigned to new nodes.
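
To make the two assignment styles concrete, here is a minimal sketch in Python. The node names, the positions on a line, and the 'nearest live node' rule are all made up for illustration; 'nearest' here just stands in for whatever fixed-outcome affinity function a real system uses.

    import random

    # Toy model: nodes sit at fixed positions on a line, and so do work
    # items. Everything here is invented purely to show the contrast.
    node_pos = {"n0": 0.0, "n1": 1.0, "n2": 2.0, "n3": 3.0, "n4": 4.0}

    def assign_random(item_pos, live_nodes):
        # Random placement: any live node may get the item, regardless
        # of where the item 'is'.
        return random.choice(live_nodes)

    def assign_affinity(item_pos, live_nodes):
        # Fixed-outcome affinity: always the nearest live node, so the
        # answer only changes when that particular node disappears.
        return min(live_nodes, key=lambda n: abs(node_pos[n] - item_pos))

If n2 goes away, every item whose nearest node was n2 now maps to n1 or n3 and nowhere else; under assign_random the same items would be scattered across all of the survivors.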

In a random-assignment system, the failed node's work is smeared broadly over all of the remaining nodes; each node only has to absorb a little bit of extra work. But in a fixed-affinity system, you are going to assign all of the work from the failed node to only a few nodes, the nodes that are 'closest' to the failed node. This will add significant load to them and may push one of those nodes into an overload failure; if this happens it adds yet more load to the remaining nearby nodes, and suddenly you have a cascade failure marching through your system.

(The more neighbors each node has the better, here, and conversely the fewer it has the more likely an overload is to happen.)
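
Here is a toy simulation of that cascade, again with invented numbers (ten nodes, each carrying 70 units of load against a capacity of 100); nothing in it comes from Google's description, it just illustrates the mechanism.

    def cascade(load, capacity, first_failure, spread_to_neighbours):
        """Return the set of nodes that end up failed once overloads settle.

        Nodes are integers on a line; a node fails when its load exceeds
        its capacity. All of the numbers are illustrative.
        """
        failed = {first_failure}
        pending = [first_failure]
        while pending:
            f = pending.pop()
            survivors = [n for n in load if n not in failed]
            if not survivors:
                break
            if spread_to_neighbours:
                # Affinity-style reassignment: only the two nearest
                # survivors absorb the failed node's work.
                targets = sorted(survivors, key=lambda n: abs(n - f))[:2]
            else:
                # Random/even reassignment: every survivor absorbs a little.
                targets = survivors
            share = load[f] / len(targets)
            load[f] = 0
            for t in targets:
                load[t] += share
                if load[t] > capacity[t]:
                    failed.add(t)
                    pending.append(t)
        return failed

    # Ten nodes, each at 70 units of load with a capacity of 100.
    loads = {i: 70.0 for i in range(10)}
    caps = {i: 100.0 for i in range(10)}
    print(cascade(dict(loads), caps, 0, spread_to_neighbours=True))   # every node fails
    print(cascade(dict(loads), caps, 0, spread_to_neighbours=False))  # only node 0 fails

With only two neighbours absorbing the load, each reassignment tips the recipients over their capacity and the failure marches down the line; spreading the same 70 units over all nine survivors leaves each of them comfortably under capacity.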

This possibility is probably not something that would have occurred to me until I read Google Mail's description of their recent outage (although I'm sure it's well known to experienced people in this field). Thus, the title of this entry is not sarcastic; Google's willingness to describe this led to me learning something potentially quite useful (or perhaps becoming conscious of it is a better description).

(Hence my quite generic description of the problem, since I think it can happen in any system with these characteristics. Distributed systems without fast work reassignment and some sort of load cutoff may be especially at risk, but I suspect that this also comes up in situations like scheduling processes and allocating memory on NUMA machines where CPU modules can be unplugged.)


Comments on this page:

From 80.99.93.127 at 2009-02-26 07:42:16:

there was a routine maintenance event in one of our European data centers.

Not closely related to affinity in the same sense of the word, but it's another maintenance problem: if you are a global entity with geographically dispersed equipment and personnel, you can choose when and where to do your maintenance. Is it a good idea to make major changes in your European data center during European working hours? If the people responsible for maintaining it are there as well, it seems logical, because that is when they are awake. But considering the impact, it might not be.

-- Janos http://farkas.ch/

By cks at 2009-02-26 17:36:33:

My short answer is that if you have reliable procedures to transparently take things out of service and put them back in then yes, absolutely you should do maintenance during normal working hours when people are actually awake and fully functional. From all reports Google has such procedures and routinely does just this.

From 99.236.197.201 at 2009-02-27 08:30:44:

Sounds like Google should be talking to the folks who handled the power outage of 2003; their problems would seem to be very similar. Maybe they can hire some HydroOne types who got bought out.

MikeP
