Some thoughts from a close view of catastrophe

September 30, 2011

The Computer Science department (who I work for) is spread out over three buildings, which means that we have switch racks in wiring closets or machine rooms in all three buildings. For historical reasons, our switch racks in the smallest building are in an Electrical Engineering machine room. EE mostly uses the machine room to house some research clusters and computing, but they also have their switches for the building in a rack by ours in the corner. The weekend before last, a series of events led to this machine room's air conditioning failing and then a single ceiling water sprinkler activating. The sprinkler apparently ran for at least 45 minutes before it was shut off and, for extra points, the power was live in the room while this was happening.

(Also in the room at the time was a just-purchased, just-barely-turned-on half rack of compute servers that belongs to one of our research groups.)

The EE department's building switches avoided the water entirely by about an inch. Most of our switches got wet and some of them died (we poured water out of a number of them, some of which seem to actually still work now that they've dried out). But the research machines were drenched (in fact often literally flooded), especially one densely packed full-height rack of compute servers that was basically in the worst possible location relative to the sprinkler and its flood of gunk- and grunge-laden water.

For the EE networking people this is a narrow escape from a bad situation (they have two relatively high-end, relatively expensive routing switches in the room). For us it's unpleasant, but we have spares (well, had spares; many of them are now deployed).

For the researchers it's been catastrophic. It's now almost two weeks since the incident and their machines are still off, just sitting there. At least some of them may never be powered on again. A lot of their hard drives are probably dead, along with some unknown amount of other equipment like switches and KVMs. It's almost certain to be more weeks before there's any prospect of reassembling a running cluster. In a very real way they've lost the entire machine room for weeks.

As I've watched things unfold and periodically gone by the machine room to see that things are still powered off, I can't help but think uncomfortable thoughts. Our machine room is about a hundred yards away from this machine room. It has sprinklers, many of them at least as old as the one that activated. This could have happened to us.

We could be looking at all of our central machines being down for weeks: all of our fileservers, all of our backends, all of our login servers, all of the firewalls and routers, everything. We could be looking at trying to glue together some sort of vaguely functional environment on a crash basis using whatever spare hardware we can scrounge up or beg from people. We could be trying to prioritize what services come back and who gets their data restored versus who has to wait until we have enough disk space to hold it all.

I look at the EE machine room, and I can't help but think 'thank whatever that it's not us'.

