Wandering Thoughts archives

2011-09-30

Some thoughts from a close view of catastrophe

The Computer Science department (who I work for) is spread out over three buildings, which means that we have switch racks in wiring closets or machine rooms in all three buildings. For historical reasons, our switch racks in the smallest building are in an Electrical Engineering machine room. EE mostly uses the machine room to house some research clusters and computing, but they also have their switches for the building in a rack by ours in the corner. The weekend before last, a series of events led to this machine room's air conditioning failing and then a single ceiling water sprinkler activating. The sprinkler apparently ran for at least 45 minutes before it was shut off, and for extra points the power was live in the room while this was happening.

(Also in the room at the time was a just purchased, just barely turned on half rack of compute servers that belongs to one of our research groups.)

The EE department's building switches avoided the water entirely by about an inch. Most of our switches got wet and some of them died (we poured water out of a number of them, some of which seem to actually still work now that they've dried out). But the research machines were drenched (in fact often literally flooded), especially one densely packed full-height rack of compute servers that was basically in the worst possible location relative to the sprinkler and its flood of gunk- and grunge-laden water.

For the EE networking people this is a narrow escape from a bad situation (they have two relatively high end, relatively expensive routing switches in the room). For us it's unpleasant, but we have spares (well, had spares; many of them are now deployed).

For the researchers it's been catastrophic. It's now almost two weeks since the incident and their machines are still off, just sitting there. At least some of them may never be powered on again. A lot of their hard drives are probably dead, along with some unknown amount of other equipment like switches and KVMs. It's almost certain to be more weeks before there's any prospect of reassembling a running cluster. In a very real way they've lost the entire machine room for weeks.

As I've watched things unfold and periodically gone by the machine room to see that things are still powered off, I can't help but think uncomfortable thoughts. Our machine room is about a hundred yards away from this machine room. It has sprinklers, many of them at least as old as the one that activated. This could have happened to us.

We could be looking at all of our central machines being down for weeks; all of our fileservers, all of our backends, all of our login servers, all of the firewalls and routers, everything. We could be looking at trying to glue together some sort of vaguely functional environment on a crash basis using whatever spare hardware we can scrounge up or beg from people. We could be trying to prioritize what services come back and who gets their data restored versus who has to wait until we have enough disk space to hold it all.

I look at the EE machine room, and I can't help but think 'thank whatever that it's not us'.

sysadmin/DisasterViewReflections written at 23:09:21

Unit testing by analogy to scientific hypotheses

In the popular and currently dominant view of how to consider whether something is a proper scientific hypothesis, an important criterion is falsifiability. To simplify a great deal, you test a scientific hypothesis not just by looking for what it says should be there but also by looking for what it says should not be there. If the hypothesis is 'all swans are white' you don't just look for white swans, you also look for ones that are not white.

Let us consider a theoretical function that returns True if a number is a prime (and False if it is not). We need to write a test for this function, so we fire up an editor:

def testPrimeness():
  for i in 2, 3, 5, 7, 883:
    mustBeTrue(isprime(i))

We're done, right? (Ignoring that this is only a short list of primes.)

No, not at all. What we've done is the testing equivalent of only looking for white swans. We need to also see if there are any black swans around by testing to see if the function returns False for numbers that are not prime.

Another way to look at this is that we are implicitly testing the wrong hypothesis. The hypothesis that this test checks is that isprime() returns True for prime numbers, but this is not the correct hypothesis; the actual specification is that it returns True only for prime numbers. Although it's not literally the case, we have essentially formed a non-falsifiable hypothesis without noticing and are cheerfully testing it.

It's my gut feeling that this is a relatively easy testing mistake to fall into. It's human nature (or at least our cognitive biases) to look for confirmation of what we think is the case, so we verify that isprime() returns True for primes and forget the other half of the specification.

There's a variant of this hypothesis falsification approach for test planning. One way to form tests is to imagine a whole series of hypotheses about how the function might work incorrectly and then attempt to falsify each one of them with a test. For example, I have two such falsification checks in the list of test primes (2 and 883), and a test series for mustBeFalse(isprime(n)) would likely throw in testing odd numbers as well as even ones.

(Checking the proper handling of corner cases is one common instance of this.)
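One way to sketch this sort of test planning is to write down each imagined bug next to an input that would expose it. The specific bug hypotheses, the FALSIFICATION_CASES table, the exhibited_bugs() helper, and both implementations below are my assumptions for illustration; none of them come from a real test suite.

```python
def isprime(n):
    # Assumed reference implementation: plain trial division.
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def odd_means_prime(n):
    # Hypothetical buggy implementation: 2 and every odd number are 'prime'.
    return n == 2 or n % 2 == 1

# Each entry is (imagined bug, input that would expose it, correct answer).
FALSIFICATION_CASES = [
    ("rejects every even number", 2, True),
    ("accepts even composites", 4, False),
    ("accepts every odd number", 9, False),
    ("accepts every odd number", 15, False),
    ("treats 0 as prime", 0, False),
    ("treats 1 as prime", 1, False),
    ("stops dividing before reaching sqrt(n)", 25, False),
    ("stops dividing before reaching sqrt(n)", 49, False),
]

def exhibited_bugs(candidate):
    # Return the imagined bugs that the candidate's answers are consistent with.
    found = set()
    for description, n, expected in FALSIFICATION_CASES:
        if candidate(n) is not expected:
            found.add(description)
    return sorted(found)

if __name__ == "__main__":
    print("isprime:", exhibited_bugs(isprime))
    print("odd_means_prime:", exhibited_bugs(odd_means_prime))
```

A single real bug can trip several of the imagined ones at once (here the squares of primes are also odd), so the descriptions are hints about where to look rather than a diagnosis.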

This is of course closely related to testing your error paths, and I've probably written about bits of it in passing in other entries that I can't find right now.

Update: corrected an embarrassing error in my test. You can read about it in the comments.

programming/FalsifiableUnitTests written at 00:12:58



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.