2012-03-23
Sometimes you get lucky
We had a power failure today in the building that houses our main machine room (and thus all of our core servers). When we realized what was going on and got to the machine room, we made an extremely unpleasant discovery: as far as we can tell, the automatic transfer switches in front of our UPSes, well, didn't transfer. Instead they all entered some sort of faulted state where they provided no power.
(The UPSes all at least claimed to have a good battery charge and not to have run down, although it's possible they were lying and had all died by the time we got there despite appearing healthy.)
This was very bad. The automatic transfer switches and UPSes are our primary defense against ZFS metadata corruption during power failures; with them non-functional we were completely exposed. When the power returned, we restarted the fileservers one by one and held our breath as each ZFS pool and its filesystems came up. In the end, all of our pools survived.
(We expect to find a number of repairable checksum errors when we scrub all pools this weekend.)
I have no great lesson here beyond what's in the title: sometimes you get lucky. Good luck happens just as much as bad luck, and it's what we had today.
(If you want a small lesson, I think it's that testing what will happen in a real power failure may be surprisingly hard. Real power failures seem to involve all sorts of things happening to the line power that are not necessarily much like what happens when you pull out a power cord or flip a PDU master power switch.)