2019-07-13
Our switches can wind up in weird states after a power failure
We've had two power failures so far this year, which is two more than we usually have. Each has been a learning experience, because both times around our overall environment failed to come back up afterward. The first time around the problem was DNS, due to a circular dependency that we still don't fully understand. The second time around, what failed was much more interesting.
Three things failed to come back up after the second power failure. The more understandable and less fatal problem was that our OpenBSD external bridging firewall needed some manual attention to deal with a fsck issue. By itself, this just cut us off from the external world. Much worse, two of our core switches didn't fully boot up; instead, they stopped in their bootloader and waited for someone to tell them to continue. Since the switches didn't boot and apply their configuration, they didn't light up their ports, and none of our leaf switches could pass traffic around. The net effect was to create little isolated pools of machines, one pool per leaf switch.
(Then naturally most of these pools didn't have access to our DNS servers, so we also had DNS problems. It's always DNS. But no one would have gotten very far even with DNS, because all of our fileservers were isolated on their own little pool on a 10G-T switch.)
We've never seen this happen before (and it certainly didn't happen in prior power outages and scheduled shutdowns), so we've naturally theorized that the power failure wasn't a clean one (either when power was lost or when it came back) and that this did something unusual to the switches. It's more comforting to think that something exceptional happened than to think that this possibility is always lurking, even in clean power loss and power return situations.
(While we shut down all of our Unix servers in advance for scheduled power shutdowns, we've traditionally left all of our switches powered on and just assumed that they'd come back cleanly afterward. We probably won't change that for the next scheduled power shutdown, but we may start explicitly checking that the core switches are working right before we start bringing servers up the next day.)
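(As an illustration of what that check might look like, here's a minimal sketch in Python. It assumes the core switches' management interfaces answer SSH on port 22, and the switch names are hypothetical stand-ins, not our real ones; a real version would use whatever names and checks fit your environment.)

    #!/usr/bin/env python3
    # Sketch: verify that the core switches answer on their management
    # interfaces before we start bringing servers back up. The switch
    # names here are hypothetical placeholders.
    import socket
    import sys

    CORE_SWITCHES = ["core-sw1.example.org", "core-sw2.example.org"]

    def reachable(host, port=22, timeout=5.0):
        """Return True if we can open a TCP connection to host:port."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    down = [sw for sw in CORE_SWITCHES if not reachable(sw)]
    if down:
        print("core switches not answering:", ", ".join(down))
        print("check their serial consoles before bringing servers up")
        sys.exit(1)
    print("all core switches answered on their management interfaces")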
That we'd never seen this switch behavior before also complicated our recovery efforts, because we initially didn't recognize what had gone wrong with the switches, or even what the problem with our network was. Even once my co-worker recognized that something was anomalous about the switches, it took a bit of time to figure out the right step to resolve it (in this case, telling the switch bootloader to go ahead and boot the main OS).
(The good news is that the next time around we'll be better prepared. We have a console server that we access the switch consoles through, and it supports informational banners when you connect to a particular serial console. The consoles for the switches now have a little banner to the effect of 'if you see this prompt from the switch it's stuck in the bootloader, do the following'.)
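(For what it's worth, here's a sketch of how such a per-console banner can be set up, using conserver's configuration syntax purely as an example; your console server may speak something different, and the console name and serial device here are made up.)

    # conserver.cf sketch: attach a warning banner (motd) to one switch's
    # serial console. The console name and device are hypothetical.
    console core-sw1 {
        master localhost;
        type device;
        device /dev/ttyS0;
        baud 9600;
        parity none;
        motd "Bootloader prompt here means the switch is stuck; tell it to boot the main OS.";
    }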
PS: What's likely booting here is the switch's management processor. But the actual switching hardware has to be configured by the management processor before it lights up the ports and does anything, so we might as well talk about 'the switch booting up'.