Sometimes a problem really is just a coincidence
This past Wednesday, we did some maintenance, including precautionary reboots of some of our OpenBSD firewalls. All of our firewalls actually are pairs of machines, one of which is the active firewall and the other of which is the inactive spare (which is at most running pfsync, and is not on the live networks or doing anything). One of the first firewalls we rebooted was what was supposed to be the inactive spare of the bridging firewall that sits between our networks and the rest of the university. Less than a minute after the reboot was initiated, our monitoring system was screaming that we had basically lost all connectivity to the outside world.
Naturally people went digging to try to understand what had happened. We had not accidentally rebooted the live firewall instead of the inactive spare (an easier mistake to make with a bridging firewall than with a routing one), the reboot didn't seem to have somehow influenced the live firewall, our core router had not seen the interface status change, and so on and so forth. Later, we examined our reachability metrics in more detail (including data from an outside perspective) and became even more confused, especially since the reachability data from outside showed that we'd had problems accessing some things not even behind our bridging firewall.
I'll jump to the punchline: it was a coincidence. The overall university network had had some problems that happened to start only very shortly before the reboot of the inactive spare firewall (and by 'only very shortly' I mean less than 60 seconds before the reboot started). There may also have been a small power fluctuation in the building at around the same time, too. If the overall networking problems had dragged on the coincidence would have been more obvious, but instead they faded out within about six minutes of the inactive spare firewall being back up, which was well within the time period where the co-worker actually in the office was poking around at things and trying to figure out what was going on.
It wasn't necessarily wrong of us to immediately assume that the reboot of a firewall was the cause and to look into things around it; the sysadmin's version of Occam's Razor is that if you just did something and a problem shows up, your action is the most likely cause. Often it really is the cause. But not always, as we saw this time, so if things don't seem to make sense maybe we should also start thinking about possible alternate explanations (and where we'd find evidence for or against them).
(In this case, there was nothing we could do to fix the problem since it was outside of our network, so the time spent poking around didn't delay resolving the issue.)
(This elaborates on a tweet of mine.)
|
|