A fun tale of network troubleshooting involving VLANs and MACs
The following is not my story; it comes from my co-workers, who had the real fun of trying to figure this one out and then finding a fix.
To start with, let's set the background. We have an (OpenBSD) routing firewall machine that sits on a network segment whose egress router is not under our control. Actually, we have two of them, one active and one as a warm spare (it's on and being updated, but it is not connected to any of the production networks because otherwise it would fight the live firewall for the public IPs). A while back, as part of trying to fail over from the live machine to the warm spare, we discovered that the egress router for the network caches ARP information for a long time. Like, apparently, hours. This was obviously no good for being able to switch over (such as in the case of hardware failure). Since the egress router is not under our control, the only thing we could really do was explicitly set the warm spare to have the same Ethernet address as the active machine.
(This was tested at the time it was set up and worked, but we believe the test at the time was misleading.)
Recently my co-workers wanted to swap from the active machine to the warm spare, because the active machine had been up for literally years (we don't update OpenBSD all that often). Unfortunately, when they made the swap the (ex-)warm spare was not reachable on its public IP, so they failed back to the active machine and took the warm spare off for testing. Testing established that the warm spare was showing 'incomplete' for other machines in its ARP cache, although other machines picked it up fine for their ARP caches. Further, trying to inspect the traffic with tcpdump made the network suddenly work, but things broke again when they stopped tcpdump. Oh, and the problem was specific to using our preferred Intel Ethernet cards; if the warm spare was switched to use non-Intel network hardware, everything worked.
Now, it happens that this machine has a slightly unusual network configuration. Because it needs to talk to a number of external networks, it actually gets all of its external networks as tagged VLANs over a single physical network port. When we changed the machine to use the MAC of the active machine, we had set the Ethernet address on the VLAN for that particular network, because that was the network that mattered; we didn't change the MAC of anything else.
It turned out that this was the problem. Using Intel cards on our (old) version of OpenBSD, when the MAC of the VLAN differed from the MAC of the underlying physical interface and the interface was not in promiscuous mode, ARP (at least) didn't work because the kernel apparently never received the replies to its ARP queries. If you put the interface into promiscuous mode, such as by running tcpdump, things suddenly worked; the kernel received ARP replies and so on. We think that the whole setup worked when tested because we likely tested it with tcpdump running to watch traffic (and verify what MACs were being used).
(The obvious suspect here is hardware level receive filtering; perhaps the hardware is only being set by the driver to recognize the physical port MAC as its MAC. This is a driver and/or hardware issue, but these things happen.)
Once my co-workers figured out what the problem was, the fix was simple: explicitly set the MACs of both the physical port and all the VLANs on it to the active machine's MAC. But getting there took a whole frustrating and puzzling journey. This wasn't exactly a Heisenbug, but until my co-workers noticed the pattern that running tcpdump made it disappear it did look like one.
tcpdump -p' is the obvious thing for the future, but I
don't know if it would actually have worked in this situation.
Still, it's something to try to remember for the next time around.
Maybe tcpdump should default to
-p these days.)