Network loops can have weird effects (at least sometimes)
Today we had a weird network problem on one of our most crucial networks, our port-isolated user machine network; this is the wired network used to connect laptops, most machines in people's offices, and so on. The only failure we could really see was that when the gateway firewall sent out a (broadcast) ARP request for a given IP, it would not see the (unicast) ARP reply from your machine. If your machine did something that caused the gateway to pick up its MAC, everything worked. Manually delete the ARP entry on the gateway, and the problem would be back. And rarely (often taking many minutes) an ARP reply would make it to the gateway and poof, everything would work again for your machine for a while until your ARP entry fell out of the gateway's ARP table.
There were several oddities about this. The biggest is that only ARP replies were affected; you could, for example, ping back and forth between your machine and elsewhere as long as the gateway had you in its ARP table. Nor did we see any unusual network traffic during this. We've seen our networks melt down on occasion (including this one), with things like traffic floods, rogue DHCP servers, and packet echoes, but nothing odd showed up in tcpdump from multiple vantage points. If anything, maybe there was less extraneous broadcast babbling than usual.
Given 'some packets are vanishing', we initially suspected malfunctioning switches; we've seen various downright weird things when this happens. So we swapped in spares for the core top level switches (they were basically the only common point in the switch fabric between all of the machines that were seeing problems) and of course nothing happened. It wasn't the gateway, because we could reproduce the problem with a number of other machines in the same top level network position (such as the DHCP server). We scratched our heads a lot, or at least I did, and eventually resorted to brute force instead of trying to come up with theories about what had broken how: as I mentioned on Twitter, we started systematically disconnecting bits of the network from the top down to see what had to be connected to make things go wrong.
As you already know from the title of this entry, the problem turned out to be a network loop. At the very periphery of the network (in one of the department's office areas), someone had plugged a little 5-port switch into two network drops at once, thereby creating a loop between two ports on one of our wiring closet leaf switches. This simple single-switch cross-connect was the root cause of all of our network problems.
Looking back at it after the fact, I can construct a theory about how this cross-connect caused the observed problems (although I have no idea if it's correct). But at the time I wouldn't have at all expected to see these symptoms from a network loop. So my moral for today is that the symptoms of network loops can be quite weird and not what I expect at all.
(For reasons beyond the scope of this entry, we do not have STP enabled on our switches. Under normal circumstances it's unnecessary, as all of our networks are strict (acyclic) trees.)
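(As an illustration of what 'strict tree' means in practice, here is a minimal Python sketch that takes a list of switch-to-switch links and reports any link that closes a cycle, using a union-find structure. The link list and device names are hypothetical. A check like this only sees the links you know about, so it would not have caught an unrecorded little switch in someone's office unless it had first been discovered some other way.)

```python
from typing import Dict, List, Tuple

def find_redundant_links(links: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return every link that joins two devices that are already connected.
    In a strict (acyclic) tree this list is empty."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        # Find the representative of x's connected group, with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    redundant = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            # a and b were already reachable from each other: this link is a loop.
            redundant.append((a, b))
        else:
            parent[root_a] = root_b
    return redundant

# Hypothetical example: a small desk switch plugged into two drops
# that both lead back to the same leaf switch.
links = [
    ("core", "leaf"),
    ("leaf", "desk-switch"),
    ("leaf", "desk-switch"),
]
print(find_redundant_links(links))  # [('leaf', 'desk-switch')]
```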