Network loops can have weird effects (at least sometimes)

January 15, 2016

Today we had a weird network problem on one of our most crucial networks, our port-isolated user machine network; this is the wired network used to connect laptops, most machines in people's offices, and so on. The only failure we could really see was that when the gateway firewall sent out a (broadcast) ARP request for a given IP, it would not see the (unicast) ARP reply from your machine. If your machine did something that caused the gateway to pick up its MAC, everything worked. Manually delete the ARP entry on the gateway, and the problem would be back. And rarely (often taking many minutes) an ARP reply would make it to the gateway and poof, everything would work again for your machine for a while until your ARP entry fell out of the gateway's ARP table.

There were several oddities about this. The biggest is that only ARP replies were affected; you could, for example, ping back and forth between your machine and elsewhere as long as the gateway had you in its ARP table. Nor did we see any unusual network traffic during this. We've seen our networks melt down on occasion (including this one), with things like traffic floods, rogue DHCP servers, and packet echoes, but nothing odd showed up in tcpdump from multiple vantage points. If anything, there may have been less extraneous broadcast babbling than usual.

Given 'some packets are vanishing', we initially suspected malfunctioning switches; we've seen various downright weird things when this happens. So we swapped in spares for the core top level switches (they were basically the only common point in the switch fabric between all of the machines that were seeing problems) and of course nothing happened. It wasn't the gateway, because we could reproduce the problem with a number of other machines in the same top level network position (such as the DHCP server). We scratched our heads a lot, or at least I did, and eventually resorted to brute force instead of trying to come up with theories about what had broken how: as I mentioned on Twitter, we started systematically disconnecting bits of the network from the top down to see what had to be connected to make things go wrong.

As you already know from the title of this entry, the problem turned out to be a network loop. At the very periphery of the network (in one of the department's office areas), someone had plugged a little 5-port switch into two network drops at once, thereby creating a loop between two ports on one of our wiring closet leaf switches. This simple single-switch cross-connect was the root cause of all of our network problems.

Looking back at it after the fact, I can construct a theory about how this cross-connect caused the observed problems (although I have no idea if it's correct). But at the time I wouldn't have at all expected to see these symptoms from a network loop. So my moral for today is that the symptoms of network loops can be quite weird and not what I expect at all.

(For reasons beyond the scope of this entry, we do not have STP enabled on our switches. Under normal circumstances it's unnecessary, as all of our networks are strict (acyclic) trees.)
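
(As an aside, here is a minimal sketch of what 'strict (acyclic) tree' means in checkable terms. It is not something we actually run; it assumes you already have the switch-to-switch links written down as pairs, and the switch names are made up for illustration. It uses a small union-find to report the first link that would close a loop.)

```python
def find_loop_link(links):
    """Return the first link that closes a cycle, or None if there is no loop.

    'links' is a list of (switch_a, switch_b) pairs. The names are
    hypothetical labels, not anything read from real switches.
    """
    parent = {}

    def find(node):
        # Find the representative of the group this switch is already in.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            # Both ends are already connected by some other path,
            # so adding this link creates a loop.
            return (a, b)
        parent[root_a] = root_b
    return None


# Hypothetical topology: a core switch, two wiring closet switches,
# and a little office switch hanging off one closet switch.
tree = [
    ("core", "closet-1"),
    ("core", "closet-2"),
    ("closet-1", "office-5port"),
]

# The same topology with the office switch plugged into a second drop
# on the same closet switch, which is the kind of cross-connect we had.
looped = tree + [("closet-1", "office-5port")]
```

(Calling find_loop_link(tree) finds nothing, while find_loop_link(looped) points at the second closet-1 to office-5port link. Of course this only helps if you know about the link in the first place, which for a switch someone brought in themselves we did not.)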

Written on 15 January 2016.


Last modified: Fri Jan 15 23:36:18 2016