My theory on how our network loop caused the problem we observed

January 17, 2016

Yesterday I described how a network loop on a wiring closet leaf switch in our port isolated network caused replies to the gateway's ARP requests to usually disappear (although not always). This is a fairly weird and mysterious symptom, but as it happens I have a wild theory about why (or how) it did.

The big characteristic of a port isolated network is that in fact most unicast traffic is supposed to disappear if you try to send it out. Hosts inside the port isolated network are only allowed to talk to hosts outside, while traffic to other hosts inside is dropped. Mechanically this is implemented inside the switches by rules on what ports are allowed to talk to each other with unicast traffic; non-uplink ports can only talk to the uplink port, while the uplink port can talk to anything.

(Broadcast traffic is flooded through the entire network, as is traffic for unknown MACs.)
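As a rough illustration of that rule (not how any particular switch implements it), the port-to-port check for unicast traffic can be written as a tiny function. The port names here, including `UPLINK`, are made up for the sketch.

```python
# Hypothetical port name; real switches identify ports differently.
UPLINK = "uplink"

def unicast_allowed(in_port: str, out_port: str) -> bool:
    """Port isolation rule for unicast traffic on one switch.

    A frame may cross the switch only if it enters or leaves through
    the uplink port. Traffic between two non-uplink ports is dropped.
    """
    if in_port == out_port:
        # A switch never sends a frame back out the port it arrived on.
        return False
    return in_port == UPLINK or out_port == UPLINK

inside_to_outside = unicast_allowed("port3", UPLINK)
outside_to_inside = unicast_allowed(UPLINK, "port3")
inside_to_inside = unicast_allowed("port3", "port4")
```

Only the last case is refused, which is the 'inside to inside' drop described above.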

This creates a very simple way to cause unicast traffic to be dropped on a port isolated network: do something to cause the network to believe that the destination MAC is an 'inside' host. So how do you do that? Well, switches learn MAC associations based on which port they see inbound traffic from a MAC on. So suppose you have a network loop at the bottom of your network hierarchy, and an 'outside' host sends out a broadcast packet. The packet will cascade down your tree, with each switch learning that the MAC is found on the uplink port, but then it bottoms out at the loop and gets re-injected into a leaf switch. As we've seen, this causes the leaf switch to change its MAC to port association; it then passes the broadcast back out its uplink port, and the packet re-floods through the entire network with all switches flipping their MAC association to 'oh, it's coming from an internal port'. If an internal host sends out a unicast packet to that MAC shortly afterwards (say as an almost-instant reply to an ARP request), the switches will see this as an 'inside to inside' packet and drop it due to port isolation. The next packet from the outside host will start resetting the MAC port associations in switches back to recognizing it as an outside host, although it probably won't reach all of them.
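The sequence of events in this theory can be walked through with a toy model. This is only a sketch of the hypothesis, under the assumption that a switch simply overwrites its MAC table entry whenever it sees a source MAC on a new port. The `Switch` class, the port names, and the MAC label are all invented for illustration.

```python
UPLINK = "uplink"   # hypothetical port name
FLOOD = "flood"     # marker for 'destination MAC unknown, would be flooded'

class Switch:
    """Toy switch with MAC learning and unicast port isolation."""

    def __init__(self, name):
        self.name = name
        self.mac_table = {}  # MAC -> port it was last seen on

    def learn(self, src_mac, in_port):
        # Assumption: the most recent sighting always wins.
        self.mac_table[src_mac] = in_port

    def unicast_out_port(self, in_port, dst_mac):
        """Return the output port, FLOOD for an unknown MAC, or None if dropped."""
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            return FLOOD
        if out_port == in_port:
            return None
        if in_port != UPLINK and out_port != UPLINK:
            return None  # inside to inside: dropped by port isolation
        return out_port

GATEWAY = "gw-mac"
top, leaf = Switch("top"), Switch("leaf")

# 1. The gateway's broadcast cascades down the tree via uplink ports.
top.learn(GATEWAY, UPLINK)
leaf.learn(GATEWAY, UPLINK)
before_loop = leaf.unicast_out_port("host-port", GATEWAY)

# 2. The loop re-injects the broadcast on an inside port of the leaf,
#    which then passes it back up to the top switch.
leaf.learn(GATEWAY, "loop-port")
top.learn(GATEWAY, "leaf-port")
reply_at_leaf = leaf.unicast_out_port("host-port", GATEWAY)
reply_at_top = top.unicast_out_port("other-leaf-port", GATEWAY)

# 3. The next packet from the gateway flips the associations back.
top.learn(GATEWAY, UPLINK)
leaf.learn(GATEWAY, UPLINK)
after_reset = leaf.unicast_out_port("host-port", GATEWAY)
```

In this model an ARP reply sent between steps 2 and 3 is dropped at whichever switch it reaches first, while a reply sent before the loop or after the reset goes out the uplink as usual.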

This is a nice theory, but what I don't have an explanation for is why the network didn't blow up with endlessly repeated broadcast packets (such as broadcast ARP requests or DHCP queries). Looking back we saw some things that might have been signs of repeated packets, but certainly there was no flood of traffic; that would have been a glaringly damning sign right off the bat.

(If this explanation is correct, it also suggests a monitoring measure. If you can monitor MAC to port associations on your top level port isolated switch, just alarm on the MAC of any 'outside' machine switching to an 'inside' port, or more generally any MAC association flipping back and forth between the uplink port and a non-uplink port. Sadly I suspect that we can't do this on our current switches.)
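A minimal sketch of that alarm, assuming you can somehow obtain a time-ordered stream of (MAC, port) observations from the top level switch. How you get them is outside this sketch; the function name, the port names, and the sample data are hypothetical.

```python
def find_uplink_flaps(observations, uplink="uplink"):
    """Return MACs that moved between the uplink port and a non-uplink port.

    `observations` is an iterable of (mac, port) pairs in time order.
    Moves between two non-uplink ports are ignored; only a change of
    'side' (uplink versus inside) counts as a flap.
    """
    last_on_uplink = {}  # MAC -> True if last seen on the uplink port
    flapped = []
    for mac, port in observations:
        on_uplink = (port == uplink)
        previous = last_on_uplink.get(mac)
        if previous is not None and previous != on_uplink and mac not in flapped:
            flapped.append(mac)
        last_on_uplink[mac] = on_uplink
    return flapped

# Made-up sample data: the gateway briefly appears on an inside port,
# while host1 merely moves between two inside ports.
sample = [
    ("gw-mac", "uplink"),
    ("host1-mac", "port3"),
    ("gw-mac", "port7"),
    ("host1-mac", "port4"),
    ("gw-mac", "uplink"),
]
alerts = find_uplink_flaps(sample)
```

Here only the gateway's MAC is reported, since it is the only one that changed sides.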


Comments on this page:

By Ewen McNeill at 2016-01-17 03:33:55:

As a possible theory, broadcast packets may be software switched in the port isolated case (at least those that, eg, already need to be examined for MAC learning). This would potentially have the effect that there is a rate limit on the forwarding of things like ARP. Which in turn would stop a broadcast storm from saturating the network. (A number of modern switches also do rate limit broadcast packets.) I know that I found a client network with circulating broadcasts that were looping (despite spanning tree: a different, unknown, type of spanning tree for the devices in use), and they were noticeable but not saturating the network. Plus there was a noticeable switch CPU usage drop when they were drained from the network.

FWIW on some switches it's possible to get, eg, the MAC table via SNMP. Possibly that could be monitored for some stable MACs -- eg, the gateways.

Ewen

what I don’t have an explanation for is why the network didn't blow up with endlessly repeated broadcast packets

I think you just gave the reason why? I.e.: because of the port isolation. That’s if I’ve properly followed your explanation.

As I understood your hypothesis, it implies that every broadcast packet that makes it through the network will automatically get the sending port marked as “inside” – since every such packet will eventually reach the loop and get reinjected. And once it’s marked as coming from inside, it will get dropped, just like any inside-to-inside broadcast, killing off any further repetition.

So you should expect broadcast traffic to be amplified no more than at most twice.

No?

By cks at 2016-01-18 11:03:41:

Aristotle is right here, as I discovered after re-checking the behavior we actually see on our network. For some reason I thought that only unicast packets were port isolated and broadcast packets from inside ports went everywhere, but this is incorrect; broadcast packets from 'inside' hosts are also not propagated to other inside hosts, only up to the outside ones. So even outside hosts will see only one extra copy of a broadcast packet from an outside host (and inside ones will see none).

(We've turned over network hardware over the past few years, so it's possible that this behavior has changed since I first looked at it years ago. It's certainly made our port isolated network much quieter for end hosts than I remember it being in the past.)

I wonder if a broader takeaway here might be that (strict?) port isolation converts some packet storm situations into vanishing packet situations.

(Sort of akin to the observation that garbage collection converts many crashes into memory leaks. Or more abstractly, that some architectural choices can convert a certain class of bug to another class with different severity, such that it changes the character of the system, in both operation and debugging.)

Written on 17 January 2016.

Last modified: Sun Jan 17 02:02:07 2016