My theory on how a network loop caused the problem we observed
Yesterday I described how a network loop on a wiring closet leaf switch in our port isolated network caused replies to the gateway's ARP requests to usually (although not always) disappear. This is a fairly weird and mysterious symptom, but as it happens I have a wild theory about how the loop caused it.
The big characteristic of a port isolated network is that in fact most unicast traffic is supposed to disappear if you try to send it out. Hosts inside the port isolated network are only allowed to talk to hosts outside, while traffic to other hosts inside is dropped. Mechanically this is implemented inside the switches by rules on what ports are allowed to talk to each other with unicast traffic; non-uplink ports can only talk to the uplink port, while the uplink port can talk to anything.
(Broadcast traffic is flooded through the entire network, as is traffic for unknown MACs.)
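As a rough illustration of these rules, here is a minimal sketch of the forwarding decision a single port isolated switch might make. It only models the behavior described above; the function name, the port names, and the representation of the MAC table as a dictionary are all made up for the example, and real switches will differ in the details.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"

def egress_ports(ingress, dst_mac, mac_table, ports, uplink):
    """Return the set of ports a frame is sent out of.

    mac_table maps a MAC to the port it was last learned on
    (a hypothetical representation, for illustration only).
    """
    others = set(ports) - {ingress}
    # Broadcasts and frames for unknown MACs are flooded to every other port.
    if dst_mac == BROADCAST or dst_mac not in mac_table:
        return others
    out = mac_table[dst_mac]
    if out == ingress:
        # The destination is on the port the frame arrived on; nothing to do.
        return set()
    # Known unicast: one end of the conversation must be the uplink port.
    if ingress != uplink and out != uplink:
        return set()
    return {out}

ports = ["uplink", "p1", "p2"]
table = {"gw-mac": "uplink", "host2-mac": "p2"}
to_gateway = egress_ports("p1", "gw-mac", table, ports, "uplink")
to_neighbor = egress_ports("p1", "host2-mac", table, ports, "uplink")
```

Here the frame for the gateway goes out the uplink port, while the frame for the neighboring inside host goes nowhere.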
This creates a very simple way to cause unicast traffic to be dropped on a port isolated network: do something to cause the network to believe that the destination MAC is an 'inside' host. So how do you do that? Well, switches learn MAC associations based on what port they see inbound traffic from a MAC on. So suppose you have a network loop at the bottom of your network hierarchy, and an 'outside' host sends out a broadcast packet. The packet will cascade down your tree, with each switch learning that the MAC is found on the uplink port, but then it bottoms out at the loop and gets re-injected into a leaf switch. As we've seen, this causes the leaf switch to change its MAC to port association; it then passes the broadcast back out its uplink port, and the packet re-floods through the entire network, with every switch it reaches flipping its MAC association to 'oh, it's coming from an internal port'. If an internal host sends out a unicast packet to that MAC shortly afterwards (say as an almost-instant reply to an ARP request), the switches will see this as an 'inside to inside' packet and drop it due to port isolation. The next packet from the outside host will start resetting the MAC port associations in switches back to recognizing it as an outside host, although it probably won't reach all of them.
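To make the sequence concrete, here is a small simulation of just the leaf switch in this theory. It is a sketch under several assumptions: the Switch class, the MAC names, and the port names are invented for the example, ports p1 and p2 are assumed to be the two ends of the loop, and the looped broadcast is deliberately followed for only one trip around the loop instead of modeling what happens if it keeps circulating.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"
UPLINK = "uplink"

class Switch:
    """A toy learning switch with port isolation (illustrative only)."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}  # MAC -> port it was last seen on

    def receive(self, ingress, src, dst):
        """Learn the source MAC, then return the set of egress ports."""
        self.mac_table[src] = ingress
        others = self.ports - {ingress}
        if dst == BROADCAST or dst not in self.mac_table:
            return others  # flood
        out = self.mac_table[dst]
        if out == ingress or (ingress != UPLINK and out != UPLINK):
            return set()  # same port, or inside-to-inside: dropped
        return {out}

GW, HOST = "gw-mac", "host-mac"  # hypothetical MACs
leaf = Switch([UPLINK, "p1", "p2", "p3"])  # p1 and p2 are cabled together

# 1. The gateway's broadcast ARP request arrives from above.
flooded = leaf.receive(UPLINK, GW, BROADCAST)
after_request = leaf.mac_table[GW]

# 2. The copy flooded out p1 comes straight back in on p2.
if "p1" in flooded:
    leaf.receive("p2", GW, BROADCAST)
after_loop = leaf.mac_table[GW]

# 3. The host on p3 sends its unicast ARP reply to the gateway.
reply_ports = leaf.receive("p3", HOST, GW)

# 4. The next frame from the gateway resets the association.
leaf.receive(UPLINK, GW, BROADCAST)
after_next = leaf.mac_table[GW]
```

In this model the gateway's MAC is learned on the uplink port, flips to p2 once the looped copy arrives, and the host's reply in step 3 is sent to no ports at all. The association only returns to the uplink port when the gateway sends something else.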
This is a nice theory, but what I don't have an explanation for is why the network didn't blow up with endlessly repeated broadcast packets (such as broadcast ARP requests or DHCP queries). Looking back we saw some things that might have been signs of repeated packets, but certainly there was no flood of traffic; that would have been a glaringly damning sign right off the bat.
(If this explanation is correct, it also suggests a monitoring measure. If you can monitor MAC to port associations on your top level port isolated switch, just alarm on the MAC of any 'outside' machine switching to an 'inside' port, or more generally any MAC association flipping back and forth between the uplink port and a non-uplink port. Sadly I suspect that we can't do this on our current switches.)
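A minimal sketch of such a check follows. It assumes you can somehow obtain periodic snapshots of the top level switch's MAC table as MAC to port dictionaries; how you would fetch those from a real switch is not covered here, and the function name and data layout are hypothetical.

```python
def uplink_flips(snapshots, uplink):
    """Count how often each MAC moved between the uplink port and any
    non-uplink port across successive MAC table snapshots.

    snapshots is an iterable of {mac: port} dictionaries, oldest first.
    Moves between two non-uplink ports are ignored.
    """
    flips = {}
    last_side = {}  # mac -> True if it was last seen on the uplink port
    for table in snapshots:
        for mac, port in table.items():
            on_uplink = (port == uplink)
            if mac in last_side and last_side[mac] != on_uplink:
                flips[mac] = flips.get(mac, 0) + 1
            last_side[mac] = on_uplink
    return flips

snapshots = [
    {"gw-mac": "uplink", "host-mac": "p3"},
    {"gw-mac": "p7", "host-mac": "p3"},      # gateway appears 'inside'
    {"gw-mac": "uplink", "host-mac": "p4"},  # and then flips back
]
suspects = uplink_flips(snapshots, "uplink")
```

With these made-up snapshots, only the gateway's MAC is reported, with two flips. Polling is a coarse tool for this; a flip that happens and reverts between two snapshots would go unseen.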