Why we don't have management connections to our switches (an old story)

May 24, 2021

One of the unusual things about our physical network implementation is that while we have managed switches, we don't have management connections to them, at least over the network (we do often have serial console connections to them). Since we don't have a network management connection, we can't use SNMP to pull data from our switches, which is somewhat inconvenient now that we have a Prometheus and Grafana setup; it would be nice to know things like per-port bandwidth and so on. While there are additional reasons to avoid networked management connections for some switches, our not using management connections has a lot to do with a bad experience with them years ago. This is the story of that bad experience.

(On our expensive 10G switches, we don't have the spare ports to dedicate one to a management connection.)

Our bad experience happened back in the era of our second network implementation, where many of our then-new 24-port switches carried multiple VLANs and split them out on a per-port basis. These switches were interconnected in a tree for regular VLANs, and their management ports were connected to an entirely physically separate management network that ran over its own simple, non-VLAN'd, single-network switch fabric using dedicated inter-building links. This setup meant that we had two paths between switches (one through the regular VLAN'd tree and one through the management network), but since the paths were on different VLANs they shouldn't have been subject to cross-talk or packet loops.

One day we noticed that packets from one subnet were showing up on another subnet. This happens from time to time and is usually caused by someone connecting two drops together when they shouldn't be, for example by plugging both into a little desktop switch. What was weird about this case was that it wasn't all packets from the subnet, it was only occasional packets. When we traced through all of the switch and port fabric, we couldn't find any crossed-over drops or ports. Even unplugging bits of the network didn't stop these occasional packets from showing up where they shouldn't be, but it did establish that they behaved very weirdly; for instance, packets generated in one building would turn up on the other subnet in a completely different building. Eventually we noticed that these packets were appearing on the management network, in addition to their two regular subnets, which gave us the necessary clue.
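(As an illustration of what "packets from one subnet showing up on another" means in practice, here is a minimal sketch of the check, not what we actually ran. The subnet names and address ranges are made up for the example, and it assumes you already have a list of source IP addresses captured on the subnet in question.)

```python
import ipaddress

# Hypothetical subnets, invented purely for this illustration.
SUBNETS = {
    "subnet-a": ipaddress.ip_network("10.1.0.0/24"),
    "subnet-b": ipaddress.ip_network("10.2.0.0/24"),
}

def stray_sources(seen_on, source_addrs):
    """Return the source addresses that do not belong to the subnet
    they were observed on (in the same order they were given)."""
    expected = SUBNETS[seen_on]
    return [a for a in source_addrs
            if ipaddress.ip_address(a) not in expected]

# Source addresses supposedly captured on subnet-b (made-up data).
observed = ["10.2.0.15", "10.1.0.77", "10.2.0.40"]
print(stray_sources("subnet-b", observed))  # prints ['10.1.0.77']
```

(A check like this only tells you that a stray packet is there. It says nothing about how the packet got there, which was the hard part in our case.)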

What was really happening was that our 24-port switches seemed to have some sort of flaw when handling their management network. If the management network was connected up, sometimes a switch would decide to take a regular untagged packet on another VLAN and inject it into the management network, and also sometimes take an untagged packet from the management network and inject it into some other VLAN. Since our management network had no regular traffic of its own, almost all of the traffic available to be switched over to other VLANs was traffic that had started out on other VLANs and had been injected into the management network. The combination of both happening to the same packet would cause a packet to apparently teleport from one place on one network in one building to a completely different place on another network in another building.

Our solution was to remove all of our switches from our management network. For the remainder of their life as multi-VLAN switches (until we shifted over to our third network implementation), they were configured either through their serial port or by connecting a laptop directly to their management port.

Ever since then, we've never connected another switch's management port to our management network. Maybe it would work these days, or maybe there are other problems lurking, waiting to be discovered. So far, we don't need to talk to our switches remotely over the network badly enough to risk it.

(For the multi-VLAN switches at the core of our networks, we do talk to their serial consoles remotely through our serial console infrastructure.)
