A tale of network horror, or at least excitement

October 16, 2009

(This story comes from my co-worker John Calvin, who told it to me some years ago; I was reminded of it by some recent local events, so it seems like a good time to put it here.)

One of the things that the central computing people here can do for departments is run their basic networking infrastructure, the switches and wiring and so on. Once upon a time, such a managed departmental switch started lighting up the monitoring system with repeated, frequent contact failures; when the monitoring system went to poll the switch, it often wouldn't respond.

(It also often didn't respond on the telnet-based management console.)

Normally this means a failing switch. But this switch didn't seem to be dying; the department wasn't reporting any network issues, and when you could talk to the switch, it would report no errors or problems. It was just that fairly often, it wouldn't talk to the monitoring system. Various people got pulled in to try to figure out what was wrong, and what could be done about it, and finally they found it.

The switch was configured with two VLANs, an 'inside' and an 'outside', because the department had been planning to introduce a firewall. However, they hadn't gotten around to doing so, and in the meantime they'd simply used a network cable to directly connect what would have been the firewall's inside and outside network ports. Let us call these ports A (on the outside VLAN) and B (on the inside VLAN).

Switches need to maintain a mapping between Ethernet addresses and ports that they're reached on (otherwise they turn into hubs). As it happens, this switch only had a single global mapping table, not a per-VLAN mapping table, and the mapping was maintained by the switch's management processor, not its core switching engine.

(Roughly speaking, switches are divided into a high-speed switching engine and a slower management processor. The switching engine directly handles simple things and defers more complicated situations to the management processor, which is also responsible for answering SNMP queries and so on.)

So imagine what happens when a packet from the network router flows through the switch to an internal port. First, the switch sees a packet from the router's MAC on the router port, so it learns that MAC/port association. The packet then goes out port A so it can hop between the outside and inside VLANs, and suddenly the switch sees a packet from the router's MAC on port B. Since the switch only has a single global mapping table, it must now remove the old association of that MAC with the router port and add a new one associating it with port B. This entire port association flip-flop repeats for every packet from the outside world to a local machine, and it also happens in reverse for every packet from a local machine to the outside world (as first the switch sees the machine's MAC on its actual port, and then on port A). And every flip-flop has to be handled by the management processor.
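To make the flip-flop concrete, here is a minimal toy model in Python. It is only a sketch of MAC learning, not the actual switch's firmware; the class name, the port and MAC labels, and the `per_vlan_table` switch are all made up for illustration. It counts how often an already-learned MAC has to be moved to a different port.

```python
class LearningSwitch:
    """Toy model of MAC learning (hypothetical, for illustration only)."""

    def __init__(self, per_vlan_table):
        self.per_vlan_table = per_vlan_table
        self.table = {}   # (vlan, mac) or mac -> port
        self.moves = 0    # times an existing entry had to change ports

    def learn(self, vlan, mac, port):
        # A global table keys on the MAC alone; a per-VLAN table keys
        # on the (VLAN, MAC) pair.
        key = (vlan, mac) if self.per_vlan_table else mac
        old_port = self.table.get(key)
        if old_port is not None and old_port != port:
            self.moves += 1
        self.table[key] = port


def simulate(per_vlan_table, packets):
    """Send `packets` packets from the router to an inside machine."""
    sw = LearningSwitch(per_vlan_table)
    for _ in range(packets):
        # The packet arrives from the router on the outside VLAN.
        sw.learn("outside", "router-mac", "router-port")
        # It leaves via port A, crosses the cable, and re-enters on
        # port B, which is on the inside VLAN.
        sw.learn("inside", "router-mac", "port-B")
    return sw.moves


print("global table moves:  ", simulate(False, 1000))
print("per-VLAN table moves:", simulate(True, 1000))
```

In this model the single global table moves the router's MAC on essentially every hop (twice per packet after the first), while a per-VLAN table never has to move it at all, because the 'outside' and 'inside' sightings are separate entries.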

As it happened, the management processor was basically melting down under the load of handling all of these flip-flops. When this happened, the management processor did the sensible thing and devoted all of its CPU power to the high-priority task of mapping table maintenance, and dropped lower-priority jobs on the floor, jobs such as responding to SNMP queries or to the management console.
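The effect on monitoring can be sketched the same way. This is a deliberately crude model with invented numbers (the work budget, update rates, and function name are all assumptions, not measurements from the real switch): each tick the management processor does table updates first and only answers SNMP polls with whatever capacity is left over.

```python
def run_management_cpu(table_updates, snmp_polls, budget_per_tick, ticks):
    """Return how many SNMP polls get answered over `ticks` ticks.

    Hypothetical model: table maintenance always runs first, and polls
    that cannot be served within their tick are dropped (they time out).
    """
    answered = 0
    backlog = 0
    for _ in range(ticks):
        backlog += table_updates              # new flip-flops this tick
        done = min(backlog, budget_per_tick)  # high-priority work first
        backlog -= done
        spare = budget_per_tick - done
        answered += min(snmp_polls, spare)    # low priority gets leftovers
    return answered


print("light load:", run_management_cpu(10, 1, 100, 60), "of 60 polls answered")
print("heavy load:", run_management_cpu(200, 1, 100, 60), "of 60 polls answered")
```

When the update rate exceeds what the processor can handle, no capacity is ever left for polls, which from the outside looks exactly like a switch that has stopped responding even though it is still forwarding traffic.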

(I should note that this was not a cheap switch; this was just quite a while ago, back when gigabit was an expensive novelty, 100 Mbits was pretty fast, and mammoths had just stopped roaming the earth.)
