A theory about our jumbo frame switch firmware bug
In my last entry I mentioned that I now had a theory about our odd switch failure with jumbo frames, where after a power cycle the switch would start doing jumbo frames remarkably slowly until you went into the configuration system and re-selected the 'do jumbo frames' option. This is that theory.
As I've mentioned before, modern switches have two parts: a high-speed switching core and a slower management processor that handles everything else. If the jumbo frames weren't being handled by the switching core but were instead being passed up to the management processor, you could expect things to work but be very slow, which is just what I saw.
So how could things get that way? My theory is that the code that configured the switching core on boot was doing an incomplete job of enabling jumbo frames; it told the switching core to accept them, but didn't turn on everything that was needed to have the switching core actually switch them. The code that got run when you turned on jumbo frames in the configuration system did do the full setup, which is why explicitly 'enabling' jumbo frames in the configuration interface suddenly made them work at full speed.
(This theory also leads to a decent story about how the switch passed the vendor's testing, since most testing starts from factory default settings.)
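To make the theory concrete, here is a toy model of it in Python. Everything in it is hypothetical: I have no idea what the firmware really looks like, and the names (SwitchingCore, accept_jumbo, forward_jumbo, configure_on_boot, configure_from_ui) are made up for illustration. It assumes the core has two separate settings, one for accepting jumbo frames and one for switching them itself, and that the boot-time code only sets the first.

```python
from dataclasses import dataclass

# Largest standard (untagged) Ethernet frame; anything bigger is "jumbo" here.
STANDARD_MAX_FRAME = 1518


@dataclass
class SwitchingCore:
    """Hypothetical model of the core's jumbo-related settings."""
    accept_jumbo: bool = False   # core will take in oversized frames at all
    forward_jumbo: bool = False  # core will switch oversized frames itself

    def path_for(self, frame_len: int) -> str:
        """Return which path a frame of this size would take."""
        if frame_len <= STANDARD_MAX_FRAME:
            return "fast"        # ordinary frames always go through the core
        if not self.accept_jumbo:
            return "drop"        # jumbo frames are not enabled at all
        if self.forward_jumbo:
            return "fast"        # the core switches the frame itself
        return "slow"            # punted up to the management processor


def configure_on_boot(core: SwitchingCore, saved_jumbo: bool) -> None:
    """The hypothesized buggy boot path: an incomplete setup."""
    core.accept_jumbo = saved_jumbo
    # forward_jumbo is never touched here; this is the theorized bug.


def configure_from_ui(core: SwitchingCore, jumbo: bool) -> None:
    """The configuration-system path, which does the full setup."""
    core.accept_jumbo = jumbo
    core.forward_jumbo = jumbo


# After a power cycle with jumbo frames saved as enabled:
rebooted = SwitchingCore()
configure_on_boot(rebooted, saved_jumbo=True)
after_boot = rebooted.path_for(9000)      # "slow"

# Re-selecting the option in the configuration system:
configure_from_ui(rebooted, jumbo=True)
after_reselect = rebooted.path_for(9000)  # "fast"

# A test that starts from factory defaults only ever uses the UI path:
factory = SwitchingCore()
configure_from_ui(factory, jumbo=True)
factory_test = factory.path_for(9000)     # "fast"
```

In this model the bug is only visible if you save the setting, power cycle the switch, and then measure jumbo frame performance without touching the configuration again, which is the one sequence that testing from factory defaults never goes through.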
One of the things that this reinforces for me is that modern hardware is not just hardware; it has a lot of non-trivial software embedded into it. This matters because software generally has much more complicated failure modes than physical hardware, which means that even what we think of as simple hardware can behave very oddly in narrow circumstances.
(The poster child for this is hard drives, which now run a scarily large amount of onboard code to do increasingly sophisticated processing, more or less behind your back. All things considered, I am sometimes impressed that modern HDs work anywhere near as well as they do.)