A theory about our jumbo frame switch firmware bug

May 16, 2010

Last entry I mentioned that I now had a theory about our odd switch failure with jumbo frames, where after a power cycle the switch would start doing jumbo frames remarkably slowly until you went into the configuration system and re-selected the 'do jumbo frames' option. This is theory.

As I've mentioned before, modern switches have two parts; a high speed switching core and a slower management processor that handles everything else. If the jumbo frames weren't being handled by the switching core but were instead being passed up to the management processor, you could expect things to work but be very slow, which is just what I saw.

So how could things get that way? My theory is that the code that configured the switching core on boot was doing an incomplete job of enabling jumbo frames; it told the switching core to accept them, but didn't turn on everything that was needed to have the switching core actually switch them. The code that got run when you turned on jumbo frames in the configuration system did do the full setup, hence explicitly 'enabling' jumbo frames in the configuration interface suddenly making them work at full speed.

(This theory also leads to a decent story about how the switch passed the vendor's testing, since most testing starts from factory default settings.)

One of the things that this reinforces for me is that modern hardware is not just hardware; it has a lot of non-trivial software embedded into it. This matters because software generally has much more complicated failure modes than physical hardware, which means that even what we think of as simple hardware can behave very oddly in narrow circumstances.

(The poster child for this is hard drives, which now run a scarily large amount of onboard code to do increasingly sophisticated processing, more or less behind your back. All things considered, I am sometimes impressed that modern HDs work anywhere near as well as they do.)

Written on 16 May 2010.
« Why we don't use jumbo frames for iSCSI: a cautionary tale on testing
You should also document why you didn't do attractive things »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Sun May 16 00:55:22 2010
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.