Our current problems with 10G Intel networking on OmniOS

November 15, 2014

In my writeup on our new OmniOS fileservers I mentioned that we had built them out with 10G-T networking for both their iSCSI networks (using onboard Intel X540-AT2 based ports) and their public NFS interface (using one port of a dual-port Intel 82599EB TN card). Since then, well, things have not gone so well, and in fact we're in the process of moving all production fileservers to 1G networking until we understand what's going on and can fix it.

The initial problems involved more or less total server lockups on our most heavily used fileserver. Due to some warning messages on the console and previous weird issues with onboard ports, we added a second dual-port card to that fileserver and moved its iSCSI networks to the new card. We also had iSCSI networking issues on two other servers, one of which was also switched to use a second dual-port card for iSCSI networking.

(At this point the tally is two fileservers using the onboard ports for 10G iSCSI and two fileservers using second dual-port cards for it.)

The good news is that the fileservers mostly stopped locking up at this point. The bad news is that both actively used dual-port cards wound up getting themselves into a state where the ixgbe driver couldn't talk properly to the second port, and this had very bad effects, most specifically extremely long spinlock hold times. At first we saw this only with the first card that had been replaced, on our most-used fileserver, so it was possible for me to believe that this was just a hardware fault (after all, the second port was working fine on the less used fileserver). Today we had exactly the same issue appear on the other fileserver, so it seems extremely likely that some combination of a driver bug and a hardware bug is involved, one that probably becomes more and more likely to manifest as you pass more traffic through the ports.

(On top of that problem, we also found a consistent once-a-second 20 msec lock hold time and stall in the ixgbe driver when dealing with those onboard X540-AT2 ports. Interested parties are directed to this email to the illumos-developer mailing list for full details about both issues. Note that it was written when I still thought the big stall might be due to faulty hardware on a single card.)
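(As an illustration of what 'consistent once a second' means in practice, here is a minimal sketch of checking whether long lock hold events recur at a regular interval. It assumes you have already extracted (timestamp, hold time) pairs from whatever tracing you use; the function names, the 10 msec threshold, and the sample numbers are all made up for illustration and are not taken from our actual measurements.)

```python
from statistics import median


def stall_intervals(events, threshold_ms=10.0):
    """Return the gaps (in seconds) between consecutive stalls.

    events: (timestamp_seconds, hold_time_ms) pairs, sorted by time.
    threshold_ms: hold times at or above this count as a stall
                  (an assumed cutoff, not a measured one).
    """
    stall_times = [t for t, hold_ms in events if hold_ms >= threshold_ms]
    return [later - earlier for earlier, later in zip(stall_times, stall_times[1:])]


def looks_periodic(intervals, tolerance=0.1):
    """True if every gap is within `tolerance` (as a fraction) of the median gap."""
    if len(intervals) < 2:
        return False
    mid = median(intervals)
    return all(abs(gap - mid) <= tolerance * mid for gap in intervals)


# Hypothetical sample: roughly 20 msec holds about once a second,
# mixed in with ordinary sub-millisecond holds.
sample = [
    (0.00, 20.1), (0.40, 0.3), (1.00, 19.8), (1.50, 0.2),
    (2.01, 20.4), (2.70, 0.4), (3.00, 20.0),
]

gaps = stall_intervals(sample)
print("gaps between stalls:", gaps)
print("periodic:", looks_periodic(gaps))
```

(Comparing against the median gap rather than the mean keeps one missed or extra stall from skewing the baseline too much.)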

My understanding is that the Illumos (and thus OmniOS) ixgbe driver is derived from an upstream general Intel driver through a process that must be done by hand and apparently has not happened for several years. At this point enough bad stuff has shown up that I don't think we can trust the current OmniOS version of the driver and we probably don't want to try Intel 10G-T again until it's updated. Unfortunately I don't have any idea if or when that will happen.

(It also seems unlikely that we'll find any simple or quick reproduction for any of these problems in a test environment. My suspicion is that the dual-port issue is due to some sort of narrow race window involving hardware access, so it may depend not just on total traffic volume but on the sort of traffic you send.)

Written on 15 November 2014.
