Our SunFire X2100 nVidia Ethernet experiences
The SunFire X2100s have nVidia motherboards and four onboard Ethernet ports: two nVidia-based ones and two Broadcom ones. In our configuration, one Broadcom port and one nVidia port are used for iSCSI networking, the other nVidia port is used for general system access, and the second Broadcom port is used only by the integrated service processor. Only the iSCSI ports see any significant traffic volume.
What I ran into was that under heavy streaming iSCSI IO (in other words, more or less continuous TCP at close to wire rate), the nVidia iSCSI port would start reporting:
kernel: eth2: too many iterations (6) in nv_nic_irq.
When this happened, network activity on that port either dropped significantly or stopped entirely, with bad overall effects on iSCSI data rates. The Broadcom iSCSI port had no problems, despite seeing the same level of traffic.
My solution was to take a club to the situation by setting a module parameter that raises the limit the message complains about; in /etc/modprobe.conf I set:
options forcedeth max_interrupt_work=100
This seems to have made the problem go away; certainly we don't see either the kernel message or network slowdowns any more, including under sustained IO loads.
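One way to check whether the message really has stopped is to count its occurrences per interface in the kernel log. The following is only a minimal sketch: the function names are made up for illustration, and it assumes your syslog writes kernel messages in the same format as the line quoted above, to a file whose path you supply yourself.

```python
import re
from collections import Counter

# Matches the forcedeth message quoted earlier, for example:
#   kernel: eth2: too many iterations (6) in nv_nic_irq.
# Assumption: the syslog line contains "kernel: <iface>: ..." in this form.
IRQ_WARNING = re.compile(
    r"kernel: (\w+): too many iterations \(\d+\) in nv_nic_irq"
)


def count_irq_warnings(lines):
    """Return a Counter mapping interface name to number of warnings seen."""
    counts = Counter()
    for line in lines:
        match = IRQ_WARNING.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_irq_warnings_in_file(path):
    """Count warnings in a log file; the path is whatever your syslog uses."""
    with open(path, errors="replace") as log:
        return count_irq_warnings(log)
```

Running count_irq_warnings_in_file() on the log before and after the change (and after a period of sustained IO) gives a rough comparison; an empty result for the iSCSI interface is what you would hope to see.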
(Note that we are using the default forcedeth kernel driver, specifically whatever version is included in the kernel.org 188.8.131.52 kernel; it appears that this is version 0.61.)
Sidebar: some references
I haven't found anything that really explains what's going on, assuming that there's even a common cause across all of the cases. Given that this is various versions of potentially buggy hardware combined with a reverse-engineered driver (because nVidia has been less than helpful), there are a lot of potential problems and causes.