Hardware can be weird, Intel 10G-T X540-AT2 edition
Every so often I get a pointed reminder that hardware can be very weird. As I mentioned on Twitter today, we've been having one of those incidents recently. The story starts with the hardware for our new fileservers and iSCSI backends, which is built around SuperMicro X9SRH-7TF motherboards. These have an onboard Intel X540-AT2 chipset that provides two 10G-T ports. The SuperMicro motherboard and BIOS light up these ports no later than when you power the machine on and leave it sitting in the BIOS, and maybe earlier (I haven't tested).
On some but not all of our motherboards, the first 10G-T port lights up (in the BIOS) at 1G instead of 10G. When we first saw this on a board we thought we had a failed board and RMA'd it. The replacement board behaved the same way, but when we booted an OS (I believe a Linux) the port came up at 10G, so we assumed that all was well. Then we noticed that some but not all of our newly installed OmniOS fileservers had their first port (still) coming up at 1G. At first we thought we had cable issues, but the cables were good.
In the process of testing the situation out, we rebooted one OmniOS fileserver off a CentOS 7 live CD to see if Linux could somehow get 10G out of the hardware. Somewhat to my surprise it could (and a real full 10G at that). More surprising, the port stayed at 10G when we rebooted into OmniOS. It stayed at 10G in OmniOS over a power cycle, and it even stayed at 10G after a full power off where we cut power to the entire case for several minutes. Further testing showed that it was sufficient merely to boot the CentOS 7 live CD on an affected server without ever configuring the interface (although it's possible that the live CD configures the interface up to try DHCP and then brings it down again).
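(If you want to check for this sort of thing from a Linux live environment, here is a minimal sketch. It is not what we actually ran; it simply assumes the standard Linux sysfs attribute /sys/class/net/<interface>/speed, which reports the negotiated link speed in Mbit/s. The function names and the sysfs_root parameter are my own inventions for illustration, and the interface names on any given machine will differ.)

```python
import os

EXPECTED_MBPS = 10000  # a 10G-T link, in the Mbit/s units that sysfs uses


def read_link_speed(iface, sysfs_root="/sys/class/net"):
    """Return the negotiated speed of iface in Mbit/s, or None if unknown."""
    path = os.path.join(sysfs_root, iface, "speed")
    try:
        with open(path) as f:
            speed = int(f.read().strip())
    except (OSError, ValueError):
        # The attribute can be missing or unreadable, for example when the
        # interface does not exist or is administratively down.
        return None
    # Some drivers report -1 (or 0) when there is no link.
    return speed if speed > 0 else None


def slow_ports(ifaces, expected=EXPECTED_MBPS, sysfs_root="/sys/class/net"):
    """Return {iface: speed} for ports whose known speed is below expected."""
    result = {}
    for iface in ifaces:
        speed = read_link_speed(iface, sysfs_root)
        if speed is not None and speed < expected:
            result[iface] = speed
    return result
```

Calling slow_ports() with the names of the two onboard ports would then report any port that has negotiated 1G instead of 10G. Ports with no link at all are deliberately skipped rather than reported as slow.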
There's a lot of weirdness here. It'd be one thing for the Linux driver to bring up 10G where the OmniOS one didn't; then it could be that the Linux driver was more comprehensive about setting up the chipset properly. For it to be so firmly persistent is another thing, though; it suggests that Linux is reprogramming something that stays programmed in nonvolatile storage. And then there's the matter of this happening only on some motherboards and only to one port out of two that are driven by the same chipset.
Ultimately, who knows. We're happy because we apparently have a full solution to the problem, one we've actually carried out on all of the machines now because we needed to get them into production.
(As far as we can easily tell, all of the motherboards and the motherboard BIOSes are the same. We haven't opened up the cases to check the screen printing for changes and aren't going to; these machines are already installed and in production.)
Written on 08 August 2014.