Our likely long road to working 10G-T on OmniOS
I wrote earlier about our problems with Intel 10G-T on our OmniOS fileservers and how we've had to fall back to 1G networking. Obviously we'd like to change that and go back to 10G-T. The obvious option was another sort of 10G-T chipset besides Intel's. Unfortunately, as far as we can see Intel's chipsets are the best supported option and eg Broadcom seems even less likely to work well (or at all, and we later had problems with even a Broadcom 1G chipset under OmniOS). So we've scratched that idea; at this point it's Intel or bust.
We really want to reproduce our issues outside of production. While we've set up a test environment and put load on it, we've so far been unable to make it fall over in any clearly networking related way (OmniOS did lock up once under extreme load, but that might not be related at all). We're going to have to keep trying in the new year; I don't know what we'll do if we can't reproduce things.
(We also aren't currently trying to reproduce the dual port card issue. We may switch to this at some point.)
As I said in the earlier entry, we no longer feel that we can trust the current OmniOS ixgbe driver in production. That means going back to production needs an updated driver. At the moment I don't think anyone in the Illumos community is actively working on this (which I can't blame them for), although I believe there's some interest in doing a driver update at some point.
It's possible that we could find some money to sponsor work on updating the ixgbe driver to the current upstream Intel version, and so get it done that way (assuming that this sort of work can be sponsored for what we can afford, which may be dubious). Unfortunately our constrained budget situation means that I can't argue very persuasively for exploring this until we have some confidence that the current upstream Intel driver would fix our issues. This is hard to get without at least some sort of reproduction of the problem.
(What this says to me is that I should start trying to match up driver versions and read driver changelogs. My guess is that the current Linux driver is basically what we'd get if the OmniOS driver was resynchronized, so I can also look at it for changes in the areas that I already know are problems, such as the 20msec stall while fondling the X540-AT2 ports.)
While I don't want to call it 'ideal', I would settle for a way to reproduce the dual card issue with simply artificial TCP network traffic. We could then change the server from OmniOS to an up to date Linux to see if the current Linux driver avoids the problem under the same load, then use this as evidence that commissioning an OmniOS driver update would get us something worthwhile.
None of this seems likely to be very fast. At this point, getting 10G-T back in six months seems extremely optimistic.
(The pessimistic view of when we might get our new fileserver environment back to 10G-T is obviously 'never'. That has its own long-term consequences that I don't want to think about right now.)
Sidebar: the crazy option
The crazy option is to try to learn enough about building and working on OmniOS so that I can build new ixgbe driver versions myself and so attempt either spot code modifications or my own hack testing on a larger scale driver resynchronization. While there is a part of me that finds this idea both nifty and attractive, my realistic side argues strongly that it would take far too much of my time for too little reward. Becoming a vaguely competent Illumos kernel coder doesn't seem like it's exactly going to be a small job, among other issues.
(But if there is an easy way to build new OmniOS kernel components, it'd be useful to learn at least that much. I've looked into this a bit but not very much.)