Our likely long road to working 10G-T on OmniOS

December 19, 2014

I wrote earlier about our problems with Intel 10G-T on our OmniOS fileservers and how we've had to fall back to 1G networking. Obviously we'd like to change that and go back to 10G-T. The obvious option was another sort of 10G-T chipset besides Intel's. Unfortunately, as far as we can see Intel's chipsets are the best supported option and eg Broadcom seems even less likely to work well (or at all, and we later had problems with even a Broadcom 1G chipset under OmniOS). So we've scratched that idea; at this point it's Intel or bust.

We really want to reproduce our issues outside of production. While we've set up a test environment and put load on it, we've so far been unable to make it fall over in any clearly networking related way (OmniOS did lock up once under extreme load, but that might not be related at all). We're going to have to keep trying in the new year; I don't know what we'll do if we can't reproduce things.

(We also aren't currently trying to reproduce the dual port card issue. We may switch to this at some point.)

As I said in the earlier entry, we no longer feel that we can trust the current OmniOS ixgbe driver in production. That means going back to production needs an updated driver. At the moment I don't think anyone in the Illumos community is actively working on this (which I can't blame them for), although I believe there's some interest in doing a driver update at some point.

It's possible that we could find some money to sponsor work on updating the ixgbe driver to the current upstream Intel version, and so get it done that way (assuming that this sort of work can be sponsored for what we can afford, which may be dubious). Unfortunately our constrained budget situation means that I can't argue very persuasively for exploring this until we have some confidence that the current upstream Intel driver would fix our issues. This is hard to get without at least some sort of reproduction of the problem.

(What this says to me is that I should start trying to match up driver versions and read driver changelogs. My guess is that the current Linux driver is basically what we'd get if the OmniOS driver was resynchronized, so I can also look at it for changes in the areas that I already know are problems, such as the 20msec stall while fondling the X540-AT2 ports.)

While I don't want to call it 'ideal', I would settle for a way to reproduce the dual card issue with simple artificial TCP network traffic. We could then change the server from OmniOS to an up to date Linux to see if the current Linux driver avoids the problem under the same load, then use this as evidence that commissioning an OmniOS driver update would get us something worthwhile.
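As a minimal sketch of what 'simple artificial TCP traffic' could look like, here is a small Python load generator that opens several TCP connections and pushes filler data through them into a sink that just counts bytes. Everything in it is an assumption made for illustration (the function names, the chunk size, the connection count and the byte counts); it is not what we actually ran. For demonstration it runs both ends in one process over loopback. In a real test the sink would run on the fileserver and the senders on client machines, probably with much more data and many more connections.

```python
import socket
import threading
import time

CHUNK = 64 * 1024  # bytes per send/recv call; an arbitrary illustrative choice


def run_sink(listener, totals, nconns):
    """Accept nconns connections and count the bytes each one delivers."""
    def drain(conn, idx):
        with conn:
            while True:
                data = conn.recv(CHUNK)
                if not data:  # peer closed the connection
                    break
                totals[idx] += len(data)

    workers = []
    for idx in range(nconns):
        conn, _ = listener.accept()
        t = threading.Thread(target=drain, args=(conn, idx))
        t.start()
        workers.append(t)
    for t in workers:
        t.join()


def blast(host, port, nbytes):
    """Open one TCP connection and push nbytes of filler data through it."""
    payload = b"\0" * CHUNK
    sent = 0
    with socket.create_connection((host, port)) as s:
        while sent < nbytes:
            n = min(CHUNK, nbytes - sent)
            s.sendall(payload[:n])
            sent += n
    return sent


def generate_load(host="127.0.0.1", nconns=4, nbytes=4 * 1024 * 1024):
    """Run sink and senders in one process; return (bytes received, seconds)."""
    totals = [0] * nconns
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind((host, 0))  # port 0: let the OS pick a free port
        listener.listen(nconns)
        port = listener.getsockname()[1]
        sink = threading.Thread(target=run_sink,
                                args=(listener, totals, nconns))
        start = time.monotonic()
        sink.start()
        senders = [threading.Thread(target=blast, args=(host, port, nbytes))
                   for _ in range(nconns)]
        for t in senders:
            t.start()
        for t in senders:
            t.join()
        sink.join()
        elapsed = time.monotonic() - start
    return sum(totals), elapsed


if __name__ == "__main__":
    received, elapsed = generate_load()
    print(f"{received} bytes in {elapsed:.3f}s")
```

The point of keeping the generator this dumb is that the same script could be pointed at the same hardware under OmniOS and then under Linux, so that any difference in behaviour is more plausibly down to the driver than to the load.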

None of this seems likely to be very fast. At this point, getting 10G-T back in six months seems extremely optimistic.

(The pessimistic view of when we might get our new fileserver environment back to 10G-T is obviously 'never'. That has its own long-term consequences that I don't want to think about right now.)

Sidebar: the crazy option

The crazy option is to try to learn enough about building and working on OmniOS so that I can build new ixgbe driver versions myself and so attempt either spot code modifications or my own hack testing on a larger scale driver resynchronization. While there is a part of me that finds this idea both nifty and attractive, my realistic side argues strongly that it would take far too much of my time for too little reward. Becoming a vaguely competent Illumos kernel coder doesn't seem like it's exactly going to be a small job, among other issues.

(But if there is an easy way to build new OmniOS kernel components, it'd be useful to learn at least that much. I've looked into this a bit but not very much.)

Comments on this page:

By Bacon at 2014-12-19 05:03:56:

We're successfully using 10g HP cards in our Linux based servers. HP seems to have good support for their 10G products. The only issue we had was the need to install the official HP drivers, but since we're using a non-bleeding edge distro (RedHat 6.4), driver installation was smooth.

By cks at 2014-12-22 20:04:42:

As an update so I remember it: @ch2500 suggested on Twitter that another option is to run OmniOS virtualized on top of Linux, using virtualized networking drivers. This is probably a little bit too crazy for us (unless we get desperate), but it would be one way of dealing with OmniOS driver issues. While it'd probably cost us some theoretically available 10G performance, I'd expect it to give us more than 1G performance which is, after all, what we're getting now.

(I may actually experiment with this at some point just to see how hard it would be and how much performance we'd probably get or lose.)

By Vadim Comanescu at 2014-12-23 09:48:38:

Hi Chris,

Nice writeup. We at Syneto encountered this problem lately in production with some of our clients. We are still trying to find a solution to this problem, although as you have mentioned the only viable way to go ahead with the Intel chipset cards would be to update the ixgbe driver. What we have tried is to put the 10g cards on different processors and to switch from Intel cards to Supermicro 10g cards with an Intel chipset (AOC-STGN-i2S). So far the Supermicro cards are working fine, although we believe it's a matter of time until the problem reproduces since they use the exact same driver. Maybe the signalling is different on the Supermicro card ...

At the moment the only viable solution we have tested and validated for 10g on illumos is using the Chelsio 520-CR. The cxgbe driver works fine and is supported by Chelsio specifically for Illumos. It was written for Illumos from my understanding; it's not a FreeBSD port like the Intel one.

Let's hope we can sort this out soon. Cheers.

By cks at 2014-12-23 15:12:39:

It looks like the Chelsio products don't currently support 10G-T, which is a requirement for us. I suspect that they're also way out of our price range.

In general the situation with the ixgbe driver seems to depend very much on the specific chipset involved. While it is a unified driver, different chipsets are handled somewhat differently internally and I can believe that there's a chipset that it's okay with. Note that we're using 10G-T Intel, which I believe will generally be a somewhat different chipset than SFP+ 10G.

By Anonymous at 2014-12-30 06:48:52:

Solaris 11 x86 - $2k per year per 2-socket server for support. I wonder if it would solve your issues and if it wouldn't actually be cheaper for you, considering how much time and effort you are spending on the issue.

By cks at 2014-12-30 15:36:40:

In a university environment, staff time and effort are considered almost free for somewhat complicated reasons. The result is that we can spend man-months of time on this issue without blinking, but the budget does not have actual cash for large expenses.

I believe that there have been some significant updates for Intel 10GbE in the past year. It may be worth looking at again. Furthermore, I have Solarflare 10GbE (and 40GbE!) working in illumos now; waiting for code review feedback before proceeding to integration with that though.

By cks at 2016-03-29 12:10:18:

I may be looking in the wrong place, but based on git logs OmniOS r151014 hasn't had any ixgbe driver changes since 2014 and even the upstream Illumos has only one (this one, per this overview). In general I've been keeping an eye on all of the public Illumos-related trees that I know of and sadly I haven't seen anything go in for ixgbe.

Written on 19 December 2014.

Last modified: Fri Dec 19 01:01:42 2014