Wandering Thoughts archives

2014-11-16

We've started to decommission our Solaris 10 fileservers

Our migration from our old fileservers to our new fileservers has been a slow process that's hit some rough spots. Still, we've now reached a distinct point that I want to mark: this past week we did 'zpool destroy' on one old fileserver's pools and powered down its backend disks.

While we can migrate rapidly when we need to, our decommissionings usually go slowly. Unless we need the rack space or the hardware for something, we mostly leave old servers and so on running until they get annoying for some reason. This makes the whole thing a gradual process, instead of the big bang that I expect some people have. In our case we actually started losing bits of the old fileserver environment almost a year ago, when we started removing bits of our old test environment. More recently we used our hot spare fileserver as a hardware donor during a crisis; we're unlikely to replace it, although we theoretically could.

(In fact we decommissioned this particular fileserver mostly because it was getting annoying. One of the disks in one of its backends failed, causing us to get plaintive email about the situation, so we decided that we wanted to power off all of the remaining disks to preserve them as potential future spares. Destroying the pools instead of just exporting them ensures that the disks won't be seen as still in use by a ZFS pool if we reuse them later.)
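(As a quick sketch of the difference between the two, with a made-up pool name standing in for our real ones:

    # 'zpool export' leaves the ZFS labels on the disks intact, so the
    # disks still claim membership in the pool and it can be imported
    # again later:
    zpool export fs2-oldpool
    # 'zpool destroy' marks the pool as destroyed in those labels, so
    # the disks read as free for reuse:
    zpool destroy fs2-oldpool
    # (a destroyed pool can still be recovered with 'zpool import -D'
    # until its disks are actually overwritten)

Here fs2-oldpool is purely a hypothetical name for illustration.)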

On the one hand, this is a trivial step and the fileserver had already had its last filesystems migrated a week or so before this (and we've destroyed some migrated filesystems for various reasons, which is far more irreversible than 'zpool destroy'). On the other hand, it simply feels significant. In a way, it makes the whole thing real; we've decisively torn down what used to be a production fileserver. I can now really believe in a relatively near future where we have no Solaris 10 machines any more.

(It won't be an immediate future; given recent issues all remaining filesystem migrations to the new fileservers are probably on hold until January. In general we try not to do too much in late November and early December due to the university's long holiday around Christmas and the end of the year.)

SolarisDecommissioningStarts written at 22:18:21

2014-11-15

Our current problems with 10G Intel networking on OmniOS

In my writeup on our new OmniOS fileservers I mentioned that we had built them out with 10G-T networking for their iSCSI networking (using onboard Intel X540-AT2 based ports) and their public NFS interface (using one port of a dual-port Intel 82599EB TN card). Since then, well, things have not gone so well, and in fact we're in the process of moving all production fileservers to 1G networking until we understand what's going on and can fix it.
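(As an aside, you can see what speed and state each link has actually wound up at with dladm; this is a generic sketch rather than output or port names from our machines:

    # show physical datalinks with their state and negotiated speed
    dladm show-phys
    # watch per-link traffic statistics at five second intervals;
    # 'ixgbe0' is a hypothetical name for one of the 10G ports here
    dladm show-link -s -i 5 ixgbe0

Both are standard illumos dladm usage.)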

The initial problems involved more or less total server lockups on our most heavily used fileserver. Due to some warning messages on the console and previous weird issues with onboard ports, we added a second dual-port card and moved the iSCSI networks to them. We also had iSCSI networking issues on two other servers, one of which was also switched to use a second dual-port card for iSCSI networking.

(At this point the tally is two fileservers using the onboard ports for 10G iSCSI and two fileservers using second dual-port cards for it.)

The good news is that the fileservers mostly stopped locking up at this point. The bad news is that both actively used dual-port cards wound up getting themselves into a state where the ixgbe driver couldn't talk properly to the second port, and this had very bad effects, most specifically extremely long lock hold times with spinlocks. At first we saw this only with the first card that had been replaced, on our most-used fileserver, so it was possible for me to believe that this was just a hardware fault (after all, the second port was working fine on the less-used fileserver). Today we had exactly the same issue appear on the other fileserver, so it seems extremely likely that there is some combination of a driver bug and a hardware bug involved, one that is probably more and more likely to manifest as you pass more traffic through the ports.

(On top of that problem, we also found a consistent once-a-second 20 msec lock hold time and stall in the ixgbe driver when dealing with those onboard X540-AT2 ports. Interested parties are directed to this email to the illumos-developer mailing list for full details about both issues. Note that it was written when I still thought the big stall might be due to faulty hardware on a single card.)
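(For those who want to poke at this sort of thing themselves, illumos's lockstat can report kernel lock hold times; what follows is a minimal sketch, not the exact measurement commands from that email:

    # watch kernel lock hold events for 30 seconds, reporting the top
    # ten events with 8-deep caller stacks:
    lockstat -H -s 8 -D 10 sleep 30
    # watch both contention and hold events together:
    lockstat -A -D 10 sleep 30

Spotting something like a once-a-second 20 msec hold time still takes some sifting through the output.)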

My understanding is that the Illumos (and thus OmniOS) ixgbe driver is derived from an upstream general Intel driver through a process that must be done by hand and apparently has not happened for several years. At this point enough bad stuff has shown up that I don't think we can trust the current OmniOS version of the driver and we probably don't want to try Intel 10G-T again until it's updated. Unfortunately I don't have any idea if or when that will happen.

(It also seems unlikely that we'll find any simple or quick reproduction for any of these problems in a test environment. My suspicion is that the dual-port issue is due to some sort of narrow race window involving hardware access, so it may depend not just on total traffic volume but on the sort of traffic you send.)

OmniOS10GIntelProblems written at 01:10:10


