2011-03-16
Thinking about our alternatives to Solaris
In light of yesterday's entry, the obvious question is what options we have for an alternative to Solaris in the next generation of our fileserver infrastructure. I've been thinking about this for a while, and at the moment there are three main alternatives.
The obvious option is Illumos, the open source continuation of OpenSolaris. However, there are a lot of question marks around whether Illumos will really be able to deliver a viable open source Solaris alternative. There are also even bigger question marks about whether it makes sense for us to build a core piece of infrastructure on top of an open source OS that we basically have to maintain ourselves. I say that we'll have to maintain it ourselves because I don't expect an Illumos-based equivalent of Red Hat to appear, someone who provides a long-term supported stable version with security and functionality updates.
(There are people building storage products with Illumos, but we don't want a storage product; we want a distribution.)
Next up is FreeBSD with ZFS. This still suffers from many of the issues I've written about before, and my gut doesn't like it. However, I have to admit that if FreeBSD remains committed to ZFS, it may be a better alternative than Illumos, since the FreeBSD team already has a long track record of delivering a solid OS. And Oracle's decisions about OpenSolaris mean that the rate of ZFS changes that FreeBSD has to integrate may slow way down.
The pessimistic flipside for both Illumos and FreeBSD is that right now we have no idea if Oracle's public ZFS codebase will ever get any more updates; OpenSolaris may never get another public code drop. If it does not, both Illumos and FreeBSD would be left to reverse engineer and reimplement ZFS bugfixes and improvements, which might well not happen for one or both of them. This would effectively fork ZFS, and I don't know if the open source ZFS would really get developed further.
The final alternative is Linux with btrfs, which promises the good features of ZFS but is currently entirely incapable of delivering them. Right now btrfs is somewhere around half-baked, with lots of good intentions, some amount of features, and a huge number of rough edges. Much of this will change in time if things go well, so I certainly hope that the picture looks rather different in two or three years, because there are no really good Linux-based alternatives other than btrfs.
Even in two or three years, building on btrfs will be risky. I'd expect that we'd be on the early edge of serious production deployments of it, so we might find all sorts of problems when using it at scale in an NFS fileserver environment.
2011-03-15
Our uncertain future with Solaris 11
One of the things that I've been doing over the last while is thinking about what our long-term future with Solaris is; my best guess is that 'long term' here starts somewhere around 2013. In theory the simple thing to do is to keep running Solaris in basically the same setup as we have now, just on updated hardware, but there are two potential problems with that idea.
The first is that it isn't at all clear if there will be inexpensive 1U or 2U servers that we can legally run Solaris on, and even if there are, we don't know if we'll be able to afford Oracle's support rates (or at least whether we can justify spending that much money a year on it). In theory there might be officially supported inexpensive 1U servers from HP or Dell, but given Oracle's back and forth reactions to Solaris on third-party hardware, I'm not entirely sure I would trust doing that; even if Oracle supports it this year, what happens if they change their minds again?
(This is one result of Oracle screwing people.)
The second potential problem is that a fair bit of what we do with our fileserver environment today relies on having (Open)Solaris source code available to study, in order to figure out how to talk to undocumented libraries and interpret undocumented data structures. OpenSolaris source code is not being updated right now, and it remains to be seen if it ever will be, or how frequently. Running Solaris without source is significantly more risky for us (one way or another), unless Oracle officially exposes a lot more interfaces and information in Solaris 11. For example, our alternative ZFS spares system (which we wrote for good reasons) relies on the ability to directly see ZFS pool state information, including information that is not exposed by 'zpool status' (even if I wanted to parse its output), and I don't think we'd be very happy going back to the old ZFS spares system (even if it was bug-free this time around).
(Despite what I've said before, I don't think that Solaris 11 is going to go in directions that we don't want. If anything, I'm sort of looking forward to changes like a real package system, and overall I expect it to be better than Solaris 10.)
2011-03-10
What we did to get iSCSI multipathing working on Solaris 10 update 8
I've mentioned in the past that we use multipathing in our fileserver setup, but I've never written down what we needed to do to get this working for us.
First, we're using Solaris MPxIO, not iSCSI's own multipathing; neither Solaris nor (I believe) our Linux backends support MC/S. Also, see my earlier notes.
To start with, each iSCSI backend has two different IP addresses on two different networks. Then:
- we had to configure MPxIO so that it recognized our iSCSI backends as valid multipathing targets. This is done by adding the vendor and product ID to /kernel/drv/scsi_vhci.conf in a special magic format; the comments in this file are actually a good guide to what you need to do (there's a sketch of what such an entry looks like after this list). These days you may find that your iSCSI backend is already automatically recognized by MPxIO and you don't need to do anything, especially if you're using a popular commercial one. Rebooting is required to activate this, but before you do so be very sure that your iSCSI disks have unique serial numbers. You will know that multipathing is working for your disks when they show up as very long names instead of nice short ones.
- we use static target configuration, so we configured each target for each of its iSCSI network IP addresses. I don't know how well this works with any of the dynamic discovery mechanisms, but based on previous experience I would make sure that your targets are only advertising the IPs for your actual storage networks, not, say, their management interface IP as well.
- make sure that the Solaris iSCSI initiator is configured to make two connections per target. You have to do this after the target is configured with both IPs; otherwise, Solaris will happily make two connections to the same IP address, which is not what you want. (This is done with 'iscsiadm modify initiator-node -c 2'; there's a sketch of the whole command sequence after this list.)
- we found that we could not reliably use the onboard nVidia Ethernet ports on our SunFire X2200s. Apparently not even Sun could get good drivers for them. We switched to an Intel dual NIC card and had no problems.
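To make the scsi_vhci.conf step a bit more concrete, here is a minimal sketch of the sort of entry the file's comments describe on Solaris 10; 'VENDOR' and 'MODEL' are placeholders for whatever SCSI vendor and product strings your backends actually report (the vendor field is space-padded to eight characters), so follow the comments in your own copy of the file rather than copying this blindly:

  device-type-scsi-options-list =
      "VENDOR  MODEL", "symmetric-option";
  symmetric-option = 0x1000000;

After the reboot, the long multipath device names mentioned above are the easiest sign that this took effect; if your Solaris version has mpathadm, 'mpathadm list lu' is another way to check that each disk has two paths.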
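Similarly, here is a hedged sketch of the static target configuration and the two-connections setting, using Solaris 10's iscsiadm; the IQN and IP addresses are invented placeholders for one backend reachable on two hypothetical storage networks:

  # one static-config entry per storage network IP of the target
  iscsiadm add static-config iqn.2011-03.com.example:backend1,10.0.1.10:3260
  iscsiadm add static-config iqn.2011-03.com.example:backend1,10.0.2.10:3260
  iscsiadm modify discovery --static enable
  # only now bump the connection count, as noted above
  iscsiadm modify initiator-node -c 2

Afterwards, 'iscsiadm list target -v' is one way to check that each target really has two connections, one per IP.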
Solaris MPxIO defaults to round-robin use of all of the available paths, which is what you want for maximum performance.
Once we got it up, this setup has worked reliably and without problems for us, and at full speed. The fileservers and the backends can talk to each other at gigabit wire bandwidth, and a suitable set of operations (talking to enough different disks) on a Solaris fileserver can read data at over 200 MBytes/sec, saturating both gigabit links to the backends. Note that this is without jumbo frames.
(This speed is purely local; since the fileservers only have a single gigabit link for NFS clients, they will not normally do more than 100 MBytes/sec of IO to the backends. Well, I suppose writes could multiply this; if you managed to write to a pool with enough disks, the 100 MBytes/sec from an NFS client could wind up doubling due to mirroring. In practice, our NFS clients are just not that active.)
Sidebar: troubleshooting performance issues
At the network level iSCSI is just a TCP stream, so the first thing to do with an iSCSI performance issue is to make sure that your network is working in general. If you cannot get sustained full wire bandwidth between your initiator and your target using a tool like ttcp, you have a general network problem that you need to fix.
(This is obviously much easier to test if your targets are running some sort of general OS and aren't just closed appliances.)
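As a sketch of what I mean by that test, assuming ttcp is installed on both machines (the hostname and the -n count here are just illustrative):

  backend1$ ttcp -r -s
  fileserver$ ttcp -t -s -n 16384 backend1

The receiver sinks and discards whatever the transmitter sends (with -n 16384 and the default 8 KB buffers, roughly 128 MB of it), and both ends report the throughput they saw; on a healthy gigabit path that should come out somewhere around 110 MBytes/sec. Any TCP bandwidth tester will do; the point is just to measure the raw network path separately from iSCSI.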
If you see network problems, go through the normal network troubleshooting steps. For example, try connecting an initiator and a target together with a crossover cable so that you can rule out your switch. If you're using jumbo frames, try turning them off to see if it improves things; in fact, consider leaving them turned off unless you see a clear performance advantage to them (my experience with jumbo frames has not been very positive).
Next, of course, you need to verify that the disks on the target can actually deliver the performance that you expect to see. Some machines may have bad or underpowered disk subsystems, or some of the disks might have quietly gone bad. Again, this is a lot easier if your targets are running a real operating system where you can get in to directly measure local disk performance.
You should only start tuning iSCSI parameters or blaming the iSCSI software once you've verified that the network and the disks are both fine. Otherwise you may waste a lot of time chasing red herrings and dead ends.