A retrospective on our Solaris ZFS-based NFS fileservers (part 1)
We're in the slow process of replacing our original Solaris ZFS fileserver environment with a second generation environment. With our current fileservers entering their sunset period, it's a good time to take the uncommon step of looking back over their six years of operation and talking about what went well and what didn't. Today I'm going to lead with the good stuff about our Solaris machines.
(I'm actually a bit surprised that it's been six years, but that's what the dates say. I wrote the fileservers up in October of 2008 and they'd already been in operation for several months at that point.)
The headline result is that our fileserver environment has worked great overall. We've had six years of service with very little disruption and no data loss. We've had many disks die, we've had entire iSCSI backends fail, and through it all ZFS and everything else have kept trucking along. This is actually well above my expectations of six years ago, when I had a very low view of ZFS's long-term reliability and expected to lose a pool to ZFS corruption at some point over the lifetime of our fileservers.
The basics of ZFS have been great and using ZFS has been a significant advantage for us. From my perspective, the two big wins with ZFS have been flexible space management for actual filesystems and ZFS checksums and scrubs, which have saved us in ways large and small. Flexible space management has sometimes been hard to explain to people in a way that they really get, but it's been very nice to simply be able to make filesystems for logical reasons and not have to ask people to pre-plan how much space they get; they can use as little as they need or, more or less, as much as they need.
Solaris in general and Solaris NFS in particular have been solid in normal use and we haven't seen any performance issues. We used to have some mysterious NFS mount permission issues (where a filesystem wouldn't mount or work on some systems) but from what I remember they haven't cropped up on our systems for a few years. Our Solaris 10 update 8 installs may not be the most featureful or up-to-date systems but in general they've given us no problems; they just sit in their racks and run and run and run (much like the iSCSI backends). I think it says good things that they reached over 650 days of uptime recently before we decided to reboot them as a sort of precaution after one crashed mysteriously.
Okay, I'll admit it: Solaris has not been completely and utterly rock solid for us. We've had one fileserver that just doesn't seem to like life, for reasons that we're not sure about; it is far more sensitive to disk errors and it's locked up several times over the years. Since we've replaced the hardware and reinstalled the software, my vague theory is that it's something to do with the NFS load it gets, the disks it's dealing with, or both (it has most of our flaky 1TB Seagate disks, which fail at rates far higher than the other drives).
One Solaris feature deserves special mention. DTrace (and with it Solaris source code) turned out to be a serious advantage and very close to essential for solving an important performance problem we had. We might have eventually found our issue without DTrace but I'm pretty sure DTrace made it faster, and DTrace has also given us useful monitoring tools in general. I've come around to considering DTrace an important feature and I'm glad I get to keep it in our second generation environment (which will be using OmniOS on the fileservers).
I guess the overall summary is that for six years, our Solaris ZFS-based NFS fileservers have been boring almost all of the time; they work and they don't cause problems, even when crazy things happen. This has been especially true for the last several years, i.e. after we shook out the initial problems and got used to what to do and what not to do.
(We probably could have made our lives more exciting for a while by upgrading past Solaris 10 update 8 but we never saw any reason to do that. After all, the machines worked fine with S10U8.)
That isn't to say that Solaris has been completely without problems and that everything has worked out for us as we planned. But that's for another entry (this one is already long enough).
Update: in the initial version of this entry I completely forgot to mention that the Solaris iSCSI initiator (the client) has been problem free for us (and it's obviously a vital part of the fileserver environment). There are weird corner cases but those happen anywhere and everywhere.