A retrospective on our OmniOS ZFS-based NFS fileservers
Our OmniOS fileservers have now been out of service for about six months, which makes it somewhat past time for a retrospective on them. Our OmniOS fileservers followed on our Solaris fileservers, which I wrote a two part retrospective on (part 1, part 2), and have now been replaced by our Linux fileservers. To be honest, I have been sitting on my hands about writing this retrospective because we have mixed feelings about our OmniOS fileservers.
I will put the summary up front. OmniOS worked reasonably well for us over its lifespan here and looking back I think it was almost certainly the right choice for us at the time we made that choice (which was 2013 and 2014). However it was not without issues that marred our experience with it in practice, although not enough to make me regret that we ran it (and ran it for as long as we did). Part of our issues are likely due to a design mistake in making our fileservers too big, although this design mistake was probably magnified when we were unable to use Intel 10G-T networking in OmniOS.
On the one hand, our OmniOS fileservers worked, almost always reliably. Like our Solaris fileservers before them, they ran quietly for years without needing much attention, delivering NFS fileservice to our Ubuntu servers; specifically, we ran them for about five years (2014 through 2019, although we started migrating away at the end of 2018). Over this time we had only minor hardware issues and not all that many disk failures, and we suffered no data loss (with ZFS checksums likely saving us several times, and certainly providing good reassurances). Our overall environment was easy to manage and was pretty much problem free in the face of things like failed disks. I'm pretty sure that our users saw a NFS environment that was solid, reliable, and performed well pretty much all of the time, which is the important thing. So OmniOS basically delivered the fileserver environment we wanted.
(Our Linux iSCSI backends ran so problem free that I almost forgot to mention them here; we basically got to ignore them the entire time we ran our OmniOS fileserver environment. I think that they routinely had multi-year uptimes; certainly they didn't go down outside of power shutdowns (scheduled or unscheduled).)
On the other hand, we ran into real limitations with OmniOS and our fileservers were always somewhat brittle under unusual conditions. The largest limitation was the lack of working 10G-T Ethernet (with Intel hardware); now that we have Linux fileservers with 10G-T, it's fairly obvious what we were missing and that it did really matter. Our OmniOS fileservers were also not fully reliable; they would lock up, reboot, or perform very badly under an array of fortunately exceptional conditions to a far greater degree than we liked (for example, filesystems that hit quota limits). We also had periodic issues from having two iSCSI networks, where OmniOS would decide to use only one of them for one or more iSCSI targets and we had to fiddle things in magic ways to restore our redundancy. It says something that our OmniOS fileservers were by far the most crash-prone systems we operated, even if they didn't crash very often. Some of the causes of these issues were identified, much like our 10G-T problems, but they were never addressed in the OmniOS and Illumos kernel to the best of my knowledge.
(To be clear here, I did not expect them to be; the Illumos community only has so many person-hours available, and some of what we uncovered are hard problems in things like the kernel memory management.)
Our OmniOS fileservers were also harder for us to manage for an array of reasons that I mostly covered when I wrote about how our new fileservers wouldn't be based on Illumos, and in general there are costs we paid for not using a mainstream OS (costs that would be higher today). With that said, there are some things that I currently do miss about OmniOS, such as DTrace and our collection of DTrace scripts. Ubuntu may someday have an equivalent through eBPF tools, but Ubuntu 18.04 doesn't today.
In the final summary I don't regret us running our OmniOS servers when we did and for as long as we did, but on the whole I'm glad that we're not running them any more and I think our current fileserver architecture is better overall. I'm thankful for OmniOS's (and thus Illumos') faithful service here without missing it.
PS: Some of our OmniOS issues may have been caused by using iSCSI instead of directly attached disks, and certainly using directly attached disks would have made for smaller fileservers, but I suspect that we'd have found another set of problems with directly attached disks under OmniOS. And some of our problems, such as with filesystems that hit quota limits, are very likely to be independent of how disks were attached.