A retrospective on our OmniOS ZFS-based NFS fileservers
Our OmniOS fileservers have now been out of service for about six months, which makes it somewhat past time for a retrospective on them. Our OmniOS fileservers followed on our Solaris fileservers, which I wrote a two part retrospective on (part 1, part 2), and have now been replaced by our Linux fileservers. To be honest, I have been sitting on my hands about writing this retrospective because we have mixed feelings about our OmniOS fileservers.
I will put the summary up front. OmniOS worked reasonably well for us over its lifespan here, and looking back I think it was almost certainly the right choice for us at the time we made that choice (which was 2013 and 2014). However, it was not without issues that marred our experience with it in practice, although not enough to make me regret that we ran it (and ran it for as long as we did). Some of our issues were likely due to a design mistake in making our fileservers too big, a mistake that was probably magnified when we were unable to use Intel 10G-T networking on OmniOS.
On the one hand, our OmniOS fileservers worked, almost always reliably. Like our Solaris fileservers before them, they ran quietly for years without needing much attention, delivering NFS fileservice to our Ubuntu servers; specifically, we ran them for about five years (2014 through 2019, although we started migrating away at the end of 2018). Over this time we had only minor hardware issues and not all that many disk failures, and we suffered no data loss (with ZFS checksums likely saving us several times, and certainly providing good reassurance). Our overall environment was easy to manage and was pretty much problem free in the face of things like failed disks. I'm pretty sure that our users saw an NFS environment that was solid and reliable, and that performed well pretty much all of the time, which is the important thing. So OmniOS basically delivered the fileserver environment we wanted.
(Our Linux iSCSI backends ran so problem free that I almost forgot to mention them here; we basically got to ignore them the entire time we ran our OmniOS fileserver environment. I think that they routinely had multi-year uptimes; certainly they didn't go down outside of power shutdowns (scheduled or unscheduled).)
On the other hand, we ran into real limitations with OmniOS, and our fileservers were always somewhat brittle under unusual conditions. The largest limitation was the lack of working 10G-T Ethernet (with Intel hardware); now that we have Linux fileservers with 10G-T, it's fairly obvious what we were missing and that it really did matter. Our OmniOS fileservers were also not fully reliable; they would lock up, reboot, or perform very badly under an array of fortunately exceptional conditions, to a far greater degree than we liked (for example, with filesystems that hit quota limits). We also had periodic issues from having two iSCSI networks, where OmniOS would decide to use only one of them for one or more iSCSI targets and we had to fiddle things in magic ways to restore our redundancy. It says something that our OmniOS fileservers were by far the most crash-prone systems we operated, even if they didn't crash very often. As with our 10G-T problems, some of the causes of these issues were identified, but to the best of my knowledge they were never addressed in the OmniOS and Illumos kernel.
(To be clear here, I did not expect them to be; the Illumos community only has so many person-hours available, and some of what we uncovered were hard problems in areas like kernel memory management.)
Our OmniOS fileservers were also harder for us to manage for an array of reasons that I mostly covered when I wrote about how our new fileservers wouldn't be based on Illumos, and in general there are costs we paid for not using a mainstream OS (costs that would be higher today). With that said, there are some things that I currently do miss about OmniOS, such as DTrace and our collection of DTrace scripts. Ubuntu may someday have an equivalent through eBPF tools, but Ubuntu 18.04 doesn't today.
In the final summary, I don't regret us running our OmniOS servers when we did and for as long as we did, but on the whole I'm glad that we're not running them any more, and I think our current fileserver architecture is better overall. I'm thankful for OmniOS's (and thus Illumos's) faithful service here, but I don't miss it.
PS: Some of our OmniOS issues may have been caused by using iSCSI instead of directly attached disks, and certainly using directly attached disks would have made for smaller fileservers, but I suspect that we'd have found another set of problems with directly attached disks under OmniOS. And some of our problems, such as with filesystems that hit quota limits, are very likely to be independent of how disks were attached.
The history and background of us using Prometheus
On Prometheus and Grafana after a year, a commenter asked some good questions:
Is there a reason why you went with a "metrics-based" (?) monitoring solution like Prometheus-Grafana, and not a "service-based" system like Zabbix (or Nagios)? What (if anything) was being used before the current P-G system?
I'll start with the short answer, which is that we wanted metrics as well as alerting and operating one system is simpler than operating two, even if Prometheus's alerting is not necessarily as straightforward as something intended primarily for that. The longer answer is in the history of how we got here.
Before the current Prometheus system, what we had was based on Xymon and had been in place sufficiently long that portions of it talked about 'Hobbit' (the pre-2009 name of Xymon, cf). Xymon as we were operating it was almost entirely a monitoring and alerting system, with very little to nothing in the way of metrics and metrics dashboards. We've understood for a long time that having metrics is important and we wanted to gather and use them, but we had never managed to turn this desire into actually doing anything (at one point I sort of reached a decision on what to build, but then I never actually built anything for various reasons).
In the fall of 2018 (last year), our existing Xymon setup reached a critical point where we couldn't just let it be, because it was hosted on an Ubuntu 14.04 machine. For somewhat unrelated reasons I wound up looking at Prometheus, and its quick-start demonstration sold me on the idea that it could easily generate useful metrics in our environment (and then let us see them in Grafana). My initial thoughts were to split metrics apart from alerting and to start by setting up Prometheus as our metrics system, then figure out alerting later. I set up a testing Prometheus and Grafana for metrics on a scratch server around the start of October.
Since we were going to run Prometheus and it had some alerting capabilities, I explored whether it could more or less sufficiently cover our alerting needs. It turned out that it could, although perhaps not in an ideal way. However, running one system and gathering information once (more or less) is less work than also trying to pick a modern alerting system, set it up, and set up monitoring for it, especially if we wanted to do it on a deadline (with the end of Ubuntu's support for 14.04 looming over us). We decided that we would at least get Prometheus in place now to replace Xymon, even if it wasn't ideal, and then possibly implement another alerting system later at more leisure if we decided that we needed to. So far we haven't felt a need to go that far; our alerts work well enough in Prometheus, and we don't have all that many custom 'metrics' that really exist only to trigger alerts.
(Things we want to alert on often turn out to also be things that we want to track over time, more often than I initially thought. We've also wound up doing more alerting on metrics than I expected us to.)
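As a hedged illustration of what alerting on metrics looks like in Prometheus, here is a minimal rule file sketch. The metric, thresholds, and labels are invented for the example (the standard `up` metric aside) and are not our actual configuration:

```yaml
# Sketch of a Prometheus alerting rule file (loaded via rule_files in
# prometheus.yml). The group and label values here are hypothetical.
groups:
  - name: example-alerts
    rules:
      - alert: HostDown
        # 'up' is the metric Prometheus itself records for each scrape target.
        expr: up == 0
        # Only fire after the condition has held for five minutes,
        # to avoid paging on a single missed scrape.
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.instance }} has been down for five minutes"
```

The appeal of this approach is that the alert condition is just a query over metrics you are already collecting, which is part of why things we alert on tend to also be things we track over time.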
Given this history, it's not quite right for me to say that we chose Prometheus over other alternative metrics systems. Although we did do some evaluation of other options after I tried Prometheus's demo and started exploring it, what it basically boiled down to was that we had decent confidence Prometheus could work (for metrics) and none of the other options seemed clearly better to the point where we should spend the time exploring them as well. Prometheus was not necessarily the best; it had just sold us on being good enough.
(Some of the evaluation criteria I used turned out to be incorrect, too, such as 'is it available as an Ubuntu package'. In the beginning that seemed like an advantage for Prometheus and for anything else packaged that way, but then we wound up abandoning the Ubuntu Prometheus packages as being too out of date.)