One of our costs of using OmniOS was not having 10G networking
OmniOS has generally been pretty good to us over the lifetime of our second generation ZFS fileservers, but as we've migrated various filesystems from our OmniOS fileservers to our new Linux fileservers, it's become clear that one of the costs we paid for using OmniOS was not having 10G networking.
We certainly started out intending to have 10G networking on OmniOS; our hardware was capable of it, with Intel 10G-T chipsets, and OmniOS seemed happy to drive them at decent speeds. But early on we ran into a series of fatal problems with the Intel ixgbe driver which we never saw any fixes for. We moved our OmniOS machines (and our iSCSI backends) back to 1G, and they have stayed there ever since. When we made this move, we did not have detailed system metrics on things like NFS bandwidth usage by clients, and anyway almost all of our filesystems were on HDs, so 1G seemed like it should be fine. And indeed, we mostly didn't see obvious and glaring problems, especially right away.
Setting up a metrics system (even only on our NFS clients) and
then later moving some filesystems from OmniOS (at 1G) to Linux (at
10G) made it clear that on some filesystems we had definitely been
hitting the 1G bandwidth limit, and that doing so had real impacts.
The filesystem this was most visible on is the one that holds
/var/mail, our central location for people's mailboxes (ie, their
IMAP inbox). This was always on SSDs even on OmniOS, and once we
started really looking it was clearly bottlenecked at 1G. It was
one of the early filesystems we moved to the Linux fileservers, and
the improvement was very visible. Our IMAP server, which has 10G
itself, now routinely has bursts of over 200 Mbytes/sec inbound and
sometimes sees brief periods of almost saturated network bandwidth.
More importantly, the IMAP server's performance is visibly better;
it is less loaded and more responsive, especially at busy times.
(A contributing factor to this is that any number of people have
very big inboxes, and periodically our IMAP server winds up having
to read through all of such an inbox. This creates a very asymmetric
traffic pattern, with huge inbound bandwidth from the
fileserver to the IMAP server but very little outbound traffic.)
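To put rough numbers on why a full inbox scan hurts at 1G, here is a back-of-the-envelope sketch. The 2 GB inbox size is purely a hypothetical figure for illustration, not something we measured, and the calculation assumes an ideal link with no NFS or TCP overhead and no competing traffic, so real times would be worse. A 1G link tops out at about 125 Mbytes/sec under those assumptions.

```python
def transfer_seconds(size_bytes: float, link_gbits: float) -> float:
    """Ideal time to move size_bytes over a link of link_gbits Gbit/s.

    Ignores protocol overhead and any other traffic on the link.
    """
    bytes_per_second = link_gbits * 1e9 / 8  # 1 Gbit/s is 125 Mbytes/sec
    return size_bytes / bytes_per_second


# Hypothetical inbox size, chosen only for illustration.
inbox_bytes = 2e9  # 2 GB

for link in (1, 10):
    print(f"{link}G: {transfer_seconds(inbox_bytes, link):.1f} s")
# prints:
# 1G: 16.0 s
# 10G: 1.6 s
```

While a read like that is in flight at 1G, it is taking the entire link, so everyone else's mailbox traffic to the same fileserver has to queue behind it.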
It's less clear how much of a cost we paid for HD-based filesystems, but it seems pretty likely that we paid some cost, especially since our OmniOS fileservers were relatively large (too large, in fact). With lots of filesystems, disks, and pools on each fileserver, it seems likely that there would have been periods where each fileserver could have reached inbound or outbound network bandwidth rates above 1G, if they'd had 10G networking.
(And this excludes backups, where it seems quite likely that 10G would have sped things up somewhat. I don't consider backups as important as regular fileserver NFS traffic because they're less time and latency sensitive.)
At the same time, it's quite possible that this cost was still worth paying in order to use OmniOS back then instead of one of the alternatives. ZFS on Linux was far less mature in 2013 and 2014, and I'm not sure how well FreeBSD would have worked, especially if we insisted on keeping a SAN based design with iSCSI.
(If we had had lots of money, we might have attempted to switch to other 10G networking cards, probably SFP+ ones instead of 10G-T (which would have required switch changes too), or to commission someone to fix up the ixgbe driver, or both. But with no funds for either, it was back to 1G for us and then the whole thing was one part of why we moved away from Illumos.)
Written on 17 May 2019.