2018-12-30
Our third generation ZFS fileservers and their setup
Our third generation of ZFS-based NFS fileservers is now in production (although we've only started to migrate filesystems to them from our current OmniOS fileservers), so it's a good time to write up all of their elements in one place. At one level our third generation is much like our first and second generation, with basically the same ZFS setup, but at another level they're very different. Put simply, the first two generations used Solaris and then OmniOS with an iSCSI SAN, while this generation uses Linux with ZFS on Linux and strictly local disks (and the disks are now SATA SSDs).
(This shift from SAN-based HDs to local SSDs seems likely to have performance effects that change some of our assumptions, but that's going to take operational experience to really understand.)
The physical hardware is as I described earlier in more detail here. We're using SuperMicro X11SPH-nCTF motherboards with Xeon Silver 4108 CPUs and 192 GB of RAM, in SuperMicro SC 213AC-R920LPB cases. We're driving the 16 data disks through 8x SAS and 8x SATA on the motherboard, with no need for addon cards this time around. For networking, we use one of the onboard 10G-T ports. It's a nice change to not have to worry about various addon cards, although it means that if we have a hardware fault we've probably lost an entire motherboard.
The general architecture and the administrative level have remained basically the same. We're still NFS exporting filesystems from (multiple) ZFS pools, with each ZFS pool made up of mirrored pairs of disk partitions in a standard size. Since we used 2 TB HDs in the second generation and we're using 2 TB Crucial SSDs now, we stayed with four partitions per disk and the partitions are exactly the same size as before (although the pools probably wind up with a few MB of extra space, because there are fewer levels of things involved now). We've determined that it's now safe to allow a pool to use more than one partition on a single disk, although we still try to avoid it. For relatively arbitrary reasons our standard mirrored pairs of partitions always use one SAS-connected disk and its twin SATA-connected disk; we have to pick pairs somehow, and this distributes them around the chassis physically.
(Since both the SAS and the SATA controllers are on the motherboard, we don't really expect to see one fail in isolation, so we're not trying to buy ourselves safety from a controller failure. We may be buying some safety from controller driver bugs, though we hope that there aren't any.)
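As a concrete illustration of the pool layout (a sketch only; the pool name and device names below are made up, and in practice we use stable /dev/disk/by-id names), a pool built from two of our standard mirrored pairs, each pairing a partition on a SAS-connected disk with the same partition on its twin SATA-connected disk, would be created along these lines:

    # sketch: pool and device names are placeholders
    zpool create -o ashift=12 tank \
        mirror /dev/disk/by-id/scsi-SASDISK0-part1 /dev/disk/by-id/ata-SATADISK0-part1 \
        mirror /dev/disk/by-id/scsi-SASDISK1-part1 /dev/disk/by-id/ata-SATADISK1-part1

Growing a pool later is just a matter of adding more standard-sized mirrored pairs with 'zpool add <pool> mirror ...'.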
The fileservers run Ubuntu 18.04 LTS with the Ubuntu supplied version of ZFS on Linux, because this is the least effort approach and it works well enough. We have tweaked things to increase the ZFS ARC size above the default of 50% of RAM, since leaving 96 GB basically unused would be wasteful; our current approach is to leave 32 GB free for non-ARC things. We've carefully frozen the versions of the kernel and a few other crucial packages that they run, and we won't be upgrading them unless we really have to (for instance, if a new version contains an important ZFS on Linux bugfix). We do this because we don't trust kernel upgrades to not introduce new problems, and qualifying these machines is hard; we really only find out for sure when we run them in production, and problems are very disruptive. We will probably not upgrade the fileservers to Ubuntu 20.04 LTS when it comes out; we might upgrade to 22.04, or we might actually get to build out a fourth generation of fileservers on new hardware by then.
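As a hedged sketch of what this looks like on Ubuntu (the exact byte count and package list here are illustrative, not necessarily what we actually use), the ARC cap is a ZFS module parameter and the version freeze can be done with apt-mark:

    # /etc/modprobe.d/zfs.conf: cap the ARC at roughly 160 GiB,
    # ie 192 GB of RAM minus the 32 GB we leave for everything else
    options zfs zfs_arc_max=171798691840

    # freeze the kernel and ZFS packages at their current versions
    apt-mark hold linux-image-generic linux-headers-generic zfsutils-linux zfs-zed

(The module option only takes effect when the module loads; zfs_arc_max can also be changed on a live system through /sys/module/zfs/parameters/zfs_arc_max.)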
(If CentOS 8 had come out early this summer, we might have used it instead of Ubuntu LTS, but we very much were not prepared to use RHEL/CentOS 7 for the lifetime of these fileservers.)
Operationally, these new Linux fileservers have reused basically all of the local software from our current OmniOS fileservers. Some of it needed minor adaptation to cope with no longer having iSCSI backends, but the actual local commands have stayed the same and much of the code is shared or essentially the same. This includes our local programs to make managing ZFS pools less dangerous, and even our hand-rolled ZFS spares system, which is now partially activated through hooking into ZED, the (Linux) ZFS Event Daemon. Some things required some adaptation, for instance the portion of our filesystem-managing command that deals with NFS exports, and we had to build a completely new Linux-specific system for custom NFS mount authentication.
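To give a feel for the ZED hooking (the file and helper names here are hypothetical, and our real spares system is rather more involved), a zedlet is simply an executable script dropped into /etc/zfs/zed.d/ whose name starts with the event class it cares about; ZED hands it the event details in ZEVENT_* environment variables:

    #!/bin/sh
    # /etc/zfs/zed.d/statechange-spares.sh (hypothetical zedlet)
    # Runs on every 'statechange' event; we only care about vdevs going FAULTED.
    [ "${ZEVENT_VDEV_STATE_STR}" = "FAULTED" ] || exit 0
    # Hand the pool and the failed vdev's GUID to our (hypothetical) spares script.
    exec /local/sbin/zfs-activate-spare "${ZEVENT_POOL}" "${ZEVENT_VDEV_GUID}"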
(Unlike our regular Ubuntu machines, which synchronize time through a crontab entry that runs ntpdate, our fileservers run chrony. The justification for this does not fit in this entry.)
We currently have six fully populated third generation fileservers in production. Well, technically one of the six has yet to have any filesystems migrated to it, but we still consider it to be in production. There's a seventh unit that is currently a cold spare but which may be used for expansion in the future, and an initial test unit (with only 128 GB of RAM) that's being used for something else at the moment and in the future will probably be a test unit again. This is actually fewer servers in total than the current OmniOS fileservers and their iSCSI backends, but we have more RAM in total, especially in the fileservers.
It will be some time before we have all filesystems migrated over from the current OmniOS fileservers and their iSCSI backends (I'll be happy if we can manage it in six months). Once we free up that hardware, we're likely to reuse it for something, since it remains perfectly good hardware. In particular, it's perfectly good hardware with 10G-T Ethernet, reasonable amounts of memory, and lots of disk bays that we trust.
(Our first generation hardware didn't really get reused, but that was because we were pretty tired of eSATA by the time it got replaced and the inexpensive eSATA chassis were starting to accumulate various hardware issues. All of the SuperMicro hardware in our second generation has held up much better so far.)
PS: Given how we sell storage these days, dividing up the disks into partitions is mostly an issue of administrative convenience and wasting relatively little space for small pools. Going with exactly the same size for partitions makes our lives easier when we're creating pools on the new fileservers, since we can just use the same number of mirrored pairs as the pools have on our current OmniOS fileservers.
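For what it's worth, carving a new disk into our four standard partitions is a quick one-time job per disk. A rough sketch (the device name and the partition size here are made up; our real size is simply inherited from the second generation layout):

    # sketch: four equal partitions on a nominally 2 TB SSD
    DISK=/dev/disk/by-id/ata-EXAMPLE_SSD_SERIAL
    for n in 1 2 3 4; do
        sgdisk --new=${n}:0:+465GiB --typecode=${n}:BF01 "$DISK"
    done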