Our second generation ZFS fileservers and their setup
We are finally in the process of really migrating to the second generation of our ZFS fileserver setup, so it seems like a good time to write up all of the elements in one place. Our fundamental architecture remains unchanged. That architecture is NFS servers that export filesystems from ZFS pools to our client machines (which are mostly Ubuntu). The ZFS pools are made up of mirrored pairs, where each side of a mirror comes from a separate iSCSI backend. The fileservers and iSCSI backends are interconnected over two separate 'networks', each of which is actually just a single switch.
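To make the mirroring rule concrete, here is a minimal Python sketch of how such a pool layout could be assembled. It is an illustration only, not the tooling actually used here: the chunk names, the helper functions, and the pool name 'tank' are all hypothetical, and the only real command syntax relied on is 'zpool create POOL mirror A B mirror C D ...'.

```python
def chunk_names(backend, disks=14, chunks_per_disk=4):
    """Hypothetical names for every chunk that one backend exports."""
    return [
        f"{backend}-disk{d:02d}-chunk{c}"
        for d in range(disks)
        for c in range(chunks_per_disk)
    ]


def mirror_pairs(side_a, side_b):
    """Pair the i-th chunk of one backend with the i-th chunk of the other,
    so the two sides of every mirror come from different backends."""
    if len(side_a) != len(side_b):
        raise ValueError("both backends must offer the same number of chunks")
    return list(zip(side_a, side_b))


def zpool_create_args(pool, pairs):
    """Build a 'zpool create' argument list with one mirror vdev per pair."""
    args = ["zpool", "create", pool]
    for a, b in pairs:
        args += ["mirror", a, b]
    return args


if __name__ == "__main__":
    pairs = mirror_pairs(chunk_names("backendA"), chunk_names("backendB"))
    # Show a two-mirror pool built from the first two pairs.
    print(" ".join(zpool_create_args("tank", pairs[:2])))
```

Pairing by index is just one simple choice; the property that matters is that no mirror ever has both sides on the same backend, so losing an entire backend leaves every mirror with one working side.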
The actual hardware involved is unchanged from our basically finalized hardware; both fileservers and backends use SuperMicro motherboards with 2x 10G-T onboard, in SuperMicro 16+2 drive bay cases. The iSCSI networks run over the motherboard 10G-T ports, and the fileservers also have a dual Intel 10G-T card for their primary network connection so we can do 10G NFS to them. Standard backends have 14 2TB WD Se drives for iSCSI (the remaining two data slots may someday be used for ZFS ZIL SSDs). One set of two backends (and a fileserver) is for special SSD based pools, so they have some number of SSDs instead.
On the fileservers, we're running OmniOS (currently r151010j) in an overall setup that is essentially identical to our old S10U8 fileservers (including our hand rolled spares system). On the iSCSI backends we're running CentOS 7 after deciding that we didn't like Ubuntu 14.04. Although CentOS 7 comes with its own iSCSI target software, we decided to carry on using IET, the same software we use on our old backends; there just didn't seem to be any compelling reason to switch.
As before, we have sliced up the 2TB data disks into standard sized chunks. We decided to make our lives simple and have only four chunks on each 2TB disk, which means that they're about twice as big as our old chunk size. The ZFS 4K sector disk problem means that we have to create new pools and migrate all data anyways, so this difference in chunk size between the old and the new fileservers doesn't cause us any particular problems.
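For a rough sense of the numbers, here is a back-of-the-envelope sketch in Python. It assumes a vendor-decimal 2TB (2 * 10^12 bytes), four equal chunks, and all 14 data disks of both backends in use, and it ignores partitioning and ZFS overhead, so the real figures will be somewhat smaller.

```python
DISK_BYTES = 2 * 10**12      # assumption: "2TB" in decimal, as vendors count it
CHUNKS_PER_DISK = 4          # from the text above
DISKS_PER_BACKEND = 14       # data disks in a standard backend

# Size of one chunk and number of chunks one backend can export.
chunk_bytes = DISK_BYTES // CHUNKS_PER_DISK
chunks_per_backend = DISKS_PER_BACKEND * CHUNKS_PER_DISK

# Each mirror takes one chunk from each of a fileserver's two backends,
# so the mirrored space per fileserver is one backend's worth of chunks.
mirrored_bytes = chunks_per_backend * chunk_bytes

if __name__ == "__main__":
    print(f"chunk size: {chunk_bytes / 10**9:.0f} GB ({chunk_bytes / 2**30:.1f} GiB)")
    print(f"chunks per backend: {chunks_per_backend}")
    print(f"max mirrored space per fileserver: {mirrored_bytes / 10**12:.0f} TB")
```

Under these assumptions a chunk is about 500 GB, which would put the old chunk size at roughly 250 GB.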
Also as before, each fileserver is using a different set of two backends to draw its disks from; we don't have or plan any cases where two fileservers use disks from the same backend. This assignment is just a convention, as all fileservers can see all backends and we're not attempting to do any sort of storage fencing; even though we're still not planning any failover, fencing still feels like it would add too much complexity and too many potential problems.
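Since nothing enforces this convention, it has to be checked some other way. The following is a small Python sketch of such a check, assuming the assignment is written down as a simple mapping; the fileserver and backend names and the function are hypothetical, not part of any existing tool.

```python
from collections import Counter


def shared_backends(assignment):
    """Return the backends that more than one fileserver draws disks from.

    'assignment' maps a fileserver name to the backends it uses.
    An empty result means the convention holds.
    """
    counts = Counter(
        backend for backends in assignment.values() for backend in backends
    )
    return sorted(backend for backend, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Hypothetical names, purely for illustration.
    assignment = {
        "fs1": ("backend1", "backend2"),
        "fs2": ("backend3", "backend4"),
        "fs3": ("backend5", "backend6"),
    }
    problems = shared_backends(assignment)
    print("convention holds" if not problems else f"shared: {problems}")
```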
In the end we went for a full scale replacement of our existing environment: three production fileservers and six production backends with HDs, one production fileserver and two production backends with SSDs, one hot spare fileserver and backend, and a testing environment of one fileserver and two fully configured backends. To save you the math, that's six fileservers, eleven backends, and 126 2TB WD Se disks. We also have three 10G-T switches (plus a fourth as a spare), two for the iSCSI networks and the third as our new top level 10G switch on our main machine room network.
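The totals can be checked with a few lines of Python. The table simply restates the groups listed above. The one assumption is that the hot spare and testing backends each carry the standard 14 2TB disks, which is what the total of 126 implies; the SSD backends are counted as having no 2TB disks.

```python
# (group, fileservers, backends, 2TB disks per backend)
GROUPS = [
    ("production, HDs", 3, 6, 14),
    ("production, SSDs", 1, 2, 0),   # SSD backends hold no 2TB disks
    ("hot spare", 1, 1, 14),         # assumption: standard 14-disk backend
    ("testing", 1, 2, 14),           # "fully configured" backends
]

total_fileservers = sum(fs for _, fs, _, _ in GROUPS)
total_backends = sum(be for _, _, be, _ in GROUPS)
total_disks = sum(be * disks for _, _, be, disks in GROUPS)

if __name__ == "__main__":
    print(total_fileservers, "fileservers,",
          total_backends, "backends,",
          total_disks, "2TB disks")
```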
(In the long run we expect to add some number of L2ARC SSDs to the fileservers and some number of ZFS ZIL SSDs to the backends, but we haven't even started any experimentation with this to see how we want to do it and how much benefit it might give us. Our first priority has been building out the basic core fileserver and backend setup. We definitely plan to add an L2ARC for one pool, though.)