2018-07-04
How and why we sell storage to people here
As a university department with a centralized fileserver environment plus a wide variety of professors and research groups, we have a space allocation problem. Namely, we need some way to answer the question of who gets how much space, especially in the face of uneven grant funding levels. Our historical and current answer is that we allocate space by selling it to people for a fixed one-time cost (for various reasons we can't give people free space). People can have as much space as they're willing to pay for; if they want so much space that we run out of currently available but not allocated space, we'll buy more hardware to meet the demand (and we'll be very happy about it, because we've historically had plenty of unsold space).
In the very old days that were mostly before my time here, our fileserver environment used Solaris DiskSuite on fixed-size partitions carved out from 'hardware RAID' FibreChannel controllers in a SAN setup. In this environment, one partition was one filesystem, and that was the unit we sold; if you wanted more storage space, my memory is that you had to put it in another filesystem whether you liked that or not, and obviously this meant that you had to effectively pre-allocate your space among your filesystems.
Our first generation ZFS fileserver environment followed this basic pattern but with some ZFS flexibility added on top. Our iSCSI backends exported standard-sized partitions as individual LUNs, which we called chunks, and some number of mirrored pairs of chunks were put together as a ZFS pool that belonged to one professor or group (which led to us needing many ZFS pools). We had to split up disks into multiple chunks partly because not doing so would have been far too wasteful; we started out with 750 GB Seagate disks and many professors or groups had bought less total space than that. We also wanted people to be able to buy more space without facing a very large bill, which meant that the chunk size had to be relatively modest (since we only sold whole chunks). We carried this basic chunk-based space allocation model forward into our second generation of ZFS fileservers, which was part of why we had to do a major storage migration for this shift.
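As a rough sketch of the mechanics (the pool and device names here are made up; the real chunk LUNs had much less friendly names), a pool built from two mirrored pairs of chunks and later grown by another pair would look something like this:

  zpool create examplepool mirror c1t10d0 c2t10d0 mirror c1t11d0 c2t11d0
  # later, when the owner buys another chunk's worth of space:
  zpool add examplepool mirror c1t12d0 c2t12d0

Growing a pool this way is part of why buying more space in the old model always meant buying at least a whole extra chunk.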
Then, well, we changed our minds, where I actually mean that our director worked out how to do things better. Rather than forcing people to buy an entire chunk's worth of space at once, we've moved to simply selling them space in 1 GB units; professors can buy 10 GB, 100 GB, 300 GB, 1000 GB, or whatever they need or want. ZFS pools are still put together from standard-sized chunks of storage, but that's now an implementation detail that only we care about; when you buy some amount of space, we make sure your pool has enough chunks to cover that space. We use ZFS quotas (on the root of each pool) to limit how much space in the pool can actually be used, which was actually something we'd done from the very beginning (our ZFS pool chunk size was much larger than our old FC SAN standard partition size, so some people got limited in the conversion).
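To make the mechanics concrete (the pool name and the numbers here are invented for illustration), selling a professor 300 GB and then later another 100 GB basically comes down to setting and then raising the quota on the pool's root filesystem:

  zfs set quota=300g fs1-prof-01
  # later, after they buy another 100 GB:
  zfs set quota=400g fs1-prof-01
  # how much is in use versus the quota:
  zfs list -o name,used,avail,quota fs1-prof-01

The pool itself can have more raw space than the quota allows; the quota is what turns chunk-sized pools into space that we can sell in 1 GB units.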
This shift to selling in 1 GB units is now a few years old and has proven reasonably popular; we've had a decent number of people buy both small and large amounts of space, certainly more than were buying chunks before (possibly because the decisions are easier). I suspect that it's also easier to explain to people, and certainly it's clear what a professor gets for their money. My guess is that being able to buy very small amounts of space (eg 50 GB) to meet some immediate and clear need also helps.
(Professors and research groups that have special needs and their own grant funding can buy their own hardware and have their Point of Contact run it for them in their sandbox. There have been a fairly wide variety of such fileservers over the years.)
PS: There are some obvious issues with our general approach, but there are also equal or worse issues with the alternative approaches available in our environment.
The hardware and basic setup for our third generation of ZFS fileservers
As I mentioned back in December, we are slowly working on the design and build-out of our next (third) generation of ZFS NFS fileservers, to replace the current generation, which dates from 2014. Things have happened a little sooner than I was expecting us to manage, but the basic reason for that is that we temporarily had some money. At this point we have actually bought all the hardware and more or less planned out the design of the new environment (assuming that nothing goes wrong on the software side), so today I'm going to run down the hardware and the basic setup.
After our quite positive experience with the hardware of our second generation fileservers, we have opted to go with more SuperMicro servers. Specifically, we're using SuperMicro X11SPH-nCTF motherboards with Xeon Silver 4108 CPUs and 192 GB of RAM (our first test server has 128 GB for obscure reasons). This time around we're not using any add-on cards, as the motherboard has just enough disk ports and some 10G-T Ethernet ports, which is all that we need.
(The X11SPH-nCTF has an odd mix of disk ports; 8x SAS on one PCIe controller, 8x SATA on another PCIe controller, and an additional 2x SATA on a third. The two 8x setups use high-density connectors; the third 2x SATA has two individual ports.)
All of this goes in a 2U SuperMicro SC 213AC-R920LPB case, which gives us 16 hot-swappable 2.5" front disk bays. This isn't quite enough disk bays for us, so we've augmented the case with what SuperMicro calls the CSE-M14TQC mobile rack; this goes in an otherwise empty space on the front and gives us an additional four 2.5" disk bays (only two of which we can wire up). We're using the 'mobile rack' disk bays for the system disks and the proper sixteen 2.5" front bays for data disks.
(Our old 3U SC 836BA-R920B cases have two extra 2.5" system disk bays on the back, so they didn't need the mobile rack hack.)
For disks, this time around we're going all SSD for the ZFS data disks, using 2 TB Crucial SSDs in a mix of MX300s and MX500s. We don't have any strong reason to go with Crucial SSDs other than that they're about the least expensive option that we trusted; we have a mix because we didn't buy all our SSDs at once and then Crucial replaced the MX300s with the MX500s. Each fileserver will be fully loaded with 16 data SSDs (and two system SSDs).
(We're not going to be using any sort of SAN, so moving up to 16 disks in a single fileserver is still giving us the smaller fileservers we want. Our current fileservers have access to 12 mirrored pairs of 2 TB disks; these third generation fileservers will have access to only 8 mirrored pairs.)
This time around I've lost track of how many of these servers we've bought. It's not as many as we have in our current generation of fileservers, though, because this time around we don't need three machines to provide those 12 mirrored pairs of disks (a fileserver and two iSCSI backends); instead we can provide them with one and a half machines.
Sidebar: On the KVM over IP on the X11SPH-nCTF
The current IPMI firmware that we have still has a Java-based KVM over IP, but at least this generation works with the open source IcedTea Java I have on Fedora 27 (the past generation didn't). I've heard rumours that SuperMicro may have an HTML5 KVM over IP either in an updated firmware for these motherboards or in more recent motherboards, but so far I haven't found any solid evidence of that. It sure would be nice, though. Java is kind of long in the tooth here.
(Maybe there is a magic setting somewhere, or maybe the IPMI's little web server doesn't think my browser is HTML5 capable enough.)