What we do to enable us to grow our ZFS pools over time
In my entry on why ZFS isn't good at growing and reshaping pools, I mentioned that we go to quite some lengths in our ZFS environment to be able to incrementally expand our pools. Today I want to put together all of the pieces of that in one place to discuss what those lengths are.
Our big constraint is that not only do we need to add space to pools over time, but we have a fairly large number of pools and which pools will have space added to them is unpredictable. We need a solution to pool expansion that leaves us with as much flexibility as possible for as long as possible. This pretty much requires being able to expand pools in relatively small increments of space.
The first thing we do, or rather don't do, is that we don't use raidz. Raidz is potentially attractive on SSDs (where the raidz read issue has much less impact), but since you can't expand a raidz vdev, the minimum expansion for a pool using raidz vdevs is at least three or four separate 'disks' to make a new raidz vdev (and in practice you'd normally want to use more than that to reduce the raidz overhead, because a four disk raidz2 vdev is basically a pair of mirrors with slightly more redundancy but more awkward management and some overheads). This requires adding relatively large blocks of space at once, which isn't feasible for us. So we have to do ZFS mirroring instead of the more space efficient raidz.
(A raidz2 vdev is also potentially more resilient than a bunch of mirror vdevs, because you can lose any arbitrary two disks without losing the pool.)
However, plain mirroring of whole disks would still not work for us because that would mean growing pools by relatively large amounts of space at a time (and strongly limit how many pools we can put on a single fileserver). To enable growing pools by smaller increments of space than a whole disk, we partition all of our disks into smaller chunks, currently four chunks on a 2 TB disk, and then do ZFS mirror vdevs using chunks instead of whole disks. This is not how you're normally supposed to set up ZFS pools, and on our older fileservers using HDs over iSCSI it caused visible performance problems if a pool ever used two chunks from the same physical disk. Fortunately those seem to be gone on our new SSD-based fileservers.
Even with all of this we can't necessarily let people expand existing pools by a lot of space, because the fileserver their pool is on may not have enough free space left (especially if we want other pools on that fileserver to still be able to expand). When people buy enough space at once, we generally wind up starting another ZFS pool on a different fileserver, which somewhat cuts against the space flexibility that ZFS offers. People may not have to decide up front how much space they want their filesystems to have, but they may have to figure out which pool a new filesystem should go into and then balance usage across all of their pools (or have us move filesystems).
(Another thing we do is that we sell pool space to people in 1 GB increments, although usually they buy more at once. This is implemented using a pool quota, and of course that means that we don't even necessarily have to grow the pool's space when people buy space; we can just increase the quota.)
Although we can grow pools relatively readily (when we need to), we still have the issue that adding a new vdev to a ZFS pool doesn't rebalance space usage across all of the pool's vdevs; it just mostly writes new data to the new vdev. In a SSD world where seeks are essentially free and we're unlikely to saturate the SSD's data transfer rates on any regular basis, this imbalance probably doesn't matter too much. It does make me wonder if nearly full pool vdevs interact badly with ZFS's issues with coming near quota limits (and a followup).