Some effects of the ZFS DVA format on data layout and growing ZFS pools
One piece of ZFS terminology is DVA and DVAs, which is short for Data Virtual Address. For ZFS, a DVA is the equivalent of a block number in other filesystems; it tells ZFS where to find whatever data we're talking about. The short summary of what fields DVAs have and what they mean is that DVAs tell us how to find blocks by giving us their vdev (by number) and their byte offset into that particular vdev (and then their size). A typical DVA might say that you find what it's talking about on vdev 0 at byte offset 0x53a40ed000. There are some consequences of this that I hadn't really thought about until the other day.
Right away we can see why ZFS has a problem removing a vdev; the vdev's number is burned into every DVA that refers to data on it. If there's no vdev 0 in the pool, ZFS has no idea where to even start looking for data because all addressing is relative to the vdev. ZFS pool shrinking gets around this by adding a translation layer that says where to find the portions of vdev 0 that you care about after it's been removed.
In a mirror vdev, any single disk must be enough by itself to recover all data. Since the DVA simply specifies a byte offset within the vdev, this implies that in ZFS mirror vdevs, all copies of a block are at the same place on each disk, contrary to what I once thought might be the case. If vdev 0 is a mirror vdev, our DVA says that we can find our data at byte offset 0x53a40ed000 on each and every disk.
In a RAID-Z vdev, our data lives across multiple disks (with parity) but we only have the byte offset to its start (and then its size). The first implication of this is that in a RAID-Z vdev, a block is always striped sequentially across your disks at basically the same block offsets. ZFS doesn't find one bit of free space on disk 1, a separate bit on disk 2, a third bit on disk 3, and so on, and join them all together; instead it finds a contiguous stripe of free space starting on some disk, and uses it. This space can be short or long, it doesn't have to start on the first disk in the RAID-Z vdev, and it can wrap around (possibly repeatedly).
(This makes it easier for me to understand why ZFS rounds raidzN write sizes up to multiples of N+1 blocks. Possibly I understood this at some point, but if so I'd forgotten it since.)
Another way to put this is that for RAID-Z vdevs, the DVA vdev byte
addresses snake across all of the vdev's disks in sequence, switching
to a new disk ever
asize bytes. In a vdev with a 4k asize, vdev bytes
0 to 4095 are on the first disk, vdev bytes 4096 to 8191 are on the
the second disk, and so on. The unfortunate implication of this is
that the number of disks in a RAID-Z vdev is an implicit part of the
addresses of data in it. The mapping from vdev byte offset to the disk
and the disk's block where the block's stripe starts depends on how many
disks are in the RAID-Z vdev.
(I'm pretty certain this means that I was wrong in my previous explanation of why ZFS can't allow you to add disks to raidz vdevs. The real problem is not inefficiency in the result, it's that it would blow up your ability to access all data in your vdev.)
ZFS can grow both mirror vdevs and raidz vdevs if you replace the disks with larger ones because in both cases this is just adding more available bytes of space at the top of ZFS's per-vdev byte address range for DVAs. You have to replace all of the disks because in both cases, all disks participate in the addressing. In mirror vdevs this is because you write new data at the same offset into each disk, and in raidz vdevs it's because the addressable space is striped across all disks and you can't have holes in it.
(You can add entire new vdevs because that doesn't change the interpretation of any existing DVAs, since the vdev number is part of the DVA and the byte address is relative to the vdev, not the pool as a whole. This feels obvious right now but I want to write it down for my future self, since someday it probably won't be as clear.)
Why ZFS is not good at growing and reshaping pools (or shrinking them)
I recently read Mark McBride's Five Years of Btrfs (via), which has a significant discussion of why McBride chose Btrfs over ZFS that boils down to ZFS not being very good at evolving your pool structure. You might doubt this judgment from a Btrfs user, so let me say as both a fan of ZFS and a long term user of it that this is unfortunately quite true; ZFS is not a good choice if you want to modify your pool disk layout significantly over time. ZFS works best if the only change in your pools that you do is replacing drives with bigger drives. In our ZFS environment we go to quite some lengths to be able to expand pools incrementally over time, and while this works it both leaves us with unbalanced pools and means that we're basically forced to use mirroring instead of RAIDZ.
(An unbalanced pool is one where some vdevs and disks have much more data than others. This is less of an issue for us now that we're using SSDs instead of HDs.)
You might sensibly ask why ZFS is not good at this, despite being many years old (and people having had this issue with ZFS for a long time). One fundamental reason is that ZFS is philosophically and practically opposed to rewriting existing data on disk; once written, it wants everything to be completely immutable (apart from copying it to replacement disks, and more or less). But any sort of restructuring or re-balancing of a pool of storage (whether ZFS or Btrfs or whatever) necessarily involves shifting data around; data that used to live on this disk must be rewritten so that it now lives on that disk (and all of this has to be kept track of, directly or indirectly). It's rather difficult to have immutable data but mutable storage layouts.
(In the grand tradition of computer science we can sort of solve this problem with a layer of indirection, where the top layer stays immutable but the bottom layer mutates. This is awkward and doesn't entirely satisfy either side, and is in fact how ZFS's relatively new pool shrinking works.)
This is also the simpler approach for ZFS to take. Not having to support reshaping your storage requires less code and less design (for instance, you don't have to figure out how to reliably keep track of how far along a reshaping operation is). Less code also means less bugs, and bugs in reshaping operations can be catastrophic. Since ZFS was not designed to support any real sort of reshaping, adding it would be a lot of work (in both design and code) and raise a lot of questions, which is a good part of why no one has really tackled this for all of the years that ZFS has been around.
(The official party line of ZFS's design is more or less that you should get your storage right the first time around, or to put it another way, that ZFS was designed for locally attached storage where you start out with a fully configured system rather than incrementally expanding to full capacity over time.)
(This is an aspect of how ZFS is not a universal filesystem. Just as ZFS is not good for all workloads, it's not good for all patterns of growth and system evolution.)