Some effects of the ZFS DVA format on data layout and growing ZFS pools

January 29, 2020

One piece of ZFS terminology is DVA and DVAs, which is short for Data Virtual Address. For ZFS, a DVA is the equivalent of a block number in other filesystems; it tells ZFS where to find whatever data we're talking about. The short summary of what fields DVAs have and what they mean is that DVAs tell us how to find blocks by giving us their vdev (by number) and their byte offset into that particular vdev (and then their size). A typical DVA might say that you find what it's talking about on vdev 0 at byte offset 0x53a40ed000. There are some consequences of this that I hadn't really thought about until the other day.

Right away we can see why ZFS has a problem removing a vdev; the vdev's number is burned into every DVA that refers to data on it. If there's no vdev 0 in the pool, ZFS has no idea where to even start looking for data because all addressing is relative to the vdev. ZFS pool shrinking gets around this by adding a translation layer that says where to find the portions of vdev 0 that you care about after it's been removed.

In a mirror vdev, any single disk must be enough by itself to recover all data. Since the DVA simply specifies a byte offset within the vdev, this implies that in ZFS mirror vdevs, all copies of a block are at the same place on each disk, contrary to what I once thought might be the case. If vdev 0 is a mirror vdev, our DVA says that we can find our data at byte offset 0x53a40ed000 on each and every disk.

In a RAID-Z vdev, our data lives across multiple disks (with parity) but we only have the byte offset to its start (and then its size). The first implication of this is that in a RAID-Z vdev, a block is always striped sequentially across your disks at basically the same block offsets. ZFS doesn't find one bit of free space on disk 1, a separate bit on disk 2, a third bit on disk 3, and so on, and join them all together; instead it finds a contiguous stripe of free space starting on some disk, and uses it. This space can be short or long, it doesn't have to start on the first disk in the RAID-Z vdev, and it can wrap around (possibly repeatedly).

(This makes it easier for me to understand why ZFS rounds raidzN write sizes up to multiples of N+1 blocks. Possibly I understood this at some point, but if so I'd forgotten it since.)

Another way to put this is that for RAID-Z vdevs, the DVA vdev byte addresses snake across all of the vdev's disks in sequence, switching to a new disk ever asize bytes. In a vdev with a 4k asize, vdev bytes 0 to 4095 are on the first disk, vdev bytes 4096 to 8191 are on the the second disk, and so on. The unfortunate implication of this is that the number of disks in a RAID-Z vdev is an implicit part of the addresses of data in it. The mapping from vdev byte offset to the disk and the disk's block where the block's stripe starts depends on how many disks are in the RAID-Z vdev.

(I'm pretty certain this means that I was wrong in my previous explanation of why ZFS can't allow you to add disks to raidz vdevs. The real problem is not inefficiency in the result, it's that it would blow up your ability to access all data in your vdev.)

ZFS can grow both mirror vdevs and raidz vdevs if you replace the disks with larger ones because in both cases this is just adding more available bytes of space at the top of ZFS's per-vdev byte address range for DVAs. You have to replace all of the disks because in both cases, all disks participate in the addressing. In mirror vdevs this is because you write new data at the same offset into each disk, and in raidz vdevs it's because the addressable space is striped across all disks and you can't have holes in it.

(You can add entire new vdevs because that doesn't change the interpretation of any existing DVAs, since the vdev number is part of the DVA and the byte address is relative to the vdev, not the pool as a whole. This feels obvious right now but I want to write it down for my future self, since someday it probably won't be as clear.)

Comments on this page:

You may be interested in Max Bruning's "ZFS On-Disk Data Walk" presentation from the 2008 OpenSolaris Developer Conference

Video (potato-ish quality):

See also Kirk McKusick at the 2015 OpenZFS Conference with "ZFS Internals Overview":

He co-wrote The Design and Implementation of the FreeBSD Operating System 2e, and chapter ten covers ZFS.

By scineram at 2020-02-07 03:43:59:

But you will be able to add disks to raidz soon. Hopefully next year.

Written on 29 January 2020.
« Why ZFS is not good at growing and reshaping pools (or shrinking them)
Some notes on Python's email.header.decode_header() »

Page tools: View Source, View Normal, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Wed Jan 29 22:41:19 2020
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.