Some details of ZFS DVAs and what some of their fields store

December 30, 2017

One piece of ZFS terminology is the DVA, which is short for Data Virtual Address. For ZFS, a DVA is the equivalent of a block number in other filesystems; it tells ZFS where to find whatever data we're talking about. DVAs are generally embedded into 'block pointers', and you can find a big comment laying out the entire structure of all of this in spa.h. The two fields of a DVA that I'm interested in today are the vdev and the offset.

(The other three fields are a reserved field called GRID, a bit to say whether the DVA is for a gang block, and asize, the allocated size of the block on its vdev. The allocated size has to be a per-DVA field for various reasons. The logical size of the block and its physical size after various sorts of compression are not DVA or vdev dependent, so they're part of the overall block pointer.)

The vdev field of a DVA is straightforward; it is the index of the vdev that the block is on, starting from zero for the first vdev and counting up. Note that this is not the GUID of the vdev involved, which is what you might expect given a comment that calls it the 'virtual device ID'. Using the index means that ZFS can never shuffle the order of vdevs inside a pool, since these indexes are burned into DVAs stored on disk (as far as I know, and this matches what zdb prints).
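To make the vdev index concrete, here is a minimal sketch of pulling the fields out of a DVA as zdb prints it. It assumes the printed form is `<vdev:offset:asize>` with the vdev in decimal and the other two fields in hex; the `DVA` tuple and `parse_dva` are names invented for this illustration, not anything from ZFS itself.

```python
import re
from typing import NamedTuple

class DVA(NamedTuple):
    vdev: int    # index of the vdev within the pool, counting from 0
    offset: int  # offset of the block within that vdev
    asize: int   # allocated size of the block on that vdev

# Assumed format: <vdev:offset:asize>, vdev in decimal, offset and asize in hex.
_DVA_RE = re.compile(r"<(\d+):([0-9a-fA-F]+):([0-9a-fA-F]+)>")

def parse_dva(text: str) -> DVA:
    """Extract the first DVA found in a line of zdb-style output."""
    m = _DVA_RE.search(text)
    if m is None:
        raise ValueError(f"no DVA found in {text!r}")
    return DVA(int(m.group(1)), int(m.group(2), 16), int(m.group(3), 16))

dva = parse_dva("DVA[0]=<1:1a000:2000>")
print(dva.vdev, hex(dva.offset), hex(dva.asize))
```

The first number is the vdev index, so a leading `1` here would mean the second vdev in the pool, not any sort of GUID.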

The offset field tells you where to find the start of the block on the vdev in question. Because this is an offset into the vdev, not a device, different sorts of vdevs have different ways of translating this into specific disk addresses. Specifically, RAID-Z vdevs must generally translate a single incoming IO at a single offset to the offsets on multiple underlying disk devices for multiple IOs.
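As an illustration of that translation, here is a simplified sketch of how a RAID-Z vdev might spread one IO across its disks. It is loosely modeled on my understanding of the striping arithmetic and leaves out a lot (padding, parity placement details, error handling); `raidz_columns` and all of its parameter names are made up for this example, and it assumes the offset and size are byte counts that are multiples of `1 << ashift`.

```python
def raidz_columns(offset, size, ndisks, nparity, ashift):
    """Return (disk_index, disk_offset, byte_count) for each column of one IO."""
    sector = 1 << ashift
    b = offset >> ashift             # starting sector number within the vdev
    s = size >> ashift               # number of data sectors in the IO
    first = b % ndisks               # disk that gets the first column
    base = (b // ndisks) << ashift   # byte offset of that row on each disk
    ndata = ndisks - nparity
    q, r = divmod(s, ndata)          # full rows, leftover data sectors
    bigcols = 0 if r == 0 else r + nparity  # columns holding one extra sector
    ncols = ndisks if q > 0 else bigcols
    cols = []
    for c in range(ncols):
        disk, disk_off = first + c, base
        if disk >= ndisks:           # wrapped around: next row on the early disks
            disk -= ndisks
            disk_off += sector
        nsect = q + 1 if c < bigcols else q
        cols.append((disk, disk_off, nsect << ashift))
    return cols

# A 12 KiB write at vdev offset 0x5000 on a 4-disk, single-parity vdev.
for disk, disk_off, nbytes in raidz_columns(0x5000, 0x3000, 4, 1, 12):
    print(disk, hex(disk_off), hex(nbytes))
```

The point is only that a single vdev-level offset fans out into several different per-disk offsets, which a plain disk or mirror vdev never has to do.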

At this point we arrive at an interesting question, namely what units the offset is in (since there are a bunch of possible options). As far as I can tell from looking at the ZFS kernel source code, the answer is that the DVA offset is in bytes. Some sources say that it's in 512-byte sectors, but as far as I can tell this is not correct (and it's certainly not in larger units, such as blocks of the size set by the vdev's ashift).

(This doesn't restrict the size of vdevs in any important way, since the offset is a 63-bit field.)
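For a sense of scale, here is the arithmetic, taking the offset to be a byte count as described above. The variable names are just for this example.

```python
OFFSET_BITS = 63                      # width of the DVA offset field
max_vdev_bytes = 1 << OFFSET_BITS     # largest addressable span, in bytes
max_vdev_eib = max_vdev_bytes // 2**60  # converted to exbibytes
print(max_vdev_eib, "EiB")
```

That works out to 8 EiB per vdev, which is not something anyone is going to run into with real disks.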

One potentially important consequence of this is that DVA offsets are independent of the sector size of the underlying disks in vdevs. Provided that your vdev ashift is large enough, it doesn't matter if you use disks with 512-byte logical sectors or the generally rarer disks with real 4k sectors (both physical and logical), and you can replace one with the other. Well, in theory, as there may be other bits of ZFS that choke on this (I don't know if ZFS's disk labels care, for example). But DVAs won't, which means that almost everything in the pool (metadata and data both) should be fine.
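A small sketch of what 'large enough' means here. It assumes that every allocation in a vdev starts at a multiple of `1 << ashift` bytes, so a disk is usable if its logical sector size evenly divides that alignment; `ashift_covers` is a hypothetical helper, not a ZFS function.

```python
def ashift_covers(ashift: int, sector_size: int) -> bool:
    """True if offsets aligned to 1 << ashift bytes are also sector-aligned."""
    return (1 << ashift) % sector_size == 0

for ashift in (9, 12):
    for sector_size in (512, 4096):
        print(ashift, sector_size, ashift_covers(ashift, sector_size))
```

Under that assumption, an ashift of 12 is fine with both 512-byte and 4k logical sectors, while an ashift of 9 only works with 512-byte ones.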

PS: There are additional complications for ZFS gang blocks and so on, but I'm omitting them in the interests of keeping this manageable.
