ZFS DVA offsets are in bytes, not (512-byte) blocks

August 30, 2022

In ZFS, a DVA (Device Virtual Address) is the equivalent of a block address in a regular filesystem. For our purposes today, the important thing is that a DVA tells you where to find data through a combination of the vdev (as a numeric index), an offset into the vdev, and a size. However, this description leaves a question open: what units are ZFS DVA offsets in? Back when I looked into the details of DVAs in 2017, the code certainly appeared to treat the offset as being in bytes; however, various other sources have sometimes asserted that offsets are in units of 512-byte blocks. Faced with this uncertainty, today I decided to answer the question once and for all with some experimentation.

(One of the 'various sources' for the DVA offset being in 512-byte blocks is the "ZFS On-Disk Specification", copies of which float around on the Internet, sometimes in repositories that bundle the specification with additional information. See section 2.1.)

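Update: This turns out to be wrong (or a misunderstanding). On disk, ZFS DVA offsets are stored as (512-byte) blocks, but tools like zdb print them as byte offsets. See the follow-up entry, "ZFS DVA offsets are in 512-byte blocks on disk but zdb misleads you about them" (ZFSDVAOffsetsInBytesII).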
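As a rough illustration of what the update describes, the two representations differ only by a factor of 512 (a shift of 9 bits). This sketch takes the byte offset that zdb prints in the experiment below and converts it to 512-byte blocks and back. The variable names are made up for illustration, and the shift of 9 is just the 512-byte block size stated in the update.

```shell
# Byte offset as printed by zdb (hex 7ea00 in the experiment below).
zdb_offset=$((0x7ea00))

# Assumption: the on-disk unit is 512-byte blocks, ie a shift of 9 bits.
sector_shift=9

# Convert the byte offset to 512-byte blocks, then back to bytes.
sectors=$((zdb_offset >> sector_shift))
bytes_again=$((sectors << sector_shift))

printf '%d bytes = %d blocks of 512 bytes (0x%x)\n' \
    "$zdb_offset" "$sectors" "$sectors"
```

The round trip is exact here because the byte offset is a multiple of 512.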

I'll start with the more or less full version of the experiment on a file-based ZFS pool.

# truncate --size 100m disk01
# zpool create tank disk01
# vi /tank/TESTFILE
[.. enter more than 512 bytes of text ...]
# sync
# zdb -vv -O tank TESTFILE
Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
     6    1   128K     1K     1K     512     1K  100.00  ZFS plain file
                                            176   bonus  System attributes
Indirect blocks:
  0 L0 0:7ea00:400 400L/400P F=1 B=49/49 cksum=[...]

This file is 1 KB (two 512-byte blocks) on disk (since it's not compressed; I deliberately didn't turn that on), and starts at offset 0x7ea00 aka 518656 in whatever unit that is. Immediately we have one answer; our 'disk01' data file is only 100 MBytes, which is 204800 512-byte blocks, so an offset of 518656 cannot be a plain 512-byte block offset. However, if we just read at byte 518656 (block 1013), we won't find our file data. Per various sources (eg ZFS Raidz Data Walk), there is a 4 MByte (0x400000 bytes) header that we also have to add in. That's 8192 512-byte blocks for the header plus 1013 for the DVA offset, which gives us an offset from the start of the file of 9205 512-byte blocks, so:

# dd if=disk01 bs=512 skip=9205 | sed 5q
line 2 of zfs test file
line 3 of zfs test file
line 4 of zfs test file
line 5 of zfs test file

I've found my test file exactly where it should be.
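The arithmetic above can be redone in the shell as a cross-check. This is only a sketch: it splits the DVA string from the zdb output into its vdev, offset and size fields (both hex), adds the 4 MByte header, and prints the matching dd command. The variable names are made up for illustration, and `count=` is an addition that limits the read to the block's own size.

```shell
# DVA as printed by zdb: vdev:offset:size, with offset and size in hex.
dva="0:7ea00:400"

# Split the three colon-separated fields with plain parameter expansion.
vdev=${dva%%:*}
rest=${dva#*:}
offset_hex=${rest%%:*}
size_hex=${rest#*:}

# The 4 MByte (0x400000 byte) header that precedes the data area.
label_bytes=$((4 * 1024 * 1024))

# Treat the zdb offset as bytes, add the header, convert to 512-byte blocks.
byte_offset=$((0x$offset_hex + label_bytes))
skip=$((byte_offset / 512))
count=$((0x$size_hex / 512))

echo "vdev $vdev: dd if=disk01 bs=512 skip=$skip count=$count"
```

For this DVA the result is skip=9205 and count=2, matching the dd command used above and the two 512-byte blocks the file occupies.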

Just to be sure, I also did this same experiment on our test fileserver, where the ZFS pool uses mirrored disk partitions (instead of a file). The answer is the same; treating the ZFS DVA offset as a byte offset and adding 4 MBytes gets me the right (byte) offset into the disk partition to find the file contents that should be there. I haven't verified it in the code, but I would be very surprised if raidz or draid DVA offsets were any different (although raidz DVA offsets snake across all disks in the raidz).

(This experiment is obviously much harder if you have a dataset with compression turned on. I don't know if there's any easy way to get zdb to decompress a block from standard input. Modern versions of zdb can read ZFS blocks directly, with the -R option, but while useful this doesn't quite help answer the question here. I guess I could have strace'd zdb to see what offset it read the block from.)

(This is one of my ZFS uncertainties that has quietly nagged at me for years but that I can now finally put to bed.)

Written on 30 August 2022.


Last modified: Tue Aug 30 21:48:26 2022