ZFS DVA offsets are in 512-byte blocks on disk but zdb misleads you about them

August 31, 2022

Yesterday I asserted that ZFS DVA offsets were in bytes, based primarily on using zdb to dump a znode and then read a data block using the offset that zdb printed. Over on Twitter, Matthew Ahrens corrected my misunderstanding:

The offset is stored on disk as a multiple of 512, see the DVA_GET_OFFSET() macro, which passes shift=SPA_MINBLOCKSHIFT=9. For human convenience, the DVA is printed in bytes (e.g. by zdb). So the on-disk format can handle up to 2^72 bytes (4 ZiB) per vdev.

... but the current software doesn't handle more than 2^64 bytes (16 EiB).

That is to say, when zdb prints ZFS DVAs it is not showing you the actual on-disk representation, or a lightly decoded version of it; instead the offset is silently converted from its on-disk form of 512-byte blocks to a version in bytes. I think that this is also true of other pieces of ZFS code that print DVAs as part of diagnostics, kernel messages, and so on. Based on lightly reading the code, I believe that the size of the DVA is also recorded on disk in 512-byte blocks, because zdb and other things use a similar C macro (DVA_GET_ASIZE()) when printing it.

(Both macros are #define'd in include/sys/spa.h.)
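As a minimal sketch of the conversion described above (not the actual OpenZFS code, which is C macros), the on-disk value is a count of 512-byte blocks and the byte value is that count shifted left by SPA_MINBLOCKSHIFT. The function names here are hypothetical ones of my own; only the shift of 9 comes from the quoted explanation.

```python
# Shift used for DVA offsets, per the quoted explanation (2**9 == 512).
SPA_MINBLOCKSHIFT = 9


def blocks_to_bytes(on_disk_value: int) -> int:
    """Convert an on-disk count of 512-byte blocks into bytes.

    Hypothetical helper; it mirrors the effect of decoding with shift=9.
    """
    return on_disk_value << SPA_MINBLOCKSHIFT


def bytes_to_blocks(byte_value: int) -> int:
    """Convert a byte value (as zdb prints it) back to 512-byte blocks."""
    if byte_value % (1 << SPA_MINBLOCKSHIFT) != 0:
        raise ValueError("byte value is not a multiple of 512")
    return byte_value >> SPA_MINBLOCKSHIFT
```

Going from what zdb prints back to the on-disk form is just the reverse shift, which only works for multiples of 512.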

So, to summarize: on disk, ZFS DVA offsets are in units of 512-byte blocks, with offset 0 (on each disk) starting after a 4 Mbyte header. However, zdb prints offsets (and sizes) in hexadecimal as bytes, not in their on-disk units of 512-byte blocks, as (probably) do other things. If zdb says that a given DVA is '0:7ea00:400', that is a byte offset of 518656 (0x7ea00) and a size of 1024 (0x400) bytes; zdb has decoded both of these for you from their on-disk form. If a kernel message talks about DVA '0:7ea00:400', it's most likely also using byte offsets, as zdb does.
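To make the example concrete, here is a small sketch that takes a DVA in the 'vdev:offset:size' form that zdb prints and works out both the byte values and the on-disk 512-byte block counts. The names are my own, and I'm assuming the vdev number is printed in decimal and the other two fields in hexadecimal. The 'raw disk offset' part assumes a simple single-disk vdev where DVA offset 0 sits right after the 4 Mbyte header; it is not meant to cover raidz.

```python
from typing import NamedTuple

SPA_MINBLOCKSHIFT = 9                  # 512-byte blocks on disk
HEADER_BYTES = 4 * 1024 * 1024         # the 4 Mbyte header before offset 0


class PrintedDVA(NamedTuple):
    """A DVA as zdb prints it, with offset and size in bytes."""
    vdev: int
    offset_bytes: int
    size_bytes: int


def parse_printed_dva(text: str) -> PrintedDVA:
    """Parse 'vdev:offset:size'; offset and size are hexadecimal bytes."""
    vdev, offset, size = text.split(":")
    return PrintedDVA(int(vdev), int(offset, 16), int(size, 16))


def on_disk_units(dva: PrintedDVA) -> tuple[int, int]:
    """Return (offset, size) as counts of 512-byte blocks."""
    return (dva.offset_bytes >> SPA_MINBLOCKSHIFT,
            dva.size_bytes >> SPA_MINBLOCKSHIFT)


def raw_disk_offset(dva: PrintedDVA) -> int:
    """Byte position on the disk, assuming a plain single-disk vdev."""
    return HEADER_BYTES + dva.offset_bytes


dva = parse_printed_dva("0:7ea00:400")
print(dva.offset_bytes, dva.size_bytes)   # 518656 1024
print(on_disk_units(dva))                 # (1013, 2)
print(raw_disk_offset(dva))               # 4712960
```

So the on-disk DVA for this block would hold an offset of 1013 (0x3f5) blocks and a size of 2 blocks, even though you never see those numbers in zdb output.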

These DVA block offsets are always in 512-byte blocks. The 'block size' of the offset is fixed, and doesn't depend on the physical block size of the disk, the logical block size of the disk, or the ashift of the vdev. Since 512 bytes is the block size for the minimum ashift, ZFS will never have to assign finer-grained addresses than that, even if it's somehow dealing with a disk or other storage with smaller blocks. This makes using 512-byte 'blocks' completely safe. That the DVA offsets are in blocks is, in a sense, mostly a way of increasing how large a vdev can be (adding nine bits of size).

(This is not as crazy a concern as you might think, since DVA offsets in a raidz vdev cover the entire (raw) disk space of the vdev. If you want to allow, say, 32 disk raidz vdevs without size limits, the disk size limit is 1/32nd of the vdev size limit. That's still a very big disk by today's standards, but if you're building a filesystem format that you expect may be used for (say) 50 years, you want to plan ahead.)
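The arithmetic behind these limits can be sketched out as follows. The 2^72 and 2^64 figures are the ones quoted earlier; the 63-bit field width is simply what you get by taking the nine bits of shift away from 72, and the helper name is hypothetical. Dividing evenly by the number of disks is the same rough approximation used in the paragraph above.

```python
SPA_MINBLOCKSHIFT = 9

EIB = 2 ** 60   # one exbibyte
ZIB = 2 ** 70   # one zebibyte

FORMAT_LIMIT = 2 ** 72     # bytes per vdev the on-disk format can address
SOFTWARE_LIMIT = 2 ** 64   # bytes per vdev the current software handles

# Width of the on-disk offset field implied by the two numbers above.
offset_field_bits = 72 - SPA_MINBLOCKSHIFT     # 63

# If the same field held byte offsets, a vdev would top out here.
limit_if_bytes = 2 ** offset_field_bits        # 8 EiB


def per_disk_limit(vdev_limit: int, disks: int) -> int:
    """Rough per-disk size limit for a raidz vdev with this many disks."""
    return vdev_limit // disks


print(FORMAT_LIMIT // ZIB)                          # 4   (ZiB)
print(limit_if_bytes // EIB)                        # 8   (EiB)
print(per_disk_limit(FORMAT_LIMIT, 32) // EIB)      # 128 (EiB per disk)
print(per_disk_limit(limit_if_bytes, 32) // 2**50)  # 256 (PiB per disk)
```

A 256 PiB disk is absurd today, but the difference between that and 128 EiB is the sort of headroom that the nine extra bits buy.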

I haven't looked at the OpenZFS code in depth to see how it handles DVA offsets in the current code. The comments in include/sys/spa.h make it sound like all interpretation of the contents of DVAs goes through the macros in that file, including the offset. The only apparent way to get access to the offset is with the DVA_GET_OFFSET() macro, which converts it to a byte offset in the process; this suggests that much or all ZFS code probably passes around DVA offsets as byte offsets, not in their on-disk block offset form.

(This is somewhat suggested by what Matthew Ahrens said about how the current software behaves; if it deals primarily or only with byte offsets, it's limited to vdevs of at most 2^64 bytes, although the disk format could accommodate larger ones. If all of the internal ZFS code deals with byte offsets, this might be part of why zdb prints DVAs as byte offsets; if you're going back and forth between a kernel debugger and zdb output, you want them to be in the same units.)

I'm disappointed that I was wrong (both yesterday and in the past), but at least I now have a definitive answer and I understand more about the situation.

Written on 31 August 2022.