ZFS DVA offsets are in 512-byte blocks on disk but zdb misleads you about them
Yesterday I asserted that ZFS DVA offsets were in bytes, based primarily on using zdb
to dump
a znode and then read a data block using the offset that zdb printed.
Over on Twitter, Matthew Ahrens corrected my misunderstanding:
The offset is stored on disk as a multiple of 512, see the DVA_GET_OFFSET() macro, which passes shift=SPA_MINBLOCKSHIFT=9. For human convenience, the DVA is printed in bytes (e.g. by zdb). So the on-disk format can handle up to 2^72 bytes (4 ZiB) per vdev.
... but the current software doesn't handle more than 2^64 bytes (16 EiB).
That is to say, when zdb
prints ZFS DVAs
it is not showing you the actual on-disk representation, or a lightly
decoded version of it; instead the offset is silently converted
from its on-disk form of 512-byte blocks to a version in bytes. I
think that this is also true of other pieces of ZFS code that print
DVAs as part of diagnostics, kernel messages, and so on. Based on
lightly reading the code, I believe that the size of the DVA is
also recorded on disk in 512-byte blocks, because zdb and other
things use a similar C macro (DVA_GET_ASIZE()) when printing
it.
(Both macros are #define'd in include/sys/spa.h.)
So, to summarize: on disk, ZFS DVA offsets are in units of 512-byte
blocks, with offset 0 (on each disk) starting after a 4 Mbyte
header. In addition, zdb
prints offsets (and sizes) in units
of bytes, not their on disk 512-byte blocks (in hexadecimal), as
(probably) do other things. If zdb says that a given DVA is
'0:7ea00:400', that is a byte offset of 518656 bytes and a byte
size of 1024 bytes. Zdb is decoding these for you from their on
disk form. If a kernel message talks about DVA '0:7ea00:400' it's
also most likely using byte offsets, as zdb does.
These DVA block offsets are always for 512 byte blocks. The 'block
size' of the offset is fixed, and doesn't depend on the physical
block size of the disk, the logical block size of the disk, or the
ashift
of the vdev.
Since 512 bytes is the block size for the minimum ashift
, ZFS
will never have to assign finer grained addresses than that, even
if it's somehow dealing with a disk or other storage with smaller
sized blocks. This makes using 512 byte 'blocks' completely safe.
That the DVA offsets are
in blocks is, in a sense, mostly a way of increasing how large that
a vdev can be (adding nine bits of size).
(This is not as crazy a concern as you might think, since DVA offsets in a raidz vdev cover the entire (raw) disk space of the vdev. If you want to allow, say, 32 disk raidz vdevs without size limits, the disk size limit is 1/32nd of the vdev size limit. That's still a very big disk by today's standards, but if you're building a filesystem format that you expect may be used for (say) 50 years, you want to plan ahead.)
I haven't looked at the OpenZFS code in depth to see how it handles DVA offsets in the current code. The comments in include/sys/spa.h make it sound like all interpretation of the contents of DVAs go through the macros in the file, including the offset. The only apparent way to get access to the offset is with the DVA_GET_OFFSET() macro, which converts it to a byte offset in the process; this suggests that much or all ZFS code probably passes around DVA offsets as byte offsets, not their on-disk block offset form.
(This is somewhat suggested by what Matthew Ahrens said about how the current software behaves; if it deals primarily or only with byte offsets, it's limited to vdevs with at 2^64 bytes, although the disk format could accommodate larger ones. If all of the internal ZFS code deals with byte offsets, this might be part of why zdb prints DVAs as byte offsets; if you're going back and forth between a kernel debugger and zdb output, you want them to be in the same units.)
I'm disappointed that I was wrong (both yesterday and in the past), but at least I now have a definitive answer and I understand more about the situation.
|
|