What ZFS block pointers are and what's in them

June 24, 2018

I've mentioned ZFS block pointers in the past; for example, when I wrote about some details of ZFS DVAs, I said that DVAs are embedded in block pointers. But I've never really looked carefully at what is in block pointers and what that means and implies for ZFS.

The very simple way to describe a ZFS block pointer is that it's what ZFS uses in places where other filesystems would simply put a block number. Like a block number, and unlike things such as ZFS dnodes, a block pointer isn't a separate on-disk entity; instead it's an on-disk data format and an in-memory structure that shows up inside other things. To quote from the (draft and old) ZFS on-disk specification (PDF):

A block pointer (blkptr_t) is a 128 byte ZFS structure used to physically locate, verify, and describe blocks of data on disk.

Block pointers are embedded in any ZFS on-disk structure that points directly to other disk blocks, both for data and metadata. For instance, the dnode for a file contains block pointers that refer to either its data blocks (if it's small enough) or indirect blocks, as I saw in this entry. However, as I discovered when I paid attention, most things in ZFS only point to dnodes indirectly, by giving their object number (either in a ZFS filesystem or in pool-wide metadata).

So what's in a block pointer itself? You can find the technical details for modern ZFS in spa.h, so I'm going to give a sort of summary. A regular block pointer contains:

  • Various metadata and flags about what the block pointer is for and what parts of it mean, including what type of object it points to.

  • Up to three DVAs that say where to actually find the data on disk. There can be more than one DVA because you may have set the copies property to 2 or 3, or this may be metadata (which normally has two copies and may have more for sufficiently important metadata).

  • The logical size (size before compression) and 'physical' size (the nominal size after compression) of the disk block. The physical size can do odd things and is not necessarily the asize (allocated size) for the DVA(s).

  • The txgs that the block was born in, both logically and physically (the physical txg is apparently for dva[0]). The physical txg was added with ZFS deduplication but apparently also shows up in vdev removal.

  • The checksum of the data the block pointer describes. This checksum implicitly covers the entire logical size of the data, and as a result you must read all of the data in order to verify it. This can be an issue on raidz vdevs or if the block had to use gang blocks.
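To make the 128-byte figure concrete, here is a rough sketch of the layout using Python's ctypes. The field names follow blkptr_t as I remember it from spa.h around the time of this entry; treat them as an assumption rather than a reference, since newer versions may differ. The blk_fill and blk_pad fields are not covered in the summary above. The sizes, type, compression and so on are all packed as bitfields into the single blk_prop word, which this sketch does not decode.

```python
import ctypes

SPA_DVAS_PER_BP = 3  # a block pointer has room for up to three DVAs


class Dva(ctypes.Structure):
    # A DVA is two 64-bit words; vdev, offset, asize and the gang bit
    # are packed inside them (not decoded in this sketch).
    _fields_ = [("dva_word", ctypes.c_uint64 * 2)]


class ZioCksum(ctypes.Structure):
    # Room for a 256-bit checksum, stored as four 64-bit words.
    _fields_ = [("zc_word", ctypes.c_uint64 * 4)]


class Blkptr(ctypes.Structure):
    # Field names are an assumption based on spa.h from 2018.
    _fields_ = [
        ("blk_dva", Dva * SPA_DVAS_PER_BP),   # where the data is on disk
        ("blk_prop", ctypes.c_uint64),        # sizes, type, compression, flags
        ("blk_pad", ctypes.c_uint64 * 2),     # reserved space
        ("blk_phys_birth", ctypes.c_uint64),  # physical birth txg
        ("blk_birth", ctypes.c_uint64),       # logical birth txg
        ("blk_fill", ctypes.c_uint64),        # fill count
        ("blk_cksum", ZioCksum),              # checksum of the pointed-to data
    ]


print(ctypes.sizeof(Dva))     # 16
print(ctypes.sizeof(Blkptr))  # 128
```

The arithmetic is 3 x 16 bytes of DVAs, six 8-byte words, and a 32-byte checksum, which adds up to the 128 bytes the on-disk specification quotes.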

Just like basically everything else in ZFS, block pointers don't have an explicit checksum of their contents. Instead they're implicitly covered by the checksum of whatever they're embedded in; the block pointers in a dnode are covered by the overall checksum of the dnode, for example. Block pointers must include a checksum for the data they point to because such data is 'out of line' for the containing object.
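The idea that the checksum lives in the pointer rather than with the data can be sketched in a few lines. This is a toy model, not ZFS code: the `disk` dictionary, `write_block` and `read_verified` are all hypothetical names, the integer address stands in for a DVA, and SHA-256 is used only because it is in Python's standard library (ZFS supports several checksum algorithms).

```python
import hashlib

disk = {}  # hypothetical "disk": address -> bytes


def checksum(data: bytes) -> bytes:
    # Stand-in checksum; an assumption for this sketch only.
    return hashlib.sha256(data).digest()


def write_block(addr: int, data: bytes) -> dict:
    """Store data and return a toy 'block pointer' describing it."""
    disk[addr] = data
    return {"addr": addr, "lsize": len(data), "cksum": checksum(data)}


def read_verified(bp: dict) -> bytes:
    """Read the whole block and verify it against the pointer's checksum."""
    data = disk[bp["addr"]]
    if checksum(data) != bp["cksum"]:
        raise IOError("checksum mismatch for block at %d" % bp["addr"])
    return data


bp = write_block(1, b"some file data")
assert read_verified(bp) == b"some file data"
```

Note that `read_verified` has to read all of the block before it can say anything about whether the data is good, which is the property mentioned in the list above. The toy pointer itself carries no checksum of its own contents; in this model it would be protected by the checksum of whatever block it was stored in.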

(The block pointers in a dnode don't necessarily point straight to data. If there's more than a bit of data in whatever the dnode covers, the dnode's block pointers will instead point to some level of indirect block, which itself has some number of block pointers.)

There is a special type of block pointer called an embedded block pointer. Embedded block pointers directly contain up to 112 bytes of data; apart from the data, they contain only the metadata fields and a logical birth txg. As with conventional block pointers, this data is implicitly covered by the checksum of the containing object.
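The 112-byte figure falls out of simple arithmetic if you assume, as I believe spa.h does, that an embedded block pointer keeps two of the structure's sixteen 64-bit words (the metadata word and the logical birth txg) and hands the rest over to the payload. The sketch below only shows that arithmetic; `can_embed` is a hypothetical helper, and real ZFS has further conditions on when it will embed data that are not modeled here.

```python
BLKPTR_SIZE = 128  # bytes in a block pointer
WORD_SIZE = 8      # bytes per 64-bit word
KEPT_WORDS = 2     # assumption: metadata word + logical birth txg

EMBEDDED_PAYLOAD_SIZE = BLKPTR_SIZE - KEPT_WORDS * WORD_SIZE


def can_embed(payload: bytes) -> bool:
    """Hypothetical check: does this payload fit inside an embedded block pointer?"""
    return len(payload) <= EMBEDDED_PAYLOAD_SIZE


print(EMBEDDED_PAYLOAD_SIZE)  # 112
```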

Since block pointers directly contain the address of things on disk (in the form of DVAs), they have to change any time that address changes, which means any time ZFS does its copy on write thing. This forces a change in whatever contains the block pointer, which in turn ripples up to another block pointer (whatever points to said containing thing), and so on until we eventually reach the Meta Object Set and the uberblock. How this works is a bit complicated, but ZFS is designed to generally make this a relatively shallow change with not many levels of things involved (as I discovered recently).
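Here is a toy model of that ripple, under heavy simplifying assumptions: a `BP` is just an address plus a logical birth txg, `store` stands in for the disk, and blocks are never overwritten once written. None of these names come from ZFS, and real ZFS batches many changes into one txg instead of rewriting the path once per write.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class BP:
    addr: int   # stand-in for a DVA
    birth: int  # logical birth txg


store = {}          # addr -> contents (bytes for data, tuple of BP for indirect blocks)
_next_addr = count(1)


def alloc(contents, txg):
    """Write contents to a fresh address and return a pointer to it."""
    addr = next(_next_addr)
    store[addr] = contents
    return BP(addr, txg)


def cow_update(bp, path, new_data, txg):
    """Replace the data block reached via `path`, rewriting every block above it."""
    if not path:
        return alloc(new_data, txg)
    children = list(store[bp.addr])
    children[path[0]] = cow_update(children[path[0]], path[1:], new_data, txg)
    return alloc(tuple(children), txg)


# txg 1: a root with two indirect blocks, each pointing at two data blocks.
leaves = [alloc(f"d{i}".encode(), 1) for i in range(4)]
ind0 = alloc((leaves[0], leaves[1]), 1)
ind1 = alloc((leaves[2], leaves[3]), 1)
root1 = alloc((ind0, ind1), 1)

# txg 2: change one data block under the first indirect block.
root2 = cow_update(root1, [0, 1], b"new", 2)
print(len(store))  # 10: the seven original blocks plus three new ones
```

Changing one data block wrote three new blocks (the data block, its indirect block, and the root), each with a pointer born in txg 2. The second indirect block and everything under it are shared unchanged between the old and new roots, and the old tree is still fully readable through `root1`.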

As far as I understand things, the logical birth txg of a block pointer is the transaction group in which the block pointer was allocated. Because of ZFS's copy on write principle, this means that nothing underneath the block pointer has been updated or changed since that txg; if something changed, it would have been written to a new place on disk, which would have forced a change in at least one DVA and thus a ripple of updates that would update the logical birth txg.

However, this doesn't quite mean what I used to think it meant because of ZFS's level of indirection. If you change a file by writing data to it, you will change some of the file's block pointers, updating their logical birth txg, and you will change the file's dnode. However, you won't change any block pointers and thus any logical birth txgs for the filesystem directory the file is in (or anything else up the directory tree), because the directory refers to the file through its object number, not by directly pointing to its dnode. You can still use logical birth txgs to efficiently find changes from one txg to another, but you won't necessarily get a filesystem level view of these changes; instead, as far as I can see, you will basically get a view of what object(s) in a filesystem changed (effectively, what inode numbers changed).
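The way birth txgs let you find changes cheaply can be sketched as a pruned tree walk: if the pointer to a block was born at or before the txg you are comparing against, nothing underneath it can have changed, so you skip the whole subtree. The `Node` class and `changed_since` below are hypothetical, and the tree is a simplified in-memory stand-in for the blocks of a single object, not a filesystem directory tree.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    birth: int  # logical birth txg of the pointer to this block
    children: list = field(default_factory=list)


def changed_since(node, txg):
    """Return names of data blocks born after txg, skipping unchanged subtrees."""
    if node.birth <= txg:
        return []  # copy on write: nothing below here is newer than txg
    if not node.children:
        return [node.name]
    found = []
    for child in node.children:
        found.extend(changed_since(child, txg))
    return found


# A parent's birth txg is always at least as new as its newest child.
tree = Node("root", 5, [
    Node("ind-a", 5, [Node("a1", 2), Node("a2", 5)]),
    Node("ind-b", 3, [Node("b1", 3), Node("b2", 1)]),
])

print(changed_since(tree, 3))  # ['a2']
print(changed_since(tree, 2))  # ['a2', 'b1']
```

In the first call the walk never looks below "ind-b", because its pointer was born in txg 3. What comes out is a list of changed blocks, not file names, which matches the point above: this approach tells you which objects changed, not where they live in the directory hierarchy.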

(ZFS has an interesting hack to make things like 'zfs diff' work far more efficiently than you would expect in light of this, but that's going to take yet another entry to cover.)

Written on 24 June 2018.

Last modified: Sun Jun 24 23:19:40 2018