What I can see about how ZFS deduplication seems to work on disk

April 26, 2014

There is a fair amount of high-level information about how ZFS deduplication works. There is much less that I could find about the low-level details of how deduplicated blocks exist on disk and what the on-disk data structures imply. Since I was just looking this up in the current Illumos source code, I want to jot down some notes before it all falls out of my head again.

The core dedup data structure is the DDT (the dedup table), which holds the essential information for each deduplicated block: the block's checksum, the number of references to the block, and the on-disk addresses of some number of copies of the block. Don't ask me exactly how many copies of the block there can be out there in the world; my head gets confused trying to follow the code. The DDT is stored as part of the overall pool metadata (via the ZAP) and as such the DDT is copy on write, just like pretty much everything else in ZFS. This makes total sense and is what you need.
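To make the shape of this concrete, here is a minimal toy model of a checksum-keyed dedup table. It is only a sketch of the idea as I understand it, not the real on-disk format: the names `DDTEntry`, `ToyDDT` and `write_block` are made up for illustration, SHA-256 stands in for whatever checksum the pool uses, and addresses are just integers.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DDTEntry:
    """Toy stand-in for one dedup table entry (field names are illustrative)."""
    checksum: bytes
    refcount: int = 0
    addresses: list = field(default_factory=list)


class ToyDDT:
    """Hypothetical pool-wide table, keyed by block checksum."""

    def __init__(self):
        self.entries = {}
        self.next_addr = 0  # pretend allocator: next free "disk address"

    def write_block(self, data: bytes) -> int:
        """Return the address to record for this write, allocating only if new."""
        key = hashlib.sha256(data).digest()
        entry = self.entries.get(key)
        if entry is None:
            # First time this content is seen: allocate space and add an entry.
            entry = DDTEntry(checksum=key, addresses=[self.next_addr])
            self.next_addr += 1
            self.entries[key] = entry
        # Every write of the same content just adds a reference.
        entry.refcount += 1
        return entry.addresses[0]
```

Writing the same data twice returns the same address and leaves a single entry with a reference count of two; writing different data allocates a new address.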

Note that the DDT is global to the pool; it is not tied to any particular filesystem. As a pool level object it is not captured in filesystem snapshots any more than, say, information about which disk blocks are free is.

ZFS records where blocks are on disk through the use of 'block pointers' (which also include things like the block's checksum). An interesting question is whether the block pointer for a dedup'd block refers to the DDT in any way. The answer is that it doesn't; it points directly to the on-disk addresses of up to three copies of the block. So files with deduplicated blocks are read without referring to the DDT, at least if all goes well.
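A rough sketch of what that means for the normal read path, under my own simplifying assumptions: `BlockPointer` and `read_block` are hypothetical names, the "disk" is a plain dictionary from integer address to bytes, and SHA-256 stands in for the real checksum. The point is only that the pointer carries its own addresses, so nothing here looks at a dedup table.

```python
import hashlib
from dataclasses import dataclass

MAX_COPIES = 3  # a block pointer has room for up to three addresses


@dataclass(frozen=True)
class BlockPointer:
    """Toy block pointer: a checksum plus the addresses copied in at write time."""
    checksum: bytes
    addresses: tuple

    def __post_init__(self):
        if not 1 <= len(self.addresses) <= MAX_COPIES:
            raise ValueError("a block pointer holds between one and three addresses")


def read_block(bp: BlockPointer, disk: dict) -> bytes:
    """Read using only the pointer's own addresses; no dedup table lookup."""
    for addr in bp.addresses:
        data = disk.get(addr)
        # Accept the first copy that is present and matches the checksum.
        if data is not None and hashlib.sha256(data).digest() == bp.checksum:
            return data
    raise OSError("every copy named in the block pointer failed")
```

If the first address holds damaged data, the loop simply moves on to the next address in the pointer.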

If configured to do so, ZFS can store more than one copy of a sufficiently highly referenced data block. As more and more references add up, ZFS will sooner or later create a second and perhaps a third copy. I believe that these additional copies will be used to recover from failed reads of the original copy of the data block even for things that were written before the extra copies existed, although those older block pointers don't directly contain the on-disk addresses of these additional 'ditto' copies. If I'm correct, this implies that failed reads may cause DDT access in order to see if such ditto blocks exist. Of course you probably don't care about any extra overhead from this if it saves your data.
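Here is a sketch of both halves of that guess. Everything in it is an assumption on my part rather than a description of the real code: the function names are hypothetical, the rule of adding a copy at `threshold` references and another at `threshold` squared is just one plausible policy, the cap of three copies comes from the three addresses a block pointer can hold, and the dedup table is modeled as a dictionary from checksum to a list of addresses.

```python
import hashlib


def copies_wanted(refcount: int, threshold: int) -> int:
    """How many copies a block 'deserves' under an assumed threshold policy."""
    if refcount < 1:
        return 0
    copies = 1
    if threshold > 0:  # a threshold of 0 means extra copies are disabled
        if refcount >= threshold:
            copies += 1
        if refcount >= threshold * threshold:
            copies += 1
    return copies  # never more than three


def read_with_ditto_fallback(bp_addresses, checksum, ddt, disk):
    """Try the block pointer's addresses first; consult the table only on failure."""
    def good(addr):
        data = disk.get(addr)
        if data is not None and hashlib.sha256(data).digest() == checksum:
            return data
        return None

    tried = set()
    for addr in bp_addresses:
        tried.add(addr)
        data = good(addr)
        if data is not None:
            return data

    # Failure path (my hypothesis): look for ditto copies the pointer never knew about.
    for addr in ddt.get(checksum, ()):
        if addr in tried:
            continue
        data = good(addr)
        if data is not None:
            return data
    raise OSError("no readable copy found, with or without the dedup table")
```

Note that the table lookup only happens after every address in the block pointer has failed, which is why the extra overhead would show up only on bad reads.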

In general, the more I look at this code the less confident I am that I have any understanding of the effects and consequences of turning off ZFS deduplication on a filesystem after you've turned it on. I suppose this just echoes what I said back in ZFSDedupBadDocumentation.

(There are other things about ZFS dedup that I don't understand after reading the code, but I'm going to save them for an appropriate ZFS mailing list.)

Written on 26 April 2014.
