2014-05-02
An important addition to how ZFS deduplication works on the disk
My entry on how ZFS deduplication works on the disk turns out to have missed one important aspect of how deduplication affects the on-disk ZFS data. Armed with this information we can finally answer some long-standing uncertainties about ZFS deduplication.
As I mentioned in passing earlier, ZFS uses block pointers (BPs) to describe where the actual data for a block is. A block pointer holds the data virtual addresses (DVAs) of up to three copies of the block's data, the block's checksum, and a number of other bits and pieces. Crucially, a block pointer is specially marked if it was written with deduplication on. It is this deduplication flag in each individual block pointer that controls what happens when the block pointer is deleted. If the flag is on, the delete does a DDT lookup so that the reference count can be maintained; if the flag is off, no DDT lookup is needed.
(When the reference count of a DDT entry goes to zero, the DDT entry itself gets deleted. A ZFS pool always has DDTs, even if they're empty.)
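To make the delete path concrete, here is a toy Python model. It is not real ZFS code: the names (BlockPointer, Pool, write_block, free_block, ddt_lookups) are all invented for illustration, the DDT is reduced to a dict keyed by checksum, and a single integer stands in for the DVAs. All it tries to show is that the flag stored in each block pointer, not the current dedup setting, decides whether a free touches the DDT.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockPointer:
    dva: int        # stand-in for the real DVAs (hypothetical simplification)
    checksum: str
    dedup: bool     # set if the block was written with dedup on


class Pool:
    def __init__(self):
        self.dedup_on = False
        self.ddt = {}          # checksum -> [dva, refcount]
        self.ddt_lookups = 0   # counts DDT lookups so the cost is visible
        self.next_dva = 0

    def _alloc(self):
        self.next_dva += 1
        return self.next_dva

    def write_block(self, data: bytes) -> BlockPointer:
        csum = hashlib.sha256(data).hexdigest()
        if not self.dedup_on:
            return BlockPointer(self._alloc(), csum, dedup=False)
        self.ddt_lookups += 1
        entry = self.ddt.get(csum)
        if entry is None:
            entry = [self._alloc(), 0]
            self.ddt[csum] = entry
        entry[1] += 1
        return BlockPointer(entry[0], csum, dedup=True)

    def free_block(self, bp: BlockPointer) -> None:
        if not bp.dedup:
            return                      # flag off: no DDT lookup needed
        self.ddt_lookups += 1           # flag on: must maintain the refcount
        entry = self.ddt[bp.checksum]
        entry[1] -= 1
        if entry[1] == 0:
            del self.ddt[bp.checksum]   # refcount hit zero: entry goes away


pool = Pool()
pool.dedup_on = True
a = pool.write_block(b"same data")
b = pool.write_block(b"same data")      # shares a's DDT entry and DVA
pool.dedup_on = False
c = pool.write_block(b"same data")      # plain BP with its own DVA

pool.free_block(c)                      # plain BP: DDT untouched
lookups_after_plain_free = pool.ddt_lookups
pool.free_block(a)                      # dedup'd BP: lookup, refcount 2 -> 1
pool.free_block(b)                      # lookup, refcount 1 -> 0, entry removed
```

Note that the frees of a and b still go through the DDT even though dedup was turned off before they happened, and that the DDT ends up empty once both are gone.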
As mentioned in the first entry, deduplication has basically no effect on reads. Reads of a dedup'd BP don't normally involve the DDT, because the BP contains the DVAs of some copies of the block and ZFS will just read directly from these. However, if there is a read error on a dedup'd BP, ZFS does a DDT lookup to see if there's another copy of the block available (for example in the 'ditto' copies).
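The shape of that read path can be sketched in a few lines of Python. Again this is a toy under stated assumptions, not how the ZFS code is actually organized: read_block, read_dva, ddt_lookup and ReadError are hypothetical names, and DVAs are plain integers.

```python
from typing import Callable, Iterable, List


class ReadError(Exception):
    """Raised by the hypothetical read_dva callback on an IO or checksum error."""


def read_block(
    bp_dvas: List[int],
    is_dedup: bool,
    read_dva: Callable[[int], bytes],
    ddt_lookup: Callable[[], Iterable[int]],
) -> bytes:
    # Normal path: try the copies named in the block pointer itself.
    # The DDT is never consulted here.
    for dva in bp_dvas:
        try:
            return read_dva(dva)
        except ReadError:
            continue

    # Error path: only a dedup'd BP has a DDT entry that might know
    # about additional copies (such as ditto copies).
    if is_dedup:
        for dva in ddt_lookup():
            if dva in bp_dvas:
                continue  # already tried and failed
            try:
                return read_dva(dva)
            except ReadError:
                continue

    raise ReadError("no readable copy of block")
```

The DDT lookup is passed in as a callback so that it is only invoked once every copy listed in the BP has failed.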
(I'm waving my hands about deduplication's potential effects on how fragmented a file's data gets on the disk.)
Only file data is deduplicated. ZFS metadata like directories is not subject to deduplication, so block pointers for metadata blocks will never be dedup'd BPs. This is pretty much what you'd expect, but I feel like mentioning it explicitly since I just checked it in the code.
So turning ZFS deduplication on does not irreversibly taint anything as far as I can see. Any data written while deduplication is on will get dedup'd BPs, and deleting it will hit the DDT, but once deduplication is turned off and all of that data is deleted, the DDT should be empty again. If you never delete any of that data, the only effect is that the DDT will sit there taking up some extra space. But you will take the potential deduplication hit when you delete data written while deduplication was on, even if you later turn it off, and this includes deleting snapshots.
Sidebar: Deduplication and ZFS scrubs
As you'd expect, ZFS scrubs and resilvers do check and correct DDT entries, and they check all DVAs that DDT entries point to (even ditto blocks, which are not directly referred to by any normal data BPs). The scanning code tries to do DDT and file data checks efficiently, basically checking DDT entries and the DVAs they point to once no matter how many references they have. The exact mechanisms are a little bit complicated.
(My paranoid instincts see corner cases with this code, but I'm probably wrong. And if they happened they would probably be the result of ZFS code bugs, not disk IO errors.)
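For illustration only, here is one simple way to get the 'check each DDT entry once regardless of reference count' behavior. This is explicitly not the mechanism the ZFS scanning code uses, which is more complicated; the function name scrub, the dict layouts, and the check_dva callback are all assumptions made up for this sketch.

```python
def scrub(ddt, block_pointers, check_dva):
    """Check every DVA once; return how many DVA checks were done."""
    checks = 0

    # Pass 1: every DDT entry and all of the DVAs it points to,
    # including ditto copies that no normal BP refers to.
    for entry in ddt.values():
        for dva in entry["dvas"]:
            check_dva(dva)
            checks += 1

    # Pass 2: walk the ordinary block pointers. Dedup'd BPs are skipped
    # because their data was already covered through the DDT in pass 1.
    for bp in block_pointers:
        if bp["dedup"]:
            continue
        for dva in bp["dvas"]:
            check_dva(dva)
            checks += 1

    return checks
```

With one DDT entry that has a main copy and a ditto copy, three dedup'd BPs referencing it, and one plain BP, this does three DVA checks rather than five.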