Wandering Thoughts archives

2014-04-26

What I can see about how ZFS deduplication seems to work on disk

There is a fair amount of high-level information about how ZFS deduplication works. There is much less that I could find about the low-level details of how deduplicated blocks exist on disk and some implications of how the on-disk data structures are stored. Since I was just looking this up in the current Illumos source code, I want to jot down some notes before it all falls out of my head again.

The core dedup data structure is the DDT, which holds the core information for each deduplicated block: the block's checksum, the number of references to the block, and the on-disk addresses of some number of copies of the block. Don't ask me exactly how many copies of the block there can be out there in the world; my head gets confused trying to follow the code. The DDT is stored as part of the overall pool metadata (via the ZAP) and as such it is copy-on-write, just like pretty much everything else in ZFS. This makes total sense and is what you need.

Note that the DDT is global to the pool; it is not tied to any particular filesystem. As a pool-level object it is no more captured in filesystem snapshots than, say, the information about which disk blocks are free is.

ZFS records where blocks are on disk through the use of 'block pointers' (which also include things like the block's checksum). An interesting question is whether the block pointer for a dedup'd block refers to the DDT in any way. The answer is that it doesn't; it points directly to the on-disk addresses of up to three copies of the block. So files with deduplicated blocks are read without referring to the DDT, at least if all goes well.

If configured to do so, ZFS can store more than one copy of a sufficiently highly referenced data block. As more and more references accumulate, ZFS will sooner or later create a second and perhaps a third and fourth copy. I believe that these additional copies will be used to recover from failed reads of the original copy of the data block even for blocks that were written before the extra copies existed, although the block pointers for those older blocks don't directly contain the on-disk addresses of these additional 'ditto' copies. If I'm correct, this implies that failed reads may cause DDT access in order to see if such ditto blocks exist. Of course you probably don't care about any extra overhead from this if it saves your data.

In general, the more I look at this code the less confident I am that I have any understanding of the effects and consequences of turning off ZFS deduplication on a filesystem after you've turned it on. I suppose this just echoes what I said back in ZFSDedupBadDocumentation.

(There are other things about ZFS dedup that I don't understand after reading the code, but I'm going to save them for an appropriate ZFS mailing list.)

ZFSDedupStorage written at 03:33:19

2014-04-02

I'm angry that ZFS still doesn't have an API

Yesterday I wrote a calm rational explanation for why I'm not building tools around 'zpool status' any more and said that it ended up being only half of the story. The other half is that I am genuinely angry that ZFS still does not have any semblance of an API, so angry that I've decided to stop cooperating with ZFS's non-API and make my own.

(It's not the hot anger of swearing, it's the slow anger of a blister that keeps reminding you about its existence with every step you take.)

For at least the past six years it has been blindingly obvious that ZFS should have an API so that people could build additional tools and solutions on top of it. For all that is sane, stock ZFS doesn't even have an alerting solution for pool problems. You can't miss that unless you're blind, and say whatever you want about the ZFS developers, I'm sure that they're not blind. I am and have been completely agnostic about the exact form this API could have taken, so long as it existed. Stable, documented, script-friendly output from the ZFS tools? A documented C-level library API? XML information dumps, because everyone loves XML? A web API? Whatever. I could have worked with any of them.

Instead we got nothing. We got nothing when ZFS was with Sun and despite some vague signs of care we continue to get exactly nothing now that ZFS is effectively with Illumos (and I'm pretty sure that Oracle hasn't fixed the situation either). At this point it is clear that the ZFS developers have different priorities and in an objective sense do not care about this issue.

(Regardless of what you say, what you actually care about is shown by what you work on.)

This situation has thoroughly gotten under my skin now that moving to OmniOS is rubbing my nose in it again. So now I'm through with tacitly cooperating with it by trying to wrestle and wrangle the ZFS commands to do what I want. Instead I feel like giving 'zpool status' and its friends a great big middle finger and then throwing them down a well. The only thing I want to use them for now is as a relatively authoritative source of truth if I suspect that something is wrong with what my own tools are showing me.

(I call zpool status et al 'relatively authoritative' because these commands leave things out and otherwise mangle what you are seeing, sometimes in ways that cause real problems.)

I will skip theories about why the ZFS developers did not develop an API (either in Sun or later), partly because I am in a bad mood after writing this and so am inclined to be extremely cynical.

ZFSNoAPIAnger written at 00:12:03

