Wandering Thoughts archives

2014-05-21

How I wish ZFS pool importing could work

I've mentioned before that one of our problems is that explicit 'zpool import' commands are very slow in our environment, so slow that we don't try to do failover even though we're theoretically set up for it. At least back in the Solaris era, and I assume still in the OmniOS one, this comes about for two reasons. First, when you run 'zpool import' (for basically any reason) it checks every disk you have, one at a time, to build up a mapping of which ZFS labels are where and so on. Back when I timed it, this seemed to take roughly a third of a second per visible 'disk' (a real disk or an iSCSI LUN). Second, when your 'zpool import' command finishes it promptly throws away all of that slowly and expensively gathered information, so the next 'zpool import' command you run has to do it all over again. Both of these combine unpleasantly in typical failover scenarios. You might do one 'zpool import' to confirm that all the pools you want to import are fully visible, and then 'zpool import' five or six pools one at a time (because you can't import multiple pools at once with a normal 'zpool import' command). The resulting time consumption adds up fast.

What I would like is a way for ZFS pool imports to fix both problems. Sequential disk probing is the easy one: just don't do it sequentially. Scanning some number of disks in parallel ought to significantly speed things up, and even modest levels of parallelism offer potentially big wins (eg, probing two disks at a time could theoretically halve the time needed).
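
To make this concrete, here is a rough sketch in Python of what I mean by probing disks in parallel. This is purely illustrative; read_zfs_label() is a made-up placeholder for whatever actually reads and decodes the ZFS label on a device, not anything from libzfs, and the device pattern is just an example.

  import concurrent.futures
  import glob

  def read_zfs_label(dev):
      # Placeholder: this would open the device, read its ZFS label copies
      # and return the pool/vdev configuration found there (or None if the
      # device has no label). It is not anything from libzfs.
      return None

  def scan_devices(devices, workers=8):
      """Probe many devices at once instead of one at a time."""
      labels = {}
      with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
          futures = {pool.submit(read_zfs_label, d): d for d in devices}
          for fut in concurrent.futures.as_completed(futures):
              label = fut.result()
              if label is not None:
                  labels[futures[fut]] = label
      return labels

  # With, say, 300 visible 'disks' at roughly a third of a second each, a
  # sequential scan is around 100 seconds; eight probes at a time would cut
  # that to somewhere around 12 or 13 seconds.
  all_labels = scan_devices(glob.glob('/dev/rdsk/*s0'))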

There are two potential fixes for the problem of 'zpool import' throwing away all of that work. The simpler one is to make it possible to import multiple pools in a single 'zpool import' operation. There's no fundamental obstacle to this in the code; it's just a small matter of creating a command line syntax for it and then basically writing a loop over the import operation (right now giving two pool names renames a pool on import and giving more than two is a syntax error). The bigger fix is to give 'zpool import' an option to not throw the work away, letting it write the accumulated information out to a cache file and then reload it under suitable conditions (both should require a new command line switch). If the import process finds that the on-disk reality doesn't match the cache file's data, it falls back to doing the current full scan (checking disks in parallel, please).
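
Here is a similar sketch of the 'keep the scan results around' idea, reusing the scan_devices() sketch from above. The cache format (plain JSON in an invented location) is just for illustration, not the real ZFS cache file format, and the staleness check is deliberately crude.

  import json

  CACHE = '/var/tmp/zpool-scan-cache.json'   # invented location, not a real file

  def save_scan(labels, path=CACHE):
      # Assumes the label information is plain data (dicts, strings, numbers).
      with open(path, 'w') as f:
          json.dump(labels, f)

  def load_scan(devices, path=CACHE):
      """Reuse a previous scan if it still looks valid; otherwise rescan."""
      try:
          with open(path) as f:
              cached = json.load(f)
      except (OSError, ValueError):
          cached = None
      # Crude staleness check: if the set of visible devices has changed,
      # don't trust the cache and fall back to a full (parallel) scan.
      if cached is None or set(cached) != set(devices):
          cached = scan_devices(devices)
          save_scan(cached, path)
      return cached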

At this point some people will be tempted to suggest ZFS cache files. Unfortunately these are not a solution, for at least two reasons. First, you can't use ZFS cache files to accelerate a scan for what pools are available for import; a plain 'zpool import' doesn't take a '-c cachefile' argument. Second, there's no way to build or rebuild ZFS cache files without actually importing pools. This makes managing them very painful in practice; for example, you can't have a single ZFS cache file with a global view of all pools available on your shared storage unless you import them all on one system and then save the resulting cache file.

(Scanning for visible pools matters in failover on shared storage because you really want to make sure that the machine you're failing over to can see all of the shared storage that it should. In fact I'd like a ZFS pool import option for 'do not import pools unless all of their devices are visible'; we'd certainly use it by default because in most situations in our environment we'd rather a pool not import at all than import with mirrors broken because eg an iSCSI target was accidentally not configured on one server.)

ZFSPoolImportWish written at 01:32:27

2014-05-02

An important addition to how ZFS deduplication works on the disk

My entry on how ZFS deduplication works on the disk turns out to have missed one important aspect of how deduplication affects the on-disk ZFS data. Armed with this information, we can finally resolve some long-standing uncertainties about ZFS deduplication.

As I mentioned in passing earlier, ZFS uses block pointers to describe where the actual data for blocks is. Block pointers hold the data virtual addresses (DVAs) of up to three copies of the block's data, the block's checksum, and a number of other bits and pieces. Crucially, block pointers are specially marked if they were written with deduplication on. It is the deduplication flag in any particular block pointer that controls what happens when the block pointer is deleted: if the flag is on, the delete does a DDT lookup so that the reference counts can be maintained; if the flag is off, no DDT lookup is needed.

(When the reference count of a DDT entry goes to zero, the DDT entry itself gets deleted. A ZFS pool always has DDT tables, even if they're empty.)
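
To make the bookkeeping concrete, here is a toy model in Python of the delete path as I understand it. These are emphatically not the real ZFS structures (which are far more involved); the names and fields here are mine, purely for illustration.

  from dataclasses import dataclass

  @dataclass
  class BlockPointer:
      checksum: str            # stands in for the block's checksum
      dvas: list               # data virtual addresses of up to three copies
      is_dedup: bool = False   # 'this BP was written with dedup on'

  @dataclass
  class DDTEntry:
      refcount: int
      dvas: list

  class Pool:
      def __init__(self):
          # checksum -> DDTEntry. The DDT is always there, even if empty.
          self.ddt = {}

      def free_block(self, bp):
          if not bp.is_dedup:
              # Non-dedup'd BP: just free its DVAs, no DDT lookup at all.
              self._free_dvas(bp.dvas)
              return
          # Dedup'd BP: the delete has to go through the DDT so that the
          # reference count stays correct.
          entry = self.ddt[bp.checksum]
          entry.refcount -= 1
          if entry.refcount == 0:
              # Last reference gone: free the actual data and delete the
              # DDT entry itself.
              self._free_dvas(entry.dvas)
              del self.ddt[bp.checksum]

      def _free_dvas(self, dvas):
          pass   # placeholder for actually returning the space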

As mentioned in the first entry, deduplication has basically no effect on reads, because reads of a dedup'd BP don't normally involve the DDT; the BP contains the DVAs of some copies of the block and ZFS will just read directly from those. However, if there is a read error on a dedup'd BP, ZFS does a DDT lookup to see if there's another copy of the block available (for example in the 'ditto' copies).
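
Here is the matching read path in the same toy model: reads go straight to the BP's own DVAs, and the DDT only gets consulted if all of those copies fail. Again, this is just a sketch of the behaviour, not ZFS code; read_dva() is a placeholder.

  class ReadError(Exception):
      pass

  def read_dva(dva):
      # Placeholder for actually reading the data at a data virtual address;
      # it would raise ReadError on a checksum or IO failure.
      return b''

  def read_block(pool, bp):
      # Normal case: read straight from the BP's own DVAs, no DDT involved.
      for dva in bp.dvas:
          try:
              return read_dva(dva)
          except ReadError:
              continue
      # All of the BP's copies failed. A dedup'd BP can fall back to a DDT
      # lookup to find any additional ('ditto') copies of the same block.
      if bp.is_dedup and bp.checksum in pool.ddt:
          for dva in pool.ddt[bp.checksum].dvas:
              try:
                  return read_dva(dva)
              except ReadError:
                  continue
      raise ReadError("no readable copy of this block")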

(I'm waving my hands about deduplication's potential effects on how fragmented a file's data gets on the disk.)

Only file data is deduplicated. ZFS metadata like directories is not subject to deduplication and so block pointers for metadata blocks will never be dedup'd BPs. This is pretty much what you'd expect but I feel like mentioning it explicitly since I just checked this in the code.

So turning ZFS deduplication on does not irreversibly taint anything, as far as I can see. Any data written while deduplication is on will be written with dedup'd BPs, and when it's deleted you'll hit the DDT; but after deduplication is turned off and all of that data is deleted, the DDT should be empty again. If you never delete any of the data, the only effect is that the DDT will sit there taking up some extra space. But you will take the potential deduplication hit when you delete data that was written while deduplication was on, even if you've since turned it off, and this includes deleting snapshots.

Sidebar: Deduplication and ZFS scrubs

As you'd expect, ZFS scrubs and resilvers do check and correct DDT entries, and they check all DVAs that DDT entries point to (even ditto blocks, which are not directly referred to by any normal data BPs). The scanning code tries to do the DDT and file data checks efficiently, basically checking DDT entries and the DVAs they point to only once, no matter how many references they have. The exact mechanisms are a little bit complicated.

(My paranoid instincts see corner cases with this code, but I'm probably wrong. And if they happened they would probably be the result of ZFS code bugs, not disk IO errors.)
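
Very roughly, in the same toy model as before, the shape of it is something like this. Take it only as an outline of the 'check dedup'd data once via the DDT' idea; the real scrub and resilver code is considerably more involved.

  def verify_dva(dva, checksum):
      # Placeholder: read the copy at this DVA, recompute its checksum and
      # repair it from another good copy if it doesn't match.
      pass

  def scrub(pool, all_bps):
      # Pass 1: walk the DDT and check every DVA each entry points to,
      # including ditto copies that no ordinary data BP refers to directly.
      for checksum, entry in pool.ddt.items():
          for dva in entry.dvas:
              verify_dva(dva, checksum)
      # Pass 2: walk the pool's block pointers, skipping dedup'd BPs; their
      # data was already checked once via the DDT, no matter how many BPs
      # reference it.
      for bp in all_bps:
          if bp.is_dedup:
              continue
          for dva in bp.dvas:
              verify_dva(dva, bp.checksum)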

ZFSDedupStorageII written at 01:49:29
