A lingering sign of old hopes for ZFS deduplication

January 21, 2021

Over on Twitter, I said:

It's funny-sad that ZFS dedup was considered such an important feature when it launched that 'zpool list' had a DEDUP field added, even for systems with no dedup ever enabled. Maybe someday 'zpool list' will drop that field in the default output.

For people who have never seen it, here is 'zpool list' output on a current (development) version of OpenZFS on Linux:

; zpool list
NAME     SIZE  ALLOC  FREE  CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
ssddata  596G   272G  324G        -         -   40%  45%  1.00x  ONLINE  -

The DEDUP field is the ratio of space saved by deduplication, expressed as a multiplier (how much larger the data would be without deduplication, relative to the space actually allocated after it). It's always present in default 'zpool list' output, and since almost all ZFS pools don't use deduplication, it's almost always 1.00x.
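As a rough sketch of that definition (this is an illustration, not the actual OpenZFS code, and the data structure here is invented), the multiplier is the logical size of all data divided by the space actually allocated for it once duplicate blocks are stored only once:

```python
# Hypothetical illustration of how a ratio like the DEDUP column is
# derived; the block table below is invented for the example.

def dedup_ratio(blocks):
    """blocks maps a block checksum to (reference count, block size).

    Logical size counts every reference to a block; allocated size
    counts each unique block once.  With no duplicates, the ratio
    is exactly 1.00x.
    """
    logical = sum(refs * size for refs, size in blocks.values())
    allocated = sum(size for _refs, size in blocks.values())
    return logical / allocated

# A pool where every block is unique: the ratio is 1.00x.
unique = {"a": (1, 128), "b": (1, 128)}
print(f"{dedup_ratio(unique):.2f}x")   # 1.00x

# One block referenced three times: 512 bytes logical, 256 allocated.
shared = {"a": (3, 128), "b": (1, 128)}
print(f"{dedup_ratio(shared):.2f}x")   # 2.00x
```

This is also why the field sits at 1.00x on a pool that has never had dedup enabled: every block has exactly one reference.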

It seems very likely that Sun and the Solaris ZFS developers had great hopes for ZFS deduplication when the feature was initially launched. Certainly the feature was very attention-getting and superficially attractive; a decade ago, people had heard of it and would recommend it casually, although actual Solaris developers were more nuanced. The presence of a DEDUP field in the default 'zpool list' output is probably the product of an assumption that ZFS deduplication would be commonly used, and so that showing the field would be both useful and important.

However, things did not turn out that way. ZFS deduplication is almost never used, because once people tried to use it for real they discovered that it was mostly toxic, primarily because of high memory requirements. Yet the DEDUP field lingers on in the default 'zpool list' output, and people like me can see it as a funny and sad reminder of the initial hopes for ZFS deduplication.

(OpenZFS could either remove it or, if possible, replace it with the overall compression ratio multiplier for the pool, since many pools these days turn on compression. You would still want to have DEDUP available as a field in some version of 'zpool list' output, since the information doesn't seem to be readily available anywhere else.)

PS: Since I looked it up, ZFS deduplication was introduced in Oracle Solaris 11 for people using Solaris, which came out in November of 2011. It was available earlier for people using OpenSolaris, Illumos, and derivatives. Wikipedia says that it was added to OpenSolaris toward the end of 2009 and first appeared in OpenSolaris build 128, released in early December of 2009.

Comments on this page:

By Opk at 2021-01-22 08:05:53:

I could be very wrong in my understanding of how zfs dedup works but I've often turned it on for the initial data population. So I turn it on, rsync or zfs send in my data and then I turn it off again. I don't care about memory usage during the initial setup of a system so I assume this is not doing much harm. No new data will be deduped afterwards because it isn't tracking hashes of blocks anymore and the DEDUP value will reduce over time. But it tends to stay fairly well above 1.00.

Solaris 11.4's zfs supports reflink copies which I'd guess uses the same mechanism underneath. I've attempted to use the -D option to zfs send a couple of times but it didn't appear to do much. It'd be nice to have some sort of offline dedup feature which could be run during quiet hours. Or a way to run the dedup on an idle replica machine and use the results to dedup data on the primary.

Have the massive price drops for SSDs since 2010 changed your opinion on this at all? It seems like the performance hit is quite bad if you have to do random seeks on a spinning disk, but it's OK on SSDs, and you can get a 100 GB SSD for $20 these days.

I'm curious if the problem is that it's still bad performance with an SSD, that it's too complicated/risky, or just that even at those prices your money is better spent on more spinning disks?

Ahrens talked at several conferences pre-COVID-19 about how deduplication in ZFS can still be good, and he's even drawn up a ten-thousand-foot overview on a bar napkin at some point. All that's really needed is for a company that wants to make use of deduplication to actually implement it.

If I recall correctly, it conceptually involves using something akin to L2ARC/SLOG or allocation classes, whereby deduplication recordkeeping data is stored on a (mirrored pair of) (NVMe) SSD(s) instead of in memory, reducing the size of each record (it used to be 300 bytes, now it's 70, and it can be even lower), along with making changes to it that could speed it up 100-1000x.

By Manek Dubash at 2021-01-26 07:21:25:

On the other hand, if you're using rsync for incremental backups, you achieve deduplication using the --link-dest=DIR option.

By John Wiersba at 2021-03-07 16:10:08:

@Manek Dubash, unfortunately, although deduplication using hardlinks will deduplicate the data, it will lose any distinctions in the metadata of the hardlinked files.

It does have one advantage over reflinked files, though. With hardlinks, it is possible to easily determine (via the inode number) the relationship between the files: are they hardlinked or not? With reflinked files, not only is it hard to see that two files are reflinked, but any unique ID which could be used to see that (such as a hash of the block IDs) is subject to change at any time by the filesystem, outside of the user's control. At least this is the case for btrfs.

