Deduplication is always going to be expensive

October 31, 2011

I've seen a reaction or two to my entry on ZFS deduplication that suggest that the reason ZFS dedup has significant issues is that the Solaris programmers did a bad job on it. I want to set the record straight: this is false. Deduplication is fundamentally not an easy operation and no form of deduplication can be fast unless it is fed huge amounts of resources.

Any deduplication system needs to maintain a dedup table in some form. It's the core data structure because the core operation in deduplication is 'given this new object to be written, do I already have a copy?' The performance of any dedup system depends on how fast it can look up and modify random entries in this table, while the size of this table depends primarily on how many unique entities the system is tracking. All dedup systems want to keep their dedup tables in RAM for good performance on writes and deletes.

(Dedup table IO is random because the dedup checksum has to be assumed to be randomly distributed. This is certainly the case for all good cryptographic hashes.)
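
(As a rough illustration of that core operation, here's a minimal sketch in Python of the write path against a toy in-memory dedup table keyed by a SHA-256 block checksum. Real dedup tables keep much more per-entry state and have to spill to disk, but the shape of the lookup is the same.)

    import hashlib

    # Toy in-memory dedup table: checksum -> (block address, reference count).
    dedup_table = {}

    def write_block(data, allocate_block):
        """Write one block, reusing an existing copy if its checksum is known."""
        key = hashlib.sha256(data).digest()
        entry = dedup_table.get(key)       # a random lookup, since good hashes
        if entry is not None:              # are effectively randomly distributed
            addr, refs = entry
            dedup_table[key] = (addr, refs + 1)
            return addr                    # duplicate: just bump the reference count
        addr = allocate_block(data)        # unique block: do the real write
        dedup_table[key] = (addr, 1)
        return addr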

Picking block level deduplication is not a ZFS mistake. Block level deduplication is more powerful and more effective than file level deduplication, at the cost of keeping track of more unique entities. Block level dedup is also much easier than file level dedup to add to a Unix filesystem for various reasons that all devolve to 'writing blocks is a fundamental filesystem activity and writing files is not'.

In short, dedup inherently requires bunches of RAM; the more you dedup, the more memory you need. Block level dedup requires more memory than file level dedup, but gets better results.
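
(To put rough numbers on this: a commonly quoted approximation for ZFS is on the order of 320 bytes of dedup table per unique block, and ZFS's default maximum block size is 128 KB. Both figures are illustrative assumptions here, not guarantees.)

    # Back-of-envelope dedup table sizing; the ~320 bytes per entry figure is
    # an often-quoted ZFS approximation and is only illustrative.
    def dedup_table_size(pool_bytes, avg_block_size=128 * 1024, entry_size=320):
        unique_blocks = pool_bytes // avg_block_size   # worst case: nothing dedups
        return unique_blocks * entry_size

    TB = 1024 ** 4
    GiB = 1024 ** 3
    # ~2.5 GiB of dedup table per TB of unique 128 KB blocks; smaller blocks
    # (or more unique data) make this proportionally worse.
    print(dedup_table_size(1 * TB) / GiB, "GiB per TB of unique blocks")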

There are only two and a half areas where ZFS dedup really makes life worse for itself. The half is that ZFS dedup is done at the pool level instead of the filesystem level, and pools are generally larger (sometimes significantly so) than you would make any single filesystem. All else being equal, the larger the entity that you do deduplication over, the more unique entities you'll be dealing with and the larger your dedup table will be.

(Actually this is only half true; if you have N terabytes of disk space and a fixed block size, you can't magically shrink the collective dedup tables you need by slicing the space up into smaller chunks. If anything, smaller chunks will result in more total dedup table space because there is no cross-filesystem sharing of duplicate content. What you gain is potentially needing less RAM to hold the dedup tables of the active filesystem areas, if only some of your filesystems see much write traffic.)

The genuinely unforced ZFS error is artificially limiting the dedup table to less than 25% of system RAM. I actually think that this is less significant than it looks, because I think it only costs you a factor of two or three in how large a ZFS pool you can dedup in RAM, and even that extra headroom is often not going to be enough.

The final ZFS issue is that deleting a snapshot is a uniquely bad operation for any dedup system because it causes a huge number of object deletions in a single action that's generally uninterruptible. ZFS could do better than it does, but only at the expense of mutating a snapshot that's in the process of being deleted (which would enable genuinely suspending a snapshot removal). The other way you can get this effect on any block level dedup system is to delete a sufficiently large file.

(Even today removing a sufficiently large file can be surprisingly slow on many filesystems, precisely because file removal doesn't report success until it has freed up all blocks associated with the file. It can take a significant amount of disk access just to determine what all of these blocks are for a large file.)
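
(Continuing the toy sketch from earlier: every freed block is another random dedup table update, and a snapshot deletion or a huge file removal queues up millions of them in one burst. This only illustrates the shape of the work, not how ZFS actually structures it.)

    def free_block(key, free_on_disk):
        """Release one reference to a deduplicated block."""
        addr, refs = dedup_table[key]        # another random table access
        if refs > 1:
            dedup_table[key] = (addr, refs - 1)
        else:
            del dedup_table[key]             # last reference: actually free the block
            free_on_disk(addr)

    def delete_file(block_keys, free_on_disk):
        # Deleting a large file (or a snapshot) becomes one table update per
        # block, all in a single effectively uninterruptible pass.
        for key in block_keys:
            free_block(key, free_on_disk)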


Comments on this page:

From 109.78.103.253 at 2011-10-31 09:03:32:

A couple of comments about the performance of file level dedup.

One can consider the size of a file as a first-level ID, i.e. one can exclude files with unique sizes from the set to checksum. That speeds up dedup a lot in most cases.
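
(For illustration, a minimal sketch of that size prefilter; only files that share a size with at least one other file ever get checksummed. Reading whole files into memory is just for brevity.)

    import collections, hashlib, os

    def candidate_duplicates(paths):
        """Group files by size, then checksum only sizes that occur more than once."""
        by_size = collections.defaultdict(list)
        for path in paths:
            by_size[os.path.getsize(path)].append(path)

        by_checksum = collections.defaultdict(list)
        for group in by_size.values():
            if len(group) < 2:
                continue                     # a unique size can't have a duplicate
            for path in group:
                with open(path, 'rb') as f:
                    digest = hashlib.sha256(f.read()).hexdigest()
                by_checksum[digest].append(path)
        return {d: g for d, g in by_checksum.items() if len(g) > 1}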

One thing that file systems could provide to help with file-level identification would be auto-expired attributes, i.e. a set of (extended) attributes that would be automatically invalidated on write(). You could cache checksums there, which would help with rsync too.

By cks at 2011-10-31 14:07:16:

I've been kind of assuming that computing the dedup checksum is free for a filesystem implementation, although as you note that's not quite the case for a file-based dedup system. My view is that even if you get the checksum for free, the dedup table issue is serious and makes dedup expensive.

(You can get a file checksum for free only if the file is written as a sequential stream. This is a common case but not the only one.)
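
(A toy illustration of why only sequential writes give you the checksum for free: a running hash can fold in each write as it goes by, but the first out-of-order write forces a full re-read later.)

    import hashlib

    class StreamChecksummer:
        """Maintain a running checksum only while writes stay strictly sequential."""
        def __init__(self):
            self._hash = hashlib.sha256()
            self._offset = 0                 # next expected write position
            self.valid = True

        def on_write(self, offset, data):
            if self.valid and offset == self._offset:
                self._hash.update(data)      # sequential: fold the data in for free
                self._offset += len(data)
            else:
                self.valid = False           # random write: must re-read the file later

        def digest(self):
            return self._hash.hexdigest() if self.valid else None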

From 138.92.10.18 at 2011-11-03 12:49:08:

I think it's important to point out that ZFS does in-line block-based deduplication, and that requires a sufficient amount of cache, which can be provided via RAM and an optional (but highly recommended) CACHE device (or striped CACHE devices; preferably SSD).

Compare it to NetApp's implementation of dedupe (A-SIS). A-SIS is also block-based, but it is post-process, meaning once a day a process will analyze the blocks for duplicates. That does require RAM, but not as much as an in-line implementation. NetApp also imposes limits on the size of a flexvol that can be deduped, based upon the model (essentially CPU and RAM).

ZFS, in its current state, is definitely a file system that requires some thought about the hardware configuration.

Since I started this with a comment, I'll redirect by making another one regarding ZFS and Solaris 11: Solaris 11 supports encrypted ZFS. :))

From 109.76.102.66 at 2011-11-08 07:37:04:

Your post on http://utcc.utoronto.ca/~cks/space/blog/unix/FundamentalFileOperation got me thinking about the races involved with the auto-invalidated attributes idea in comment 1 above.

If user space were updating the attributes while the file was being written to, it might add the wrong attribute values. Also, one has to handle multiple processes trying to update these attributes. So the process would have to be something like:

user_prog read attribute and ensure missing || ! 'locked' value
user_prog del attribute (if it was present)
user_prog set attribute to 'locked' (fails if attribute present)
user_prog read file and generate checksum
user_prog update attribute with checksum (fails if ! present)

I.e. to avoid races, we'd need the attribute 'set' and 'update' operations to behave atomically as above. Awkward. Also, what if the user space prog dies without clearing the locked value? Bah, looks like we need higher level functions provided by the file system for this, with the above operations done internally by the file system :( Or perhaps provide mandatory locking of these file attributes.
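
(As a purely illustrative sketch of the sequence above: on Linux, the create/replace extended-attribute flags make the individual 'set' and 'update' steps fail atomically, though the auto-invalidation on write() is still hypothetical and the stale-lock problem remains.)

    import hashlib, os

    ATTR = b'user.checksum'   # hypothetical attribute the filesystem would
                              # auto-invalidate on write(); that part doesn't exist

    def cache_checksum(path):
        # Steps 1-3: take the 'lock' by creating the attribute; os.XATTR_CREATE
        # makes the set fail if the attribute already exists.
        try:
            if os.getxattr(path, ATTR) == b'locked':
                return                        # someone else is already working on it
            os.removexattr(path, ATTR)        # drop the stale cached value
        except OSError:
            pass                              # attribute wasn't there at all
        os.setxattr(path, ATTR, b'locked', os.XATTR_CREATE)

        # Step 4: checksum the file contents.
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)

        # Step 5: os.XATTR_REPLACE fails if the attribute disappeared in the
        # meantime (i.e. the hypothetical auto-invalidation fired on a write).
        os.setxattr(path, ATTR, h.hexdigest().encode(), os.XATTR_REPLACE)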
