Why ZFS dedup is not something we can use

October 29, 2011

A commenter on my previous entry on enticing ZFS features suggested that ZFS deduplication (coming in Solaris 11) would be one of them. Unfortunately it's not; ZFS dedup is not something that we, or plenty of other sites, are going to be able to use in the near future, and probably not ever. The core problem is its impact on performance, management, and system requirements for ZFS servers (all of which are bad).

Because this is somewhat long, I'm going to put the conclusion up here: ZFS deduplication is not currently useful for general purpose ZFS pools of any decent size. ZFS dedup can be useful in certain restricted situations and setups, but it's not something that you can turn on today in any general purpose ZFS filesharing environment.

To understand why, let's talk about what deduplication requires.

You can do deduplication at (at least) two different levels: you can deduplicate identical files, or you can deduplicate identical blocks even when the overall files are not identical. ZFS's dedup is block level deduplication. In order to do deduplication at any level, you need to keep track of all of the unique objects you actually have: what their identity is (generally some cryptographic hash), where they actually live, and how many different places use them. Since ZFS does block level dedup, it needs to keep such a record for every unique block in a dedup'd ZFS pool. This collection of records is generally called the dedup table (DDT) for a pool.
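
To make the bookkeeping concrete, here's a minimal sketch of a block-level dedup table in Python. This is purely an illustration of the idea, not ZFS's actual data structures; the allocate and release callbacks stand in for the real block allocator, and SHA-256 is just the example hash.

   import hashlib

   # Conceptual sketch only: hash of a block -> [disk address, reference count].
   dedup_table = {}

   def write_block(data, allocate):
       """Write a block, reusing an existing identical block if we have one."""
       key = hashlib.sha256(data).digest()
       entry = dedup_table.get(key)
       if entry is not None:
           entry[1] += 1          # one more reference to the existing block
           return entry[0]        # no new space gets allocated
       addr = allocate(data)      # actually write the new block out
       dedup_table[key] = [addr, 1]
       return addr

   def free_block(data, release):
       """Drop one reference; only free the space when the count hits zero."""
       key = hashlib.sha256(data).digest()
       entry = dedup_table[key]
       entry[1] -= 1
       if entry[1] == 0:
           release(entry[0])
           del dedup_table[key]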

You need to work with the dedup table every time you write things (to dedup what you're writing) or remove things (to decrease the underlying block's reference count and possibly free it). If all of the bits of the dedup table that you need are in memory, performance is good. If the dedup table is on fast secondary storage (such as a low-latency SSD), performance is acceptable. If you have to go read bits of the dedup table from slow spinning magnetic disks, your performance is horrible; in the degenerate case, every block written or removed requires at least one extra disk seek and incurs a disk read delay.
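
To put rough numbers on that worst case: if every dedup table lookup has to go to a 7200 RPM disk, the extra read per block dominates everything else. The per-read cost here is an assumed, illustrative figure, not a measurement.

   # Back of the envelope; the 8 ms per random disk read is an assumption.
   ddt_miss_ms = 8.0                      # assumed cost of one DDT read from disk
   writes_per_second = 1000.0 / ddt_miss_ms

   block_size = 128 * 1024                # bytes; the default ZFS recordsize
   throughput_mb = writes_per_second * block_size / 1e6

   print("%.0f dedup'd block writes/sec, about %.1f MB/sec per disk"
         % (writes_per_second, throughput_mb))
   # -> roughly 125 writes/sec and ~16 MB/sec, and that is just the DDT
   #    lookups, before the actual data writes are done.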

With one entry per allocated ZFS block, we're talking about a lot of data and thus a lot of RAM to get good performance. Various sources suggest that the dedup table will use at least 652 MBytes of RAM per terabyte of unique data that you have. But wait, we're not done yet; the dedup table is metadata, and ZFS only allows metadata to use 25% of memory (cf). So if you have just 5 TB of unique data, that's over 3 GB of dedup table and thus a bare minimum of roughly 13 GB of system memory just to keep it all in RAM. In practice you will need even more RAM, because other ZFS metadata is going to want some of that 25% too.

(That 652 MBytes figure assumes that all of your files use 128 KB blocks. Whether or not this is true apparently depends on a number of factors, including how many small files you have; a pool with a lot of small files will have a lower average blocksize and so a larger dedup table per unique terabyte. See here for a more pessimistic and thus more alarming calculation.)
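
To see how much the average block size matters, here is a small back of the envelope scaling of that 652 MByte floor; since the per-entry size is essentially fixed, the dedup table grows in direct proportion as the average block size shrinks.

   # Scale the 652 MB/TB figure (which assumes 128 KB blocks) down to
   # smaller average block sizes; this is a rough linear scaling.
   DDT_MB_PER_TB_AT_128K = 652

   def ddt_mb_per_unique_tb(avg_block_kb):
       return DDT_MB_PER_TB_AT_128K * 128 / avg_block_kb

   for bs in (128, 64, 8):
       print("%3d KB blocks -> about %5.0f MB of dedup table per unique TB"
             % (bs, ddt_mb_per_unique_tb(bs)))
   # 128 KB -> 652 MB, 64 KB -> 1304 MB, 8 KB -> 10432 MB (about 10 GB),
   # and the 25% metadata limit multiplies the RAM you need by four.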

In practice, no one with decent-sized pools can afford to count on in-memory dedup tables, not unless you enjoy drastically overconfiguring your ZFS servers and then watching much of that memory go to waste. This makes L2ARC SSDs mandatory for dedup and puts them squarely on the critical path for performance. How big an L2ARC SSD do you need? That depends on your estimate of how many unique TBs of data you're going to have, but given the performance impact you don't want to underconfigure the SSD(s).
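
As a rough sizing sketch, using the 652 MByte per unique TB floor from above and an arbitrary safety factor; your real numbers depend on your blocksizes and on how much non-DDT data you want the L2ARC to hold.

   # Hedged sizing sketch; all of these inputs are assumptions.
   unique_tb = 10                    # your estimate of unique data, in TB
   ddt_mb_per_unique_tb = 652        # the 128 KB-block floor from above
   headroom = 2.0                    # growth plus non-DDT use of the L2ARC

   l2arc_gb = unique_tb * ddt_mb_per_unique_tb / 1024.0 * headroom
   print("plan on at least %.0f GB of L2ARC" % l2arc_gb)   # ~13 GB here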

(Our situation is particularly bad because our data is distributed among a lot of pools and each pool needs its own L2ARC. Even if we partition the SSD(s) and give each pool a partition for L2ARC, things do not really look good. Plus we'd be talking to the L2ARCs over iSCSI, which has a latency impact.)

To make ZFS dedup make sense, you need at least one of three things. First, you can have not very much data so that your dedup table easily fits into memory (or an L2ARC SSD you already have). Second, you can know that you have a high duplication ratio, so that dedup will save you a significant amount of disk space and you will not need as much space for a dedup table. Third, you can be strikingly insensitive to the performance of writing and removing data (ideally partly because you are doing very little of it).

None of these are true of general purpose ZFS pools of any significant size, which is why no one sensible is going to turn on ZFS dedup on them any time soon. In many ways, large general purpose pools (or sets of pools) are much better off getting more disk space by just getting more disks.

Sidebar: the system management issue

The system management issue is something that I've seen catch people over and over again: removing a snapshot in a dedup'd pool is an expensive operation. Every time you free a block in a dedup'd pool you must touch its dedup table entry (either to decrease the reference count or to delete it entirely), and removing a snapshot often frees up megabytes or gigabytes of data. On zfs-discuss, I've seen horror stories of snapshot removals on dedup'd pools taking days on systems without enough RAM and L2ARC. While this is happening your disks are maxed out on IOPS, probably killing performance for random IO. And you can't abort a snapshot removal once it's started; even rebooting the system won't help, because ZFS marks the snapshot in the pool as 'being removed, restart removal when this pool is re-activated'.
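
To see why 'days' is plausible, here's a hedged back of the envelope example; the amount freed, the average block size, and the disk IOPS figure are all illustrative assumptions.

   # Assumed, illustrative numbers for a snapshot removal on a cold DDT.
   freed_gb = 500                    # data the snapshot frees, assumed
   avg_block = 64 * 1024             # assumed average block size, in bytes
   disk_iops = 150                   # assumed random reads/sec from one disk

   blocks = freed_gb * 2**30 // avg_block
   hours = blocks / float(disk_iops) / 3600
   print("%d blocks to free, about %.1f hours of pure DDT I/O"
         % (blocks, hours))
   # -> about 8.2 million blocks and ~15 hours on a single disk; smaller
   #    blocks, more freed data, or competing IO easily pushes this to days.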


Comments on this page:

From 109.78.13.166 at 2011-10-29 12:02:49:

I completely agree. Block level dedup is very specialized and I'm surprised it's being introduced to general file systems. Though I'm biased, I suppose, having written a file dedup tool (fslint).

From 69.158.18.227 at 2011-10-29 14:04:54:

Another thing to add is that even if you have an SSD, the L2ARC entries on it still need to take up a bit of memory. So while having dedupe table (DDT) entries in RAM takes roughly 300-500 bytes per block, having them on the SSD does not bring the per-block space down to zero.

This is because you still need a (~180B) structure in RAM pointing to the 'real' entry in the L2ARC (i.e., the ARC entry points to the L2ARC entry). So while having an L2ARC is helpful, it's not a panacea.

The size of each DDT entry depends on the release of Solaris/ZFS one is running. For any particular kernel, you can get the size (as a hex value) via:

   echo ::sizeof ddt_entry_t | mdb -k

To find the size of the ARC reference that points to the L2ARC, you do a:

   echo ::sizeof arc_buf_hdr_t | mdb -k

A good summary of how to get a rough idea of the size you need can be found in this message:

   http://mail.opensolaris.org/pipermail/zfs-discuss/2011-May/048185.html

And as Chris mentions, on top of the DDT memory requirements, you also need RAM for the regular read and write caching.

Some threads on the topic:

   http://mail.opensolaris.org/pipermail/zfs-discuss/2011-April/thread.html#48026
   http://mail.opensolaris.org/pipermail/zfs-discuss/2011-May/thread.html#48185
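
To make that concrete, here is a rough sketch of the arithmetic; the two sizes are placeholders, so plug in whatever the mdb commands above report on your kernel.

   # Illustrative only; substitute the sizes mdb reports on your system.
   ddt_entry_bytes = 0x178        # hypothetical ::sizeof ddt_entry_t output
   arc_hdr_bytes = 180            # approximate ARC header per L2ARC entry
   unique_blocks = 5 * 2**40 // (128 * 1024)   # e.g. 5 TB at 128 KB blocks

   ram_all_in_core = unique_blocks * ddt_entry_bytes / 1024.0**3
   ram_headers_only = unique_blocks * arc_hdr_bytes / 1024.0**3
   print("DDT fully in RAM: ~%.1f GB; headers for an L2ARC-resident DDT: ~%.1f GB"
         % (ram_all_in_core, ram_headers_only))
   # With these made-up sizes: ~14.7 GB in core vs ~7.0 GB of headers, so
   # the L2ARC helps a lot but the RAM cost does not go to zero.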

From 50.82.222.56 at 2011-10-29 20:30:28:

Are you using ZFS compression? I've had pretty good experience with the default lzjb compression saving space.

By cks at 2011-10-30 01:37:30:

We haven't tried turning on compression, and I haven't looked at any issues involved there.
