Wandering Thoughts archives


Why ZFS dedup is not something we can use

A commenter on my previous entry on enticing ZFS features suggested that ZFS deduplication (coming in Solaris 11) would be one of them. Unfortunately it's not; ZFS dedup is not something that we or any number of other sites are going to be able to use in the near future, and probably not ever. The core problem is its impact on the performance, management, and system requirements of ZFS servers (all three impacts are bad).

Because this is somewhat long, I'm going to put the conclusion up here: ZFS deduplication is not currently useful for general purpose ZFS pools of any decent size. ZFS dedup can be useful in certain restricted situations and setups, but it's not something that you can turn on today in any general purpose ZFS filesharing environment.

To understand why, let's talk about what deduplication requires.

You can do deduplication at two different levels (at least); you can deduplicate identical files or you can deduplicate identical blocks (even if the overall files are not identical). ZFS's dedup is block level deduplication. In order to do deduplication (at any level), you need to keep track of all of the unique objects you actually have; what their identification is (generally some cryptographic hash), where they actually live, and how many different places use this object. Since ZFS is doing block level dedup, it needs to keep a record for every unique block in a dedup'd ZFS pool. This is generally called the dedup table for a pool.
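To make the bookkeeping concrete, here is a minimal sketch of a block-level dedup table in Python. It is a toy, not how ZFS actually stores things; the names `DedupStore` and `DedupEntry` are made up for illustration, SHA-256 simply stands in for "some cryptographic hash", and an in-memory dict stands in for the disk.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class DedupEntry:
    location: int  # where the single stored copy lives (hypothetical block address)
    refcount: int  # how many different places use this block


class DedupStore:
    """Toy block-level dedup store; an illustration, not ZFS's on-disk format."""

    def __init__(self):
        self.table = {}   # block hash -> DedupEntry; this is the "dedup table"
        self.blocks = {}  # location -> block data; stands in for the disk
        self.next_location = 0

    def write(self, data: bytes) -> str:
        """Store a block, reusing the existing copy if we already have one."""
        key = hashlib.sha256(data).hexdigest()
        entry = self.table.get(key)
        if entry is not None:
            # Duplicate block: no new data is stored, only the count changes.
            entry.refcount += 1
        else:
            self.blocks[self.next_location] = data
            self.table[key] = DedupEntry(self.next_location, 1)
            self.next_location += 1
        return key

    def free(self, key: str) -> None:
        """Drop one reference; release the block when nothing uses it."""
        entry = self.table[key]
        entry.refcount -= 1
        if entry.refcount == 0:
            del self.blocks[entry.location]
            del self.table[key]


store = DedupStore()
a = store.write(b"x" * 4096)
b = store.write(b"x" * 4096)  # same contents, so same table entry
c = store.write(b"y" * 4096)
print(len(store.table), store.table[a].refcount)  # prints: 2 2
```

Note that both `write` and `free` have to look up the table entry before they can do anything else, which is why where the table lives matters so much.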

You need to work with the dedup table every time you write things (to dedup what you're writing) or remove things (to decrease the underlying block's reference count and possibly free it). If all of the bits of the dedup table that you need are in memory, performance is good. If the dedup table is on fast secondary storage (such as a low-latency SSD), performance is acceptable. If you have to go read bits of the dedup table from slow spinning magnetic disks, your performance is horrible; in the degenerate case, every block written or removed requires at least one extra disk seek and incurs a disk read delay.

With one entry per allocated ZFS block, we're talking about a lot of data and thus a lot of RAM to get good performance. Various sources suggest that the dedup table will use at least 652 MBytes of RAM per terabyte of unique data that you have. But wait, we're not done yet; the dedup table is metadata, and ZFS only allows metadata to use 25% of memory (cf). So if you have just 5 TB of unique data, the dedup table is about 3.2 GB and the minimum system memory to have it all in memory is roughly 13 GB. In practice you will need even more RAM because other ZFS metadata is going to want some of that space.

(That 652 MBytes figure assumes that all of your files use 128 KB blocks. Whether or not this is true apparently depends on a number of factors, including how many small files you have; a pool with a lot of small files will have a lower average blocksize and so a larger dedup table per unique terabyte. See here for a more pessimistic and thus more alarming calculation.)
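The arithmetic is simple enough to sketch in Python, using the figures quoted in this entry (652 MBytes per unique terabyte at 128 KB blocks, metadata limited to 25% of memory). The function names are mine, and the scaling for smaller blocks is an assumption: I'm simply assuming the table grows in inverse proportion to the average block size, since there is one entry per block.

```python
MB_PER_UNIQUE_TB = 652     # figure quoted in this entry, for 128 KB blocks
METADATA_FRACTION = 0.25   # share of memory that metadata may use
REFERENCE_BLOCK_KB = 128   # block size the 652 MByte figure assumes


def dedup_table_mb(unique_tb, avg_block_kb=REFERENCE_BLOCK_KB):
    """Estimated dedup table size in MBytes.

    Assumption: one entry per block, so halving the average block
    size doubles the table.
    """
    return unique_tb * MB_PER_UNIQUE_TB * (REFERENCE_BLOCK_KB / avg_block_kb)


def min_system_ram_gb(unique_tb, avg_block_kb=REFERENCE_BLOCK_KB):
    """Smallest system memory (GB) where the table fits in the metadata share."""
    return dedup_table_mb(unique_tb, avg_block_kb) / METADATA_FRACTION / 1024


print(f"{min_system_ram_gb(5):.1f} GB")      # 5 TB at 128 KB blocks: 12.7 GB
print(f"{min_system_ram_gb(5, 64):.1f} GB")  # 5 TB at 64 KB average: 25.5 GB
```

These are lower bounds under the stated assumptions; they leave no room for any other ZFS metadata.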

In practice, no one with decent sized pools can afford to count on in-memory dedup tables unless they enjoy drastically overconfiguring their ZFS servers and then seeing much of the memory wasted. This makes L2ARC SSDs mandatory for dedup and also puts them on the critical path for performance. How big an L2ARC SSD do you need? That depends on your estimate of how many unique TBs of data you're going to have, but given the performance impacts you don't want to underconfigure the SSD(s).

(Our situation is particularly bad because our data is distributed among a lot of pools and each pool needs its own L2ARC. Even if we partition the SSD(s) and give each pool a partition for L2ARC, things do not really look good. Plus we'd be talking to the L2ARCs over iSCSI, which has a latency impact.)

To make ZFS dedup make sense, you need at least one of three things. First, you can have not very much data so that your dedup table easily fits into memory (or an L2ARC SSD you already have). Second, you can know that you have a high duplication ratio, so that dedup will save you a significant amount of disk space and you will not need as much space for a dedup table. Third, you can be strikingly insensitive to the performance of writing and removing data (ideally partly because you are doing very little of it).

None of these are true of general purpose ZFS pools of any significant size, which is why no one sensible is going to turn on ZFS dedup on them any time soon. In many ways, large general purpose pools (or sets of pools) are much better off getting more disk space by just getting more disks.

Sidebar: the system management issue

The system management issue is something that I've seen catch people over and over again: removing a snapshot in a dedup'd pool is an expensive operation. Every time you free a block in a dedup'd pool you must touch its dedup table entry (either to decrease the reference count or to delete it entirely), and removing a snapshot often frees up megabytes or gigabytes of data. On zfs-discuss, I've seen horror stories of snapshot removals on dedup'd pools taking days on systems without enough RAM and L2ARC. While this is happening your disks are maxed out on IOPS, probably killing performance for random IO. And you can't abort a snapshot removal once it's started; even rebooting the system won't help, because ZFS marks the snapshot in the pool as 'being removed, restart removal when this pool is re-activated'.
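A back-of-the-envelope sketch in Python shows how 'days' can happen. Everything here is an assumption for illustration, not a measurement: I assume one random disk read per freed block whose dedup table entry isn't cached, a hypothetical 100 random IOPS available for those reads, and I ignore the cost of writing the updated entries back out.

```python
def snapshot_removal_hours(freed_gb, avg_block_kb=128, random_iops=100,
                           cached_fraction=0.0):
    """Rough time to free `freed_gb` of data from a dedup'd pool.

    Assumptions (illustrative only): each freed block needs one dedup
    table lookup, and every lookup that misses RAM/L2ARC costs one
    random disk read at `random_iops` reads per second.
    """
    blocks = freed_gb * 1024 * 1024 / avg_block_kb
    disk_reads = blocks * (1.0 - cached_fraction)
    return disk_reads / random_iops / 3600


# Freeing 1 TB of 128 KB blocks with nothing cached: about 23 hours.
print(f"{snapshot_removal_hours(1024):.1f} hours")
# The same 1 TB with a 16 KB average block size: about 7.8 days.
print(f"{snapshot_removal_hours(1024, avg_block_kb=16) / 24:.1f} days")
# With 90% of the table entries found in RAM or L2ARC: about 2.3 hours.
print(f"{snapshot_removal_hours(1024, cached_fraction=0.9):.1f} hours")
```

The exact numbers don't matter much; the point is that the time scales with the number of blocks freed, not with anything you can easily see when you type the snapshot removal command.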

solaris/ZFSDedupMemoryProblem written at 02:47:26
