2011-10-29
Why ZFS dedup is not something we can use
A commentator on my previous entry on enticing ZFS features suggested that ZFS deduplication (coming in Solaris 11) would be one of them. Unfortunately it's not; ZFS dedup is not something that we or any number of other sites are going to be able to use in the near future, and probably not ever. The core problem is its impact on performance, management, and system requirements for ZFS servers (all of them are bad).
Because this is somewhat long, I'm going to put the conclusion up here: ZFS deduplication is not currently useful for general purpose ZFS pools of any decent size. ZFS dedup can be useful in certain restricted situations and setups, but it's not something that you can turn on today in any general purpose ZFS filesharing environment.
To understand why, let's talk about what deduplication requires.
You can do deduplication at (at least) two different levels: you can deduplicate identical files or you can deduplicate identical blocks (even if the overall files are not identical). ZFS's dedup is block level deduplication. In order to do deduplication at any level, you need to keep track of all of the unique objects you actually have: what their identification is (generally some cryptographic hash), where they actually live, and how many different places use each object. Since ZFS does block level dedup, it needs to keep a record for every unique block in a dedup'd ZFS pool. This is generally called the dedup table for a pool.
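To make the bookkeeping concrete, here is a minimal Python sketch of what a block level dedup table has to track. This is an illustration of the idea, not ZFS's actual DDT layout, and the class and names are made up; ZFS really does key its dedup table on the block checksum (SHA-256 by default when dedup is enabled).

    import hashlib

    class DedupTable:
        """Toy block-level dedup bookkeeping: one entry per unique block."""

        def __init__(self):
            # block checksum -> [location of the stored copy, reference count]
            self.entries = {}

        def write_block(self, data, new_location):
            key = hashlib.sha256(data).digest()
            entry = self.entries.get(key)
            if entry is not None:
                # Duplicate block: bump the refcount and reuse the existing copy.
                entry[1] += 1
                return entry[0]
            # New unique block: record where it lives with a refcount of 1.
            self.entries[key] = [new_location, 1]
            return new_location

        def free_block(self, data):
            key = hashlib.sha256(data).digest()
            entry = self.entries[key]
            entry[1] -= 1
            if entry[1] == 0:
                # Last reference is gone; the underlying block can be freed.
                del self.entries[key]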
You need to work with the dedup table every time you write things (to dedup what you're writing) or remove things (to decrease the underlying block's reference count and possibly free it). If all of the bits of the dedup table that you need are in memory, performance is good. If the dedup table is on fast secondary storage (such as a low-latency SSD), performance is acceptable. If you have to go read bits of the dedup table from slow spinning magnetic disks, your performance is horrible; in the degenerate case, every block written or removed requires at least one extra disk seek and incurs a disk read delay.
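To put a rough number on that degenerate case: if every block written needs one extra random read to pull in its dedup table entry, the disk's random read rate becomes a hard ceiling on write throughput. The ~100 random reads a second for a single 7200 RPM disk and the block sizes below are my assumptions, purely for illustration.

    # One extra random DDT read per block written caps write throughput.
    def dedup_write_ceiling_mbs(block_size_kb, ddt_reads_per_sec=100):
        return block_size_kb * ddt_reads_per_sec / 1024.0

    for bs in (128, 32, 8):
        print(f"{bs:3d} KB blocks: ~{dedup_write_ceiling_mbs(bs):5.1f} MB/s write ceiling")
    # 128 KB blocks: ~ 12.5 MB/s write ceiling
    #  32 KB blocks: ~  3.1 MB/s write ceiling
    #   8 KB blocks: ~  0.8 MB/s write ceiling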
With one entry per allocated ZFS block, we're talking about a lot of data and thus a lot of RAM to get good performance. Various sources suggest that the dedup table will use at least 652 MBytes of RAM per terabyte of unique data that you have. But wait, we're not done yet; the dedup table is metadata, and ZFS only allows metadata to use 25% of memory (cf). So with just 5 TB of unique data you have over 3 GB of dedup table and need roughly 13 GB of RAM simply to be able to keep it all in memory, and at the 10 TB or more of unique data that a decent sized set of pools can easily reach, you're quickly into needing a 32 GB system. In practice you will need even more RAM because other ZFS metadata is going to want some of that space.
(That 652 MBytes figure assumes that all of your files use 128 KB blocks. Whether or not this is true apparently depends on a number of factors, including how many small files you have; a pool with a lot of small files will have a lower average blocksize and so a larger dedup table per unique terabyte. See here for a more pessimistic and thus more alarming calculation.)
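Here's that sizing arithmetic as a small sketch, using the ~652 MByte per unique terabyte figure (valid for 128 KB blocks) and the 25% metadata limit; the scaling for smaller average block sizes is my own crude 'one entry per block' extrapolation, not an official formula.

    MB_PER_UNIQUE_TB = 652      # figure quoted above, for 128 KB blocks

    def ddt_size_gb(unique_tb, avg_block_kb=128):
        # Smaller average blocks mean more entries per unique terabyte.
        return unique_tb * MB_PER_UNIQUE_TB * (128.0 / avg_block_kb) / 1024.0

    def ram_floor_gb(unique_tb, avg_block_kb=128):
        # Metadata may only use 25% of memory, so multiply by four.
        return 4 * ddt_size_gb(unique_tb, avg_block_kb)

    for tb in (1, 5, 10):
        for bs in (128, 64):
            print(f"{tb:2d} TB unique, {bs:3d} KB avg blocks: "
                  f"DDT ~{ddt_size_gb(tb, bs):4.1f} GB, "
                  f"RAM floor ~{ram_floor_gb(tb, bs):4.1f} GB")
    # e.g.  5 TB at 128 KB blocks: DDT ~3.2 GB,  RAM floor ~12.7 GB
    #      10 TB at  64 KB blocks: DDT ~12.7 GB, RAM floor ~50.9 GB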
In practice, no one with decent sized pools can afford to count on in-memory dedup tables, not unless you enjoy drastically overconfiguring your ZFS servers and then seeing much of the memory wasted. This makes L2ARC SSDs mandatory for dedup and also makes them a critical path for performance. How big an L2ARC SSD do you need? That depends on your estimate of unique TBs of data you're going to have, but given the performance impacts you don't want to underconfigure the SSD(s).
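For a very rough sense of L2ARC sizing you can treat the in-RAM dedup table figure as a proxy for what the SSD must hold and add some headroom for the L2ARC's normal job of caching data; the headroom and the 10 TB estimate below are made-up numbers for illustration, not a recommendation.

    MB_PER_UNIQUE_TB = 652      # same figure as above, 128 KB blocks

    def min_l2arc_gb(expected_unique_tb, cache_headroom_gb=20):
        ddt_gb = expected_unique_tb * MB_PER_UNIQUE_TB / 1024.0
        return ddt_gb + cache_headroom_gb

    print(f"~{min_l2arc_gb(10):.0f} GB of L2ARC for 10 TB of unique data")
    # about 6.4 GB for the dedup table itself plus 20 GB of cache headroom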
(Our situation is particularly bad because our data is distributed among a lot of pools and each pool needs its own L2ARC. Even if we partition the SSD(s) and give each pool a partition for L2ARC, things do not really look good. Plus we'd be talking to the L2ARCs over iSCSI, which has a latency impact.)
To make ZFS dedup make sense, you need at least one of three things. First, you can have not very much data so that your dedup table easily fits into memory (or an L2ARC SSD you already have). Second, you can know that you have a high duplication ratio, so that dedup will save you a significant amount of disk space and you will not need as much space for a dedup table. Third, you can be strikingly insensitive to the performance of writing and removing data (ideally partly because you are doing very little of it).
None of these are true of general purpose ZFS pools of any significant size, which is why no one sensible is going to turn on ZFS dedup on them any time soon. In many ways, large general purpose pools (or sets of pools) are much better off getting more disk space by just getting more disks.
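One way to see why more disks usually win is to compare the extra RAM and SSD that dedup demands against simply buying the disk space it would save you. All of the prices and the 20% dedup ratio below are hypothetical round numbers for illustration, not real quotes.

    def dedup_hardware_cost(unique_tb, ram_per_gb=15.0, ssd_per_gb=2.0):
        # Extra RAM (25% metadata limit) and L2ARC SSD to hold the dedup table.
        ddt_gb = unique_tb * 652 / 1024.0
        return 4 * ddt_gb * ram_per_gb + ddt_gb * ssd_per_gb

    def plain_disk_cost(tb, disk_per_tb=100.0):
        return tb * disk_per_tb

    unique_tb = 10
    dedup_ratio = 1.2           # assume dedup saves 20% on general purpose data
    saved_tb = unique_tb * (dedup_ratio - 1)
    print(f"dedup hardware: ~${dedup_hardware_cost(unique_tb):.0f}  "
          f"vs  plain disk for the saved space: ~${plain_disk_cost(saved_tb):.0f}")
    # dedup hardware: ~$395  vs  plain disk for the saved space: ~$200

And that is before you count the performance and management costs, which for us are the real showstoppers.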
Sidebar: the system management issue
The system management issue is something that I've seen catch people over and over again: removing a snapshot in a dedup'd pool is an expensive operation. Every time you free a block in a dedup'd pool you must touch its dedup table entry (either to decrease the reference count or to delete it entirely), and removing a snapshot often frees up megabytes or gigabytes of data. On zfs-discuss, I've seen horror stories of snapshot removals on dedup'd pools taking days on systems without enough RAM and L2ARC. And while this is happening your disks are maxed out on IOPS, probably killing performance for random IO. And you can't abort a snapshot removal once it's started; even rebooting the system won't help, because ZFS marks the snapshot in the pool as 'being removed, restart removal when this pool is re-activated'.
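For a rough sense of the scale involved, assume the dedup table has to be read from spinning disk at about 100 random reads a second and that the freed data averages 128 KB blocks (both are assumptions, purely for illustration):

    def snapshot_removal_hours(freed_gb, avg_block_kb=128, ddt_ios_per_sec=100):
        # Each freed block needs its dedup table entry read and updated.
        blocks = freed_gb * 1024 * 1024 / avg_block_kb
        return blocks / ddt_ios_per_sec / 3600.0

    for gb in (10, 100, 1000):
        print(f"freeing {gb:5d} GB: ~{snapshot_removal_hours(gb):5.1f} hours")
    # freeing    10 GB: ~  0.2 hours
    # freeing   100 GB: ~  2.3 hours
    # freeing  1000 GB: ~ 22.8 hours

Shrink the average block size or stack up several snapshots to delete at once and you are quickly into the multi-day horror stories.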
2011-10-28
ZFS features that could entice us to upgrade Solaris versions
I've written before about how our fileservers are basically appliances and so don't get patched, because we don't like taking any risks of destabilizing vital core services. They currently run what is more or less Solaris 10 Update 8 (which causes some problems), and today I've gotten interested in inventorying what features from subsequent Solaris versions might be attractive enough to make us upgrade.
(Note that there are a lot of ZFS features in S10U9 and S10U10 that will be attractive to other people but not us. I'm being selfish and just looking at what we care about.)
Solaris 10 Update 9 introduced two important changes: log device removal and pool recovery. We don't currently use log devices because we don't think that we have any pool that could really benefit from them (especially once we add iSCSI overhead on top of the log devices), but if we ever did need to add an SSD log device to a hot pool, we'd want this change.
My impression so far is that pool recovery does not require a ZFS pool version upgrade, and so you can perform it by just keeping a spare S10U9 system around (or an S10U10 one). Perhaps we should build such a system, just in case. And certainly it might be a good idea to test this assumption.
Solaris 10 Update 10 adds more improvements to pool recovery (and a lot of features that we don't care about). Again, it's not clear to me if this recovery works on old pool versions or if you have to upgrade your pools to the new pool version first.
The more I look at this, the more I think that I need to build a current Solaris install just to have it sitting around. Fortunately we have plenty of spare hardware.
(This is one of the powers of blogging. Initially I set out to write a rather different entry for today, but when I started doing my research everything wound up shifting around and now I have a new project at work.)
Sidebar: Solaris release information resources
Since I found this once: Solaris 10 update 10 (8/11), Solaris 10 update 9 (9/10). Let's hope Oracle doesn't decide to change the URLs for these again. Note that these are not complete feature lists; they don't mention things like ZFS performance improvements.
Also, ZFS pool and filesystem versions, although that doesn't cover S10U10. It points to the ZFS file system guide, which has a what's-new feature list.