Some views and notes on ZFS deduplication today
I recently wrote an entry about a lingering sign of old hopes for ZFS deduplication, and got a number of good comments on it that I have reactions and views about. First off, Opk said:
I could be very wrong in my understanding of how zfs dedup works but I've often turned it on for the initial data population. So I turn it on, rsync or zfs send in my data and then I turn it off again. I don't care about memory usage during the initial setup of a system so I assume this is not doing much harm. [...]
How much potential harm this does depends on what you do with the data that was written with deduplication on. If you leave the data sitting there, this is relatively harmless. However, if you delete the data (including overwriting data in files 'in place'), then ZFS must update the DDT (deduplication table) to correctly maintain the reference count of each unique data block. If you don't have enough memory to hold all of the DDT, then this is going to require disk reads to page chunks of it in and out. The amount of reading and slowdown goes up as you delete more and more data at once, for example if you delete an entire snapshot or filesystem.
(This is a classic surprise issue with deduplication, going back to its early days. People are disconcerted when operations like 'zfs destroy <snapshot>' sit there for ages, or at least run in the background for ages even if the command returns immediately.)
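To make the bookkeeping concrete, here's a toy sketch of what deduplicated deletion has to do for every freed block. This is purely illustrative Python with made-up names, not ZFS's actual code or on-disk structures; the point is just that each freed block needs its DDT entry looked up (and possibly paged in from disk) before its reference count can be decremented.

    # A toy model of deduplicated deletion, not ZFS's real code or data
    # structures; all names here are invented for illustration.

    class ToyDDT:
        """Maps block checksums to reference counts, like ZFS's DDT."""
        def __init__(self, table):
            self.on_disk = table    # the full DDT, keyed by block checksum
            self.in_ram = {}        # the portion currently cached in RAM
            self.disk_reads = 0     # synchronous reads to page entries in

        def _page_in(self, checksum):
            # An entry that isn't cached has to be read from the pool
            # before we can touch it; this is the extra IO big deletes do.
            self.disk_reads += 1
            self.in_ram[checksum] = self.on_disk[checksum]

        def release(self, checksum):
            # Freeing a block means decrementing its reference count;
            # only when the count reaches zero is space actually freed.
            if checksum not in self.in_ram:
                self._page_in(checksum)
            self.in_ram[checksum] -= 1
            return self.in_ram[checksum] == 0

    def destroy_dataset(block_checksums, ddt):
        # Deleting a snapshot or filesystem walks every block it used, so
        # the DDT lookups (and the paging) scale with the amount deleted.
        freed = sum(ddt.release(c) for c in block_checksums)
        return freed, ddt.disk_reads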
Brendan Long asked:
Have the massive price drops for SSD's since 2010 change your opinion on this at all? It seems like the performance hit is quite bad if you have to do random seeks on a spinning disk, but it's ok on SSD's, and you can get a 100 GB SSD for $20 these days.
I'm not sure if it's okay on SSDs, so here's my view. Reads aren't slowed by being deduplicated, but writes (and deletes) require a synchronous check of the DDT for every block, which means a synchronous SSD read IO if the necessary section of the DDT isn't in RAM. It's not clear to me what latency SSDs have for isolated synchronous reads, but my vaguely measured numbers suggest that we should assume at least a couple of milliseconds per read.
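(If you want a feel for this number on your own hardware, something like the following rough measurement works. It's only a sketch, assuming Linux, O_DIRECT support, and read access to the device; the device path is a placeholder.)

    # Rough sketch of measuring isolated synchronous read latency on an
    # SSD, assuming Linux and read permission on the device. O_DIRECT is
    # used to bypass the page cache; the device path is a placeholder.
    import os, mmap, random, time

    path = "/dev/sdX"             # placeholder: the SSD to test
    blk = 4096                    # read size; O_DIRECT wants aligned IO
    buf = mmap.mmap(-1, blk)      # anonymous mmap is page-aligned

    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    size = os.lseek(fd, 0, os.SEEK_END)

    samples = []
    for _ in range(1000):
        off = random.randrange(0, size // blk) * blk
        os.lseek(fd, off, os.SEEK_SET)
        start = time.perf_counter()
        os.readv(fd, [buf])       # one synchronous, uncached read
        samples.append(time.perf_counter() - start)
    os.close(fd)

    samples.sort()
    print("median latency: %.2f ms" % (samples[len(samples) // 2] * 1000))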
I haven't read the ZFS code, so I don't know if it performs DDT checking serially as it processes each block being written or deleted (which would be a natural approach), or if it somehow batches the checks up to issue them in parallel. If DDT checks are fully serial and you have to go to the SSD on each one, you're looking at a write or delete rate of at most a thousand blocks a second. If you're dealing with 128 KB blocks (the typical maximum ZFS recordsize), that works out to about 125 MBytes a second. This is okay but not all that impressive for a SSD, and it would mean that deleting large objects could still take quite a while to complete.
(Deleting 100 GB might take over 13 minutes, for example.)
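To spell that out, here is the back of the envelope arithmetic. The one millisecond per serial DDT read is an assumption chosen to match the 'at most a thousand blocks a second' ceiling above, not a measurement of real ZFS behavior.

    # Back of the envelope numbers for fully serial DDT checks; the 1 ms
    # per synchronous read and the 128 KB block size are assumptions.
    block_size = 128 * 1024                   # bytes, typical max recordsize
    blocks_per_sec = 1000                     # ~1 / (1 ms per serial DDT read)

    write_rate = blocks_per_sec * block_size  # bytes per second
    print(write_rate / (1024 * 1024))         # 125.0 MBytes a second

    delete_size = 100 * 1024**3               # deleting 100 GB of data
    minutes = (delete_size / block_size) / blocks_per_sec / 60
    print(minutes)                            # ~13.7 minutes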
On the other hand, if we assume that a typical SATA 6 Gbits/s SSD has a sustained write bandwidth of 550 Mbytes/sec, you only need around 4,400 DDT checks a second in order to hit that data rate for writing out 128 KB ZFS blocks. In practice you're probably not going to get 550 Mbytes/sec of user level write bandwidth out of a deduplicated ZFS pool on a single SSD, because both the necessary DDT writes and the DDT reads will take up some of the bandwidth to and from the SSD (even if the DDT is entirely in RAM, it gets updated on writes and deletes and those updates have to be written back to the SSD).
(This also implies that 4,400 written out DDT blocks a second is about the maximum you can do on a single SSD, for deletes. But I expect that writing out updated DDT entries for deletes is batched and generally doesn't touch that many different blocks of the DDT.)
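(The 4,400 figure is the same sort of arithmetic, with the 550 Mbytes/sec write bandwidth as the assumed input:)

    # DDT checks per second needed to keep up with an assumed 550
    # Mbytes/sec of sustained SSD write bandwidth at 128 KB ZFS blocks.
    ssd_write_bw = 550 * 1024 * 1024          # bytes per second (assumed)
    block_size = 128 * 1024                   # bytes per ZFS block
    print(ssd_write_bw / block_size)          # 4400.0 DDT checks a second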
On the whole, I think that there are enough uncertainties about the performance of deduplicated ZFS pools even on SSDs that I wouldn't want to build one for general use. I'd say 'without a good amount of testing', but I'm not sure that testing would convince me that I wouldn't run into a corner case in ordinary use after long enough.