Deduplication is always going to be expensive
I've seen a reaction or two to my entry on ZFS deduplication that suggest that the reason ZFS dedup has significant issues is that the Solaris programmers did a bad job on it. I want to set the record straight: this is false. Deduplication is fundamentally not an easy operation and no form of deduplication can be fast unless it is fed huge amounts of resources.
Any deduplication system needs to maintain a dedup table in some form. It's the core data structure because the core operation in deduplication is 'given this new object to be written, do I already have a copy?' The performance of any dedup system depends on how fast it can look up and modify random entries in this table, while the size of this table depends primarily on how many unique entities the system is tracking. All dedup systems want to keep their dedup tables in RAM for good performance on writes and deletes.
(Dedup table IO is random because the dedup checksum has to be assumed to be randomly distributed. This is certainly the case for all good cryptographic hashes.)
Picking block level deduplication is not a ZFS mistake. Block level deduplication is more powerful and more effective than file level deduplication, at the cost of keeping track of more unique entities. Block level dedup is also much easier than file level dedup to add to a Unix filesystem for various reasons that all devolve to 'writing blocks is a fundamental filesystem activity and writing files is not'.
In short, dedup inherently requires bunches of RAM; the more you dedup, the more memory. Block level dedup requires more memory than file level dedup, but gets better results.
There are only two and a half areas where ZFS dedup really makes life worse for itself. The half is that ZFS dedup is done at the pool level instead of the filesystem level and pools are general larger (sometimes significantly so) than you would make any single filesystem. All else being equal, the larger the entity that you do deduplication over, the more unique entities you'll be dealing with and the larger your dedup table will be.
(Actually this is only half true; if you have N terabytes of disk space and a fixed block size, you can't magically shrink the collective dedup tables you need by slicing the space up into smaller chunks. If anything, smaller chunks will result in more total dedup table space because there is no cross-filesystem sharing of duplicate content. What you gain is potentially needing less RAM to hold the dedup tables of the active filesystem areas, if only some of your filesystems see much write traffic.)
The genuinely unforced ZFS error is artificially limiting how much RAM the dedup table can take up to less than 25% of system RAM. I actually think that this is less significant than it looks, because I think that this only costs you a factor of two or three for how large a ZFS pool you can dedup in RAM and this is often not going to be enough.
The final ZFS issue is that deleting a snapshot is a uniquely bad operation for any dedup system because it causes a huge amount of object deletions in a single action that's generally uninterruptible. ZFS could do better than it does, but only at the expense of mutating a snapshot that's in the process of being deleted (which would enable genuinely suspending a snapshot removal). The other way you can get this effect on any block level dedup system is to delete a sufficiently large file.
(Even today removing a sufficiently large file can be surprisingly slow on many filesystems, precisely because file removal doesn't report success until it has freed up all blocks associated with the file. It can take a significant amount of disk access just to determine what all of these blocks are for a large file.)