Chris's Wiki :: blog/tech/ExpensiveDeduplication Comments
Recent comments in Chris's Wiki :: blog/tech/ExpensiveDeduplication.

From 109.76.102.66 on /blog/tech/ExpensiveDeduplication:
<div class="wikitext"><p>Your post on <a href="http://utcc.utoronto.ca/~cks/space/blog/unix/FundamentalFileOperation">http://utcc.utoronto.ca/~cks/space/blog/unix/FundamentalFileOperation</a> got me thinking of the races involved,
with the auto invalidated attributes idea in comment 1 above.</p>
<p>If user space was updating the attributes while the file was being written to, it might record the wrong attribute values. One also has to handle multiple processes trying to update these attributes.
So the process would have to be something like:</p>
<pre>
user_prog: read attribute; proceed only if it is missing or not set to 'locked'
user_prog: delete attribute (if it was present)
user_prog: set attribute to 'locked' (fails if attribute present)
user_prog: read file and generate checksum
user_prog: update attribute with checksum (fails if attribute not present)
</pre>
<p>I.E. to avoid races, we'd need an attribute 'set', 'update'
to operate atomically as above. Awkward.
Also what if the user space prog dies without clearing the locked value.
Bah, looks like we need higher level functions provided by the file system for this, with the above operations done internally by the file system :(
Or perhaps provide mandatory locking of these file attributes.</p>
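<p>(For what it's worth, Linux setxattr(2) already offers exactly these fail-if-present and fail-if-absent modes via its XATTR_CREATE and XATTR_REPLACE flags, which is enough to build the protocol above. A minimal sketch, using an in-memory dictionary as a stand-in for one file's extended attributes so it runs anywhere; the store class and method names are hypothetical:)</p>

```python
import hashlib
import threading

class XattrStore:
    """In-memory stand-in for one file's extended attributes, with
    setxattr(2)-style XATTR_CREATE / XATTR_REPLACE semantics."""
    def __init__(self):
        self._attrs = {}
        self._mutex = threading.Lock()

    def set_create(self, name, value):
        # Fails if the attribute already exists (like XATTR_CREATE).
        with self._mutex:
            if name in self._attrs:
                raise FileExistsError(name)
            self._attrs[name] = value

    def set_replace(self, name, value):
        # Fails if the attribute is absent (like XATTR_REPLACE).
        with self._mutex:
            if name not in self._attrs:
                raise FileNotFoundError(name)
            self._attrs[name] = value

def checksum_with_lock(store, data):
    # Atomically claim the attribute; a concurrent claimer gets an error.
    store.set_create("user.checksum", b"locked")
    # Read the file contents and generate the checksum.
    digest = hashlib.sha256(data).hexdigest().encode()
    # Publish the checksum; fails if someone removed the attribute meanwhile.
    store.set_replace("user.checksum", digest)
    return digest
```

<p>(A second process calling set_create while the first holds the 'locked' value fails cleanly instead of clobbering it. The crashed-before-unlocking problem remains, of course.)</p>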
</div>
2011-11-08T12:37:04Z

From 138.92.10.18 on /blog/tech/ExpensiveDeduplication:
<div class="wikitext"><p>I think it's important to point out that ZFS does in-line block-based deduplication, and that requires a sufficient amount of cache, which can be provided via RAM and an optional (but highly recommended) CACHE device (or striped CACHE devices; preferably SSD).</p>
<p>Compare it to NetApp's implementation of dedupe (A-SIS). A-SIS is also block-based, but it is post-process, meaning once a day a process will analyze the blocks for duplicates. That does require RAM, but not as much as an in-line implementation. NetApp also imposes limits on the size of a flexvol that can be deduped, based upon the model (essentially CPU and RAM).</p>
<p>ZFS, in its current state, is definitely a file system that requires some thought about the hardware configuration.</p>
<p>Since I started this with a comment, I'll redirect by making another one regarding ZFS and Solaris 11: Solaris 11 supports encrypted ZFS. :))</p>
</div>
2011-11-03T16:49:08Z

By Chris Siebenmann on /blog/tech/ExpensiveDeduplication:
<div class="wikitext"><p>I've been kind of assuming that computing the dedup checksum is free for
a filesystem implementation, although as you note that's not quite the
case for a file-based dedup system. My view is that even if you get the
checksum for free, the dedup table issue is serious and makes dedup
expensive.</p>
<p>(You can get a file checksum for free only if the file is written as a
sequential stream. This is a common case but not the only one.)</p>
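<p>(The sequential-stream caveat can be made concrete: each append-only write can be folded into an incremental hash as it arrives, but any seek-and-rewrite invalidates the running state and forces a full re-read later. A sketch using Python's incremental hashlib interface; the wrapper class is illustrative, not any filesystem's actual mechanism:)</p>

```python
import hashlib

class SequentialChecksummer:
    """Maintains a whole-file checksum for free while writes stay
    append-only; a non-sequential write invalidates the running hash,
    which would force a full re-read to recompute the checksum."""
    def __init__(self):
        self._hash = hashlib.sha256()
        self._offset = 0
        self.valid = True

    def write(self, offset, data):
        if self.valid and offset == self._offset:
            self._hash.update(data)   # sequential: fold into running hash
            self._offset += len(data)
        else:
            self.valid = False        # seek/rewrite: running hash is useless

    def checksum(self):
        return self._hash.hexdigest() if self.valid else None
```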
</div>
2011-10-31T18:07:16Z

From 109.78.103.253 on /blog/tech/ExpensiveDeduplication:
<div class="wikitext"><p>A couple of comments about the performance of file-level dedup.</p>
<p>One can consider the size of a file as a first-level ID;
i.e., files with unique sizes can be excluded from the set to checksum.
That speeds up dedup a lot in most cases.</p>
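<p>(That size prefilter fits in a few lines: bucket files by size and only checksum buckets with more than one member. The function and tuple layout below are illustrative:)</p>

```python
import hashlib
from collections import defaultdict

def dedup_candidates(files):
    """files: iterable of (path, size, data) tuples (data stands in for
    reading the file from disk). Returns checksum -> list of paths,
    checksumming only files whose size collides with another file's."""
    by_size = defaultdict(list)
    for path, size, data in files:
        by_size[size].append((path, data))

    by_checksum = defaultdict(list)
    for group in by_size.values():
        if len(group) < 2:
            continue  # unique size: cannot be a duplicate, skip the checksum
        for path, data in group:
            by_checksum[hashlib.sha256(data).hexdigest()].append(path)
    return {h: ps for h, ps in by_checksum.items() if len(ps) > 1}
```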
<p>One thing that file systems could provide to help with file-level identification would be auto-expired attributes:
i.e., a set of (extended) attributes that would be auto-invalidated on write().
You could cache checksums there, which would help with rsync too.</p>
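<p>(Lacking kernel support, user space can approximate auto-invalidation by recording the file's mtime and size alongside the cached checksum and treating the cache as stale when they no longer match. A sketch; the sidecar-file layout is made up, a real version might use an extended attribute instead:)</p>

```python
import hashlib
import json
import os
import tempfile

CACHE_SUFFIX = ".checksum-cache"  # hypothetical sidecar file standing in for an xattr

def cached_checksum(path):
    """Return the file's sha256 hex digest, reusing a cached value if
    (mtime, size) are unchanged since it was recorded; otherwise
    recompute the checksum and rewrite the cache."""
    st = os.stat(path)
    cache_path = path + CACHE_SUFFIX
    try:
        with open(cache_path) as f:
            cache = json.load(f)
        if cache["mtime_ns"] == st.st_mtime_ns and cache["size"] == st.st_size:
            return cache["sha256"]           # cache still valid: no file read
    except (OSError, ValueError, KeyError):
        pass                                  # no cache or unreadable: recompute
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    with open(cache_path, "w") as f:
        json.dump({"mtime_ns": st.st_mtime_ns, "size": st.st_size,
                   "sha256": digest}, f)
    return digest
```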
</div>
2011-10-31T13:03:32Z