Wandering Thoughts archives

2020-04-10

'Deduplicated' ZFS send streams are now deprecated and on the way out

For a fair while, 'zfs send' has had support for a -D argument, aka --dedup, that causes it to send what is called a 'deduplicated stream'. The zfs(1) manpage describes this as:

Generate a deduplicated stream. Blocks which would have been sent multiple times in the send stream will only be sent once. The receiving system must also support this feature to receive a deduplicated stream. This flag can be used regardless of the dataset's dedup property, but performance will be much better if the filesystem uses a dedup-capable checksum (for example, sha256).

This feature is now on the way out in the OpenZFS repository. It was removed in a commit on March 18th, and the commit message explains the situation:

Dedup send can only deduplicate over the set of blocks in the send command being invoked, and it does not take advantage of the dedup table to do so. This is a very common misconception among not only users, but developers, and makes the feature seem more useful than it is. As a result, many users are using the feature but not getting any benefit from it.

Dedup send requires a nontrivial expenditure of memory and CPU to operate, especially if the dataset(s) being sent is (are) not already using a dedup-strength checksum.

Dedup send adds developer burden. It expands the test matrix when developing new features, causing bugs in released code, and delaying development efforts by forcing more testing to be done.

As a result, we are deprecating the use of `zfs send -D` and receiving of such streams. This change adds a warning to the man page, and also prints the warning whenever dedup send or receive are used.

I actually had the reverse misconception about how deduplicated sends worked; I assumed that they required deduplication to be on in the filesystem itself. Since we will never use deduplication, I never looked any further at the 'zfs send' feature. It probably wouldn't have been a net win for us anyway, since our OmniOS fileservers didn't have particularly fast CPUs and we definitely weren't using one of the dedup-strength checksums.

(Our current Linux fileservers have better CPUs, but I think they're still not all that impressive.)
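
(To make the commit message's first point concrete, here is a toy Python sketch of deduplication that is scoped to a single stream. It is not the real ZFS send stream format or code; the dedup_stream name and the 'data'/'ref' record shapes are made up for illustration. What it shows is that the table of seen checksums starts out empty on every invocation, so nothing is shared between separate sends and the filesystem's dedup table is never consulted.)

```python
import hashlib


def dedup_stream(blocks):
    """Toy model of a deduplicated stream (hypothetical, not the ZFS format).

    Yields ("data", block) the first time a block's checksum is seen in
    THIS stream, and ("ref", index) pointing back at the earlier record
    when the same data shows up again.
    """
    seen = {}  # checksum -> index of the first record carrying that data
    for index, block in enumerate(blocks):
        digest = hashlib.sha256(block).digest()
        if digest in seen:
            yield ("ref", seen[digest])
        else:
            seen[digest] = index
            yield ("data", block)


# Within one stream, the repeated block is sent only once.
first = list(dedup_stream([b"A", b"B", b"A"]))
# first == [("data", b"A"), ("data", b"B"), ("ref", 0)]

# A second, separate stream starts with an empty table and sends b"A" again.
second = list(dedup_stream([b"A"]))
# second == [("data", b"A")]
```

Hashing every block like this is also where the CPU and memory cost in the second point comes from when the data isn't already checksummed with something dedup-strength.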

The ZFS people are planning various ways to cope with this removal so that people will still be able to use saved deduplicated send streams. However, if you have such streams in your backup systems, you should probably think about aging them out. And you should definitely move away from generating new ones, even though as far as I know this change is not yet in any release of ZFS (on any platform).

solaris/ZFSStreamDedupGone written at 22:58:33

Why my commit messages for configuration files describe my changes

Over the years, I have wound up adopting a particular and somewhat unusual style of commit message for many of my changes to system files like /etc/group, to things like DNS and DHCP control files, and to configuration files. The unusual thing I do is that in my commit message I don't just say why the change is being made, I say what the change itself is (in the abstract). For instance, for a change to our /etc/group, I might say "added <x>, <y>, and <z> to group 'fred'" (with the <>'s as part of the text, because '<cks>' is our local style for writing out logins).

On the surface, this is strange. What I changed is right there in the diff itself; putting it in the commit message appears redundant and feels somewhat like putting a '// add x and y together' comment in code. However, this is not quite true. The diff is what I did change, while the commit message is what I intended to change. When all goes well, the two are the same. But things don't always go well, and when that happens having an explicit description of the intent can be important.

Of course, programmers can have this problem too. But as a sysadmin and sometimes programmer, I've wound up feeling that sysadmins are both more prone to this problem and better placed to deal with it through commit messages. On the bad side, many more mistakes with the files we deal with are perfectly valid and functional results, just not what we intended. And generally we don't have the sort of tests that programmers do, which would catch some of these mistakes. On the good side, many of our changes are small enough that what we intended to do can be described in high detail in a short commit message, in a way that's not the case for many code changes.

(Generally, our intentions will also appear in our worklog system. But having them in the commit message saves finding the relevant worklog, and since I generally commit right after looking at a diff (and with it still on the screen), writing out what the diff should show may help me actively notice an error.)

PS: It doesn't help that many control and configuration files are rather less readable than well-formatted code is, and often give you diffs where what actually changed is harder to see than in most code changes. If you're just adding a login or two to a group, a diff of /etc/group has a lot of noise that can make it hard to see the important signal.
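
(As an illustration of checking what the diff did against what was intended, here is a minimal Python sketch that compares two versions of /etc/group-style text and reports membership changes in roughly the commit message style described above. The parse_group and describe_changes names are hypothetical, and the sketch assumes the usual four colon-separated fields, name:password:GID:members, skipping any line that doesn't have them.)

```python
def parse_group(text):
    """Map group name -> set of member logins from /etc/group-style text."""
    groups = {}
    for line in text.splitlines():
        fields = line.strip().split(":")
        if len(fields) < 4:
            continue  # blank or malformed line; ignored in this sketch
        name, members = fields[0], fields[3]
        groups[name] = {m for m in members.split(",") if m}
    return groups


def fmt(logins):
    """Write logins in the '<login>' style, in sorted order."""
    return ", ".join(f"<{login}>" for login in sorted(logins))


def describe_changes(old_text, new_text):
    """Return one line per membership change between two group files."""
    old, new = parse_group(old_text), parse_group(new_text)
    lines = []
    for name in sorted(set(old) | set(new)):
        if name not in old:
            lines.append(f"new group '{name}'")
            continue
        if name not in new:
            lines.append(f"removed group '{name}'")
            continue
        added = new[name] - old[name]
        removed = old[name] - new[name]
        if added:
            lines.append(f"added {fmt(added)} to group '{name}'")
        if removed:
            lines.append(f"removed {fmt(removed)} from group '{name}'")
    return lines


before = "fred:x:100:a\nstaff:x:50:cks\n"
after = "fred:x:100:a,x,y\nstaff:x:50:cks\n"
print("\n".join(describe_changes(before, after)))
# prints: added <x>, <y> to group 'fred'
```

If the summary this prints doesn't match the sentence you were about to put in the commit message, then the diff is not the change you intended to make.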

sysadmin/SysadminCommitMsgWhat written at 00:50:44

