A major caution when using 'rsync -a' to copy or move directory trees

February 15, 2022

We had a learning experience the other day. Part of it was about the behavior of du in the face of hardlinks, and another part was to do with odd ZFS space usage behavior, but the largest part, and the ultimate cause, was that 'rsync -a' doesn't preserve hardlinks. If you copy or move a directory tree with 'rsync -a' and the tree contains internal hardlinks, the new copy breaks those hardlinks and stores each hardlinked name as a separate file. Among other effects, this increases the amount of disk space that the new tree uses.
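
To get a feel for how much extra space is at stake, here is a minimal Python sketch that walks a tree and totals file sizes two ways: once per name (roughly what a copy without hardlinks would hold) and once per underlying inode (what the original tree holds). The function name tree_sizes is made up for illustration, and it uses apparent file sizes rather than allocated blocks, so treat the numbers as an estimate.

```python
import os
import stat


def tree_sizes(root):
    """Return (bytes_if_links_broken, bytes_if_links_kept) for regular files under root."""
    per_name = 0    # every directory entry counted separately
    per_inode = 0   # each underlying file counted only once
    seen = set()    # (device, inode) pairs already counted
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, devices, and so on
            per_name += st.st_size
            key = (st.st_dev, st.st_ino)
            if key not in seen:
                seen.add(key)
                per_inode += st.st_size
    return per_name, per_inode
```

If the two numbers differ, the tree has internal hardlinks and a plain 'rsync -a' copy of it will come out bigger than the original.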

This limitation is widely known on the Internet and is explicitly spelled out in the rsync manual page section on the -a option:

    This is equivalent to -rlptgoD. It is a quick way of saying you want recursion and want to preserve almost everything (with -H being a notable omission). The only exception to the above equivalence is when --files-from is specified, in which case -r is not implied.

    Note that -a does not preserve hardlinks, because finding multiply-linked files is expensive. You must separately specify -H.

There's also a discussion of this in the section on the --hard-links option.

(Another omission is that '-a' doesn't imply '--sparse', the option that keeps sparse files sparse in the copy.)

This is at one level a sensible tradeoff. Hard links are uncommon these days, they aren't supported in all environments, and they require potentially unbounded amounts of memory to process (since you have to keep track of every multiply-linked file you've seen, so you can tell if you see it again under another name). If you search for discussions of rsync and hardlinks on the Internet, you can find people who've had problems with memory usage when dealing with large, heavily hardlinked directory trees.
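
The bookkeeping involved looks roughly like the following Python sketch. This is not how rsync itself is implemented; it is only an illustration of the general idea, with a made-up function name (find_hardlink_groups). The table it builds needs an entry for every multiply-linked file in the tree, which is where the memory goes on heavily hardlinked trees.

```python
import os
import stat
from collections import defaultdict


def find_hardlink_groups(root):
    """Map (device, inode) to the paths under root that share that inode."""
    names = defaultdict(list)  # grows with every multiply-linked file seen
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            # Only files with a link count above one can have another name,
            # so single-link files need no bookkeeping at all.
            if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
                names[(st.st_dev, st.st_ino)].append(path)
    # Keep only inodes with at least two names inside this tree; for the
    # rest, the other links live somewhere outside root.
    return {key: sorted(paths) for key, paths in names.items() if len(paths) > 1}
```

Run against a tree you are about to copy, a non-empty result means 'rsync -a' on its own would break those links.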

At the same time, it's not ideal for system administrators, who by default think of 'rsync -a' as a faithful way to copy, clone, move, or back up a directory tree. While the limitation is in the manual page, the rsync manual page is very big and most people don't read it carefully even once (never mind often enough to remember this if they haven't been burned by it). And 'rsync -a' usually works, because usually you don't have hard links or it doesn't really matter if they get broken (just like it usually doesn't matter if sparse files get de-sparsed in an rsync copy).

Since we've had a learning experience about rsync and hardlinks, we're probably going to remember this for years to come (or at least I hope). We're certainly updating scripts and canned practices to use '-H' with '-a', and now that I've looked it up we may well add '-S' to that too. And I should probably read over the entire rsync manual page to see if we're missing anything else, even though I expect it to be very boring.
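
For scripts, one way to make this hard to forget is to build the rsync command line in a single place. The following is a small sketch of that idea in Python; the helper name rsync_copy_command and its sparse parameter are hypothetical, not anything from our actual scripts. The only rsync options it relies on are -a, -H (--hard-links), and -S (--sparse).

```python
def rsync_copy_command(src, dst, sparse=True):
    """Build an rsync argument list that keeps hardlinks (and optionally sparseness)."""
    cmd = ["rsync", "-a", "-H"]  # -H is --hard-links, which -a leaves out
    if sparse:
        cmd.append("-S")         # -S is --sparse, also not implied by -a
    cmd += [src, dst]
    return cmd
```

The resulting list can be handed to subprocess.run(cmd, check=True). The helper only assembles arguments, so the usual rsync rules about a trailing slash on the source still apply to whatever you pass in.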

(I had a narrow personal escape with this. I almost made a new root filesystem for my home desktop recently, and if I had, I might well have copied the old root filesystem to the new one with 'rsync -a'.)

