A major caution when using 'rsync -a' to copy or move directory trees

February 15, 2022

We had a learning experience the other day. Part of it was about the behavior of du in the face of hardlinks and another part had to do with odd ZFS space usage behavior, but the largest part, and the ultimate cause, was that 'rsync -a' doesn't preserve hardlinks. If you copy or move a directory tree with 'rsync -a' and it contains internal hardlinks, the new copy will break those hardlinks and store each link as a separate file. Among other effects, this increases the amount of disk space that the new tree uses.
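As a rough illustration of what "breaking" a hardlink means, here is a small Python sketch. It does not run rsync at all; it uses shutil.copytree as a stand-in for any copy that goes name by name without tracking hardlinks, and the function and file names are made up for the example. It assumes a filesystem that supports hard links.

```python
import os
import shutil
import tempfile


def demo_broken_hardlinks():
    """Copy a tiny tree without hardlink awareness and report what changed."""
    with tempfile.TemporaryDirectory() as work:
        src = os.path.join(work, "src")
        dst = os.path.join(work, "dst")
        os.mkdir(src)

        # One file's worth of data, reachable under two names.
        original = os.path.join(src, "data.bin")
        with open(original, "wb") as f:
            f.write(b"x" * 4096)
        os.link(original, os.path.join(src, "alias.bin"))

        # copytree copies each name separately; it keeps no record of
        # which names shared an inode (stand-in for a copy without -H).
        shutil.copytree(src, dst)

        def linked(tree):
            return os.path.samefile(
                os.path.join(tree, "data.bin"),
                os.path.join(tree, "alias.bin"),
            )

        return {
            "src_linked": linked(src),
            "dst_linked": linked(dst),
            "src_nlink": os.stat(original).st_nlink,
            "dst_nlink": os.stat(os.path.join(dst, "data.bin")).st_nlink,
        }
```

In the source tree the two names are one file with a link count of 2; in the copy they are two independent files, each with a link count of 1, so the data is stored twice.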

This limitation is widely known on the Internet and is explicitly spelled out in the rsync manual page section on the -a option:

This is equivalent to -rlptgoD. It is a quick way of saying you want recursion and want to preserve almost everything (with -H being a notable omission). The only exception to the above equivalence is when --files-from is specified, in which case -r is not implied.

Note that -a does not preserve hardlinks, because finding multiply-linked files is expensive. You must separately specify -H.

There's also a discussion of this in the section on the --hard-links option.

(Another exception is that '-a' doesn't imply '--sparse', to preserve sparse files as sparse.)

This is at one level a sensible tradeoff. Hard links are uncommon these days, they aren't supported in all environments, and they require potentially unbounded amounts of memory to process (since you have to keep track of every file you've seen with a hard link, so you can tell if you saw it again). If you search for discussions of rsync and hardlinks on the Internet, you can find people who've had problems with memory usage when dealing with large, heavily hardlinked directory trees.
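The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration of the general technique (remember every multiply-linked inode you have seen, keyed by device and inode number), not a description of how rsync itself implements -H; the function name is hypothetical.

```python
import os
import stat
from collections import defaultdict


def find_hardlink_groups(root):
    """Return lists of paths under root that share a single inode."""
    # (st_dev, st_ino) -> every path seen so far for that inode.
    # This table is the memory cost: one entry per multiply-linked
    # inode encountered, kept until the whole walk is finished.
    seen = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            # Files with a link count of 1 never need to be remembered.
            if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
                seen[(st.st_dev, st.st_ino)].append(path)
    # Only inodes that were reached by two or more names inside the tree
    # count as internal hardlinks.
    return [sorted(paths) for paths in seen.values() if len(paths) > 1]
```

Note that an entry has to be kept even for a file whose other links turn out to live outside the tree, since there is no way to know that until the walk is complete. On a large, heavily hardlinked tree the table grows accordingly.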

At the same time it's not entirely ideal for system administrators who by default think of 'rsync -a' as a faithful way to copy, clone, move, or back up a directory tree. While it is in the manual page, the rsync manual page is very big and most people don't read it carefully even once (never mind often enough to remember this if they haven't been burned by it). And usually it works because usually you don't have hard links or it doesn't really matter if they get broken (just like it usually doesn't matter if sparse files get de-sparsed in an rsync copy).
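If you want to check after the fact whether a copy de-sparsed a file, one common heuristic is to compare the allocated blocks against the apparent size. The Python sketch below assumes a Unix-like system where st_blocks is reported in 512-byte units; the function names are made up for the example, and filesystems with compression or inline data can make the heuristic misfire.

```python
import os


def looks_sparse(size_bytes, blocks_512):
    """True if fewer bytes are allocated than the file's apparent size."""
    return blocks_512 * 512 < size_bytes


def file_looks_sparse(path):
    """Apply the heuristic to a real file (assumes st_blocks is available)."""
    st = os.stat(path)
    # st_blocks counts 512-byte units on Linux and most Unixes.
    return looks_sparse(st.st_size, st.st_blocks)
```

Running file_looks_sparse() on the same file in the source and the destination tree gives a quick hint as to whether the copy filled in the holes.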

Since we've had a learning experience about rsync and hardlinks, we're probably going to remember this for years to come (or at least I hope). We're certainly updating scripts and canned practices to use '-H' with '-a', and now that I've looked it up we may well add '-S' to that too. And I should probably read over the entire rsync manual page to see if we're missing anything else, even though I expect it to be very boring.
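One way to make this stick is to keep the flags in a single helper instead of retyping them. The following Python sketch is a hypothetical wrapper, not one of our actual scripts; it only uses the rsync options discussed here (-a, -H, -S) plus -n for a dry run.

```python
import subprocess


def rsync_copy_argv(src, dst, dry_run=False):
    """Build an rsync command line that keeps hardlinks and sparse files."""
    argv = ["rsync", "-a", "-H", "-S"]
    if dry_run:
        # -n (--dry-run) shows what would happen without copying anything.
        argv.append("-n")
    argv += [src, dst]
    return argv


def rsync_copy(src, dst, dry_run=False):
    """Run the copy; raises CalledProcessError if rsync exits non-zero."""
    return subprocess.run(rsync_copy_argv(src, dst, dry_run), check=True)
```

As usual with rsync, whether src ends in a slash changes whether the directory itself or only its contents are copied, so the helper passes the paths through untouched.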

(I had a narrow personal escape with this. I almost made a new root filesystem for my home desktop recently, and if I had, I might well have copied the old root filesystem to the new one with 'rsync -a'.)


Comments on this page:

By Alex Xu (Hello71) at 2022-02-15 21:33:07:

when copying a whole system, -X is also quite important. ping in particular may stop working when running non-root after it loses its file caps. on recent systemd distros this shouldn't be a huge issue due to ping_group_range, but on non-systemd or older systemd distros it may be confusing to receive permission denied errors when executing ping.

By cpu-chow at 2022-02-16 00:30:47:

The (xfs_)dump/restore utilities transfer whole filesystems faithfully, but not remotely as far as I am aware. I suppose you could improvise using full/incremental dumps and a clever socat or ssh incantation piping to remote (xfs_)restore, but if the remote destination is not static, or if you require two-way synch, all bets are off.

[hearty thanks for your delightfully arcane and thought-provoking blog]

By Andrew Sh at 2022-02-16 02:12:11:

The -A option may be important also. Several good examples: https://wiki.archlinux.org/title/Rsync

By Michael at 2022-02-16 02:16:47:

I think you meant -S (--sparse), not -s (--protect-args).

By Arnaud Gomes at 2022-02-16 03:13:23:

My go-to rsync invocation includes -aAXSH --numeric-ids. Maybe add -aP and/or --delete, depending on the context.

   -- A

By Albert at 2022-02-16 05:03:21:

For maximum reliability I always use -avzHAX.

I have rsync -azvHC in muscle memory…

By cks at 2022-02-16 09:04:30:

I did indeed mean -S, not -s; for no good reason, I just assumed that the short version of --sparse was lower case instead of upper case. I've updated the entry, and thank you.

(rsync is like ls, it's a game of what short option isn't valid.)

From 71.219.61.14 at 2022-02-16 10:24:54:

Like other commentators here I've got my fingers hardwired to always use -Hav

In general the "exact copy" thing has these options:

 -a
 -x don't cross filesystem boundaries.
 -A ACLs too.
 -X extended attributes too.
 -H keep hard-links.

Somewhat less exact adds one or more of:

 -O [optional] don't preserve directory modification times.
 -J [optional] don't preserve symlink modification times.
 --noatime [optional] do not change access time of source files.
 --numeric-ids [optional] preserve owner and group numbers.

With -OJ, directory and symlink times are not preserved exactly, but preserving them is usually pointless and expensive, as is omitting --noatime. Sometimes --numeric-ids is more exact and sometimes it is not.

Written on 15 February 2022.

Last modified: Tue Feb 15 20:55:41 2022