Rsync'ing (only) some of the top level pieces of a directory

February 6, 2023

Suppose, not hypothetically, that you have a top level directory which contains some number of subdirectories, and you want to selectively create and maintain a copy of only part of this top level directory. However, what you want to copy over changes over time and you want unwanted things to disappear on the destination (because otherwise they'll stick around using up space that you need for things you care about). Some of the now-unwanted things will still exist on the source but you don't want them on the copy any more; others will disappear entirely on the source and need to disappear on the destination too.

This sounds like a tricky challenge with rsync but it turns out that there is a relatively straightforward way to do it. Let's say that you want to decide what to copy based (only) on the modification time of the top level subdirectories; you want a copy of all recently modified subdirectories that still exist on the source. Then what you want is this:

cd /data/prometheus/metrics2
find * -maxdepth 0 -mtime -365 -print |
 sed 's;^;/;' |
  rsync -a --delete --delete-excluded \
        --include-from - --exclude '/*' \
        . backupserv:/data/prometheus/metrics2/

Here, the 'find' prints everything in the top level directory that's been modified within the last year. The 'sed' takes that list of names and sticks a '/' on the front, turning names like 'wal' into '/wal', because to rsync this definitely anchors them to the root of the directory tree being (recursively) transferred (per rsync's Pattern Matching Rules and Anchoring Include/Exclude Patterns). Finally, the rsync command says to delete now-gone things in directories we transfer, delete things that are excluded on the source but present on the destination, read the list of what to include from standard input (i.e., the output of our 'sed'), and then exclude everything that isn't specifically included.
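To make the effect of the pipeline concrete, here is a minimal sketch that prints the filter rules rsync ends up with, in the order it consults them. The show_rules name is hypothetical and the helper never runs rsync; it relies only on the fact that --include PATTERN is equivalent to a '+ PATTERN' filter rule and --exclude PATTERN to a '- PATTERN' rule.

```shell
# Hypothetical helper (not part of rsync): print the filter rules that the
# pipeline above hands to rsync for top level directory $1, in rule order.
show_rules() {
    (
        cd "$1" || exit 1
        # Same selection as above: top level names modified in the last year,
        # each turned into an anchored include rule such as '+ /wal'.
        find * -maxdepth 0 -mtime -365 -print | sed 's;^;+ /;'
    ) || return 1
    # The trailing --exclude '/*' becomes the final, catch-all rule.
    echo '- /*'
}
```

Running something like 'show_rules /data/prometheus/metrics2' before the real transfer is a cheap way to confirm that the names you expect are on the list and that the catch-all exclude comes last.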

(All of this is easier than I expected when I wrote my recent entry on discovering this problem; I thought I might have to either construct elaborate command line arguments or write some temporary files. That --include-from will read from standard input is very helpful here.)

If you don't think to check the rsync manual page, especially its section on Filter Rules, you can have a little rsync accident because you absently think that rsync is 'last match wins' instead of 'first match wins' and put the --exclude before the --include-from. This causes everything to be excluded, and rsync will dutifully delete the entire multi-terabyte copy you made in your earlier testing, because that's what you told it to do when you used --delete-excluded.
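One safe way to see the difference that option order makes is a dry run against throwaway directories. The following is only a sketch: the demo_rule_order name and the scratch layout (a 'keep' and an 'old' directory on both sides) are made up for illustration, and it assumes rsync is installed locally. Everything runs with --dry-run, so nothing is deleted even in the scratch copy.

```shell
# Sketch: compare the two option orders on throwaway directories.
# Both rsync runs use --dry-run, so they only report what they would do.
demo_rule_order() {
    tmp=$(mktemp -d) || return 1
    mkdir -p "$tmp/src/keep" "$tmp/src/old" "$tmp/dst/keep" "$tmp/dst/old"

    # Intended order: includes first, catch-all exclude last.
    good=$(printf '/keep\n' |
        rsync -a --dry-run --itemize-changes --delete --delete-excluded \
              --include-from - --exclude '/*' \
              "$tmp/src/" "$tmp/dst/" | grep -c '^\*deleting')

    # Accident order: the exclude matches first, so the includes never apply.
    bad=$(printf '/keep\n' |
        rsync -a --dry-run --itemize-changes --delete --delete-excluded \
              --exclude '/*' --include-from - \
              "$tmp/src/" "$tmp/dst/" | grep -c '^\*deleting')

    echo "includes first: $good deletion(s); exclude first: $bad deletion(s)"
    rm -rf "$tmp"
}
```

With this layout, the includes-first run should want to delete only 'old' on the destination, while the exclude-first run should want to delete both 'old' and 'keep', which is the small-scale version of losing the whole copy.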

(In general I should have carefully read all of the rsync manual page's various sections on pattern matching and filtering. It probably would have saved me time, and it would definitely have left me better informed about how rsync actually behaves.)


Comments on this page:

I have to admit that find * makes me twitch.

First of all, it will not include objects whose name starts with a dot. Though since this is for a TSDB, maybe they can never happen. (And at least for . and .. you wouldn’t want those included, obviously.)

Secondly, one stray object whose name starts with a dash might ruin the day. Though since again this is for a TSDB, maybe they can never happen.

Nonetheless I would personally rather replace the find * with find ./* at least – or better still, with find . -mindepth 1 -maxdepth 1. Both replacements will require changing the next line to sed 's;^\./;/;', of course.

Note though that if you go with the find . -mindepth 1 -maxdepth 1 version and find is the GNU Findutils version, then you can simplify the sed away entirely by replacing find’s -print switch with -printf '/%P\n'

(Finally of course using the rsync -0 and find -print0 or -printf '/%P\0' switches would tighten things up further.)
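Put together, these suggestions might look like the following sketch. It assumes GNU find (for -printf) and an rsync that supports -0 (--from0), which makes it read NUL-terminated rules from --include-from. The recent_includes0 name is hypothetical; the paths in the usage comment are the ones from the entry.

```shell
# Hypothetical helper: emit NUL-terminated, anchored include patterns for
# every top level entry of $1 modified in the last 365 days.
# Assumes GNU find, since -printf is not in POSIX find.
recent_includes0() {
    find "$1" -mindepth 1 -maxdepth 1 -mtime -365 -printf '/%P\0'
}

# Intended use (not run here); -0 makes rsync read NUL-terminated rules:
#   cd /data/prometheus/metrics2 &&
#   recent_includes0 . |
#     rsync -a -0 --delete --delete-excluded \
#           --include-from - --exclude '/*' \
#           . backupserv:/data/prometheus/metrics2/
```

This picks up dotfiles and names starting with a dash, and survives newlines in names. It does not escape rsync's own wildcard characters, so a name containing '*', '?' or '[' would still be treated as a pattern rather than a literal name.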

By cks at 2023-02-13 19:11:43:

In the case of Prometheus's TSDB, the contents of the TSDB's top level directory have a fixed format and are very predictable so the 'find *' is relatively safe. If top level names starting with '.' or '-' appear, something has already gone badly wrong. However, I believe I can count on GNU Findutils, since this is Ubuntu and they're very unlikely to switch to a different find in future versions, so your non-sed version is clearly better.

By edgewood at 2024-02-29 19:57:27:

Late comment, but I almost always use the rsync options --itemize-changes and --dry-run to ensure that the rest of the options do what I'm expecting, then drop --dry-run when I'm satisfied.
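That workflow can be wrapped up so the preview and the real run are guaranteed to use the same options. This is a minimal sketch, not a tested backup script: the sync_recent name and its 'check'/'apply' modes are invented here, and it reuses the pipeline from the entry unchanged.

```shell
# Hypothetical wrapper: 'check' previews with --dry-run --itemize-changes,
# 'apply' runs the identical command for real.
sync_recent() {
    mode=$1 src=$2 dest=$3
    case $mode in
        check) preview='--dry-run --itemize-changes' ;;
        apply) preview='' ;;
        *) echo "usage: sync_recent check|apply SRC DEST" >&2; return 2 ;;
    esac
    (
        cd "$src" || exit 1
        # $preview is deliberately unquoted so it splits into separate options.
        find * -maxdepth 0 -mtime -365 -print | sed 's;^;/;' |
            rsync -a $preview --delete --delete-excluded \
                  --include-from - --exclude '/*' \
                  . "$dest"
    )
}
```

A typical session would be 'sync_recent check /data/prometheus/metrics2 backupserv:/data/prometheus/metrics2/', a read through the itemized '*deleting' lines, and then the same command with 'apply'. Because the function changes into SRC first, DEST should be an absolute path or a remote host:path.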

Written on 06 February 2023.

Last modified: Mon Feb 6 23:08:38 2023