Thinking about when rsync's incremental mode doesn't help

March 21, 2014

I mentioned recently that I had seen cases where rsync's incremental mode didn't speed it up to any significant degree. Of course there's an obvious way to create such a situation, namely erasing and replacing all of the files involved, but that wasn't it for us. Our case was more subtle and it's taken me a while to understand why it happened. Ultimately it comes down to having a subtly wrong mental model of what takes time in rsync.

Our specific situation was replicating a mail spool from one machine to another. There were any number of medium and large inboxes on the mail spool, but for the most part they were just getting new messages; as far as we know, no one did a major inbox reorganization that would have changed their entire inbox. Naively you'd think that an rsync incremental transfer here could go significantly faster than a full copy; after all, most of what you need to transfer is just the new messages added to the end of most mailboxes.

What I'm quietly overlooking here is the cost of finding out what needs to be transferred, and in turn the reason for this is that I've implicitly assumed that sending things over the network is (very) expensive in comparison to reading them off the disk. This is an easy bias to pick up when you work with rsync, because rsync's entire purpose is optimizing network transmission and when you use it you normally don't really think about how it's finding out the differences. What's going on in our situation is that when rsync sees a changed file it has to read the entire file and compute block checksums (on both sides). It doesn't matter if you've just appended one new email message to a 100 Mbyte file for a measly 5 Kbyte addition at the end; rsync still has to read it all. If you have a bunch of midsized to large files (especially if they're fragmented, as mail inboxes often are), simply reading through all of the changed files can take a significant amount of time.
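To make that cost concrete, here's a sketch of the receiver-side signature pass in an rsync-style delta algorithm. The block size and the simple additive weak checksum are placeholders, not rsync's actual parameters or its rolling checksum; the point is only that computing per-block signatures touches every byte of the file on disk, no matter how small the actual change is.

```python
# Sketch of rsync-style per-block signatures (illustrative, not rsync's
# real algorithm): the receiver reads the whole file to build them.
import hashlib

BLOCK_SIZE = 700  # rsync picks a block size per file; this is a stand-in

def block_signatures(data: bytes):
    """Return a (weak, strong) checksum pair for each block of data."""
    sigs = []
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        weak = sum(block) & 0xFFFFFFFF           # stand-in for the rolling checksum
        strong = hashlib.md5(block).hexdigest()  # rsync also hashes each block
        sigs.append((weak, strong))
    return sigs

# Appending one byte to a 100 KByte file still forces a signature pass
# over all of it; only the final block's signatures actually change.
old = b"x" * (100 * 1024)
new = old + b"y"
old_sigs = block_signatures(old)
new_sigs = block_signatures(new)
```

Scale the file up to 100 Mbytes and the signature pass is 100 Mbytes of disk reads (on both sides) to ship a 5 Kbyte delta.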

In a way this is a variant of Amdahl's law. With a lot of slightly changed files an rsync incremental transfer may speed up the network IO and reduce it to nearly nothing but it can't do much about the disk IO. Reading lots of data is reading lots of data, whether or not you send it over the network; you only get a big win out of not sending it over the network if the network is slow compared to the disk IO. The closer disk and network IO speeds are to each other, the less you can save here (and the more that disk IO speeds will determine the minimum time that an rsync can possibly take).
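A back-of-the-envelope model makes the Amdahl's law point visible. All the numbers below are made up for illustration (a hypothetical 50 GB spool, 100 MB/s network, 120 MB/s disk); the shape of the result is what matters: once the delta is small, the incremental time is dominated by the disk reads of the changed files, so it can't drop much below the full-copy time unless the disk is much faster than the network.

```python
# Toy model (invented numbers, not measurements) of full-copy vs
# incremental rsync time when most files have changed slightly.

def full_copy_time(size_gb, net_mb_s):
    # Full copy: assume the network is the bottleneck.
    return size_gb * 1024 / net_mb_s

def incremental_time(size_gb, changed_fraction, delta_mb, disk_mb_s, net_mb_s):
    # Incremental: read every changed file in full, send only the delta.
    read_time = size_gb * changed_fraction * 1024 / disk_mb_s
    send_time = delta_mb / net_mb_s
    return read_time + send_time

# 50 GB spool, 90% of it in files that each gained a few messages,
# only 500 MB of genuinely new data to send:
full = full_copy_time(50, net_mb_s=100)                       # ~512 s
incr = incremental_time(50, changed_fraction=0.9, delta_mb=500,
                        disk_mb_s=120, net_mb_s=100)          # ~389 s
```

With the disk only modestly faster than the network, the incremental run saves roughly a quarter of the time despite sending a hundredth of the data; drop the disk speed to match the network and the "incremental" run comes out slower than the full copy.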

The corollary is that where you really save is by doing less disk IO as well as less network IO. This is where and why things like ZFS snapshots and incremental 'zfs send' can win big, because to a large extent they have very efficient ways of knowing the differences that need to be sent.

PS: I'm also making another assumption, namely that CPU usage is free and is not a limiting factor. This is probably true for rsync checksum calculations on modern server hardware, but you never know (and our case was actually on really old SPARC hardware so it might actually have been a limiting factor).
