2014-03-21
Thinking about when rsync's incremental mode doesn't help
I mentioned recently that I had seen cases where rsync's incremental
mode didn't speed things up to any significant degree. Of course
there's an obvious way to create such a situation, namely erasing and
replacing all of the files involved, but that wasn't it for us. Our
case was more subtle and it's taken me a while to understand why it
happened. Ultimately it comes down to having a subtly wrong mental
model of what takes time in rsync.
Our specific situation was replicating a mail spool from one machine
to another. There were any number of medium and large inboxes on the
mail spool, but for the most part they were just getting new messages;
as far as we know, no one did a major inbox reorganization that would
have changed their entire inbox. Naively you'd think that an rsync
incremental transfer here could go significantly faster than a full
copy; after all, most of what you need to transfer is just the new
messages added to the end of most mailboxes.
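(For concreteness, a replication like ours might be driven by an rsync
invocation along these lines; the paths and host name here are
hypothetical, not our actual setup:

    rsync -a --delete /var/mail/ replica:/var/mail/

'-a' preserves ownership, permissions, and timestamps, and '--delete'
removes mailboxes that have disappeared on the source.)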
What I'm quietly overlooking here is the cost of finding out what
needs to be transferred, and in turn the reason for this is that I've
implicitly assumed that sending things over the network is (very)
expensive in comparison to reading them off the disk. This is an easy
bias to pick up when you work with rsync, because rsync's entire
purpose is optimizing network transmission and when you use it you
normally don't really think about how it finds the differences.
What's going on in our situation is that when rsync sees a changed
file it has to read the entire file and compute block checksums (on
both sides). It doesn't matter if you've just appended one new email
message to a 100 Mbyte file for a measly 5 Kbyte addition at the end;
rsync still has to read it all. If you have a bunch of midsized
to large files (especially if they're fragmented, as mail inboxes
often are), simply reading through all of the changed files can take a
significant amount of time.
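To make that fixed cost concrete, here's a minimal sketch in Python of
the per-file work. This is not rsync's actual algorithm (rsync uses a
rolling weak checksum plus a stronger hash, and negotiates its block
size); the fixed block size and MD5 here are stand-ins I've picked for
illustration. The point is just that every byte gets read:

    import hashlib

    def block_checksums(path, block_size=64 * 1024):
        # Simplified stand-in for rsync's per-file checksumming.
        # However the checksums are actually computed, the whole
        # file has to be read from disk to compute them.
        sums = []
        with open(path, "rb") as f:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                sums.append(hashlib.md5(block).hexdigest())
        return sums

A 100 Mbyte inbox with one 5 Kbyte message appended still costs you
100 Mbytes of reading (on each machine) before anyone can conclude
that only the final blocks need to travel.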
In a way this is a variant of Amdahl's law. With a lot of slightly
changed files an rsync incremental transfer may speed up the network
IO and reduce it to nearly nothing, but it can't do much about the
disk IO. Reading lots of data is reading lots of data, whether or
not you send it over the network; you only get a big win out of not
sending it over the network if the network is slow compared to the
disk IO. The closer disk and network IO speeds are to each other,
the less you can save here (and the more that disk IO speeds will
determine the minimum time that an rsync can possibly take).
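A quick back of the envelope shows the effect. The speeds here are
numbers I've made up for illustration, not measurements from our
machines:

    # Made-up illustrative numbers, not measurements.
    size_mb = 1000      # total size of slightly changed mailboxes
    disk_mb_s = 50      # assumed sequential disk read speed
    net_mb_s = 10       # assumed network throughput

    full_copy = size_mb / net_mb_s     # 100 seconds, network bound
    incremental = size_mb / disk_mb_s  # 20 seconds just to read it all
    print(full_copy, incremental)

With a 5x gap between disk and network speed, an incremental transfer
can be at most about 5x faster; if the network were as fast as the
disks, it would barely win at all.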
The corollary is that where you really save is by doing less disk IO as
well as less network IO. This is where and why things like ZFS snapshots
and incremental 'zfs send' can win big, because to a large extent they
have very efficient ways of knowing the differences that need to be
sent.
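As an illustration (the pool and filesystem names here are
hypothetical), the ZFS version of this replication looks something
like:

    zfs snapshot tank/mail@today
    zfs send -i tank/mail@yesterday tank/mail@today | ssh replica zfs receive tank/mail

Because ZFS already knows which blocks changed between the two
snapshots, an incremental send only has to read and transmit those
blocks; it never scans through unchanged data the way rsync does.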
PS: I'm also making another assumption, namely that CPU usage is free
and not a limiting factor. This is probably true for rsync checksum
calculations on modern server hardware, but you never know (and our
case was on really old SPARC hardware, so CPU might actually have been
a limiting factor).