2022-03-01
The problem of keeping track of hardlinks as you traverse a directory tree
One of the reactions to my discovery about rsync not handling hardlinks by default was surprise that memory usage was a concern when doing this. After all, you might ask, how hard is the problem and how much memory are we talking about? That depends on what sort of program you are and how thorough a job you want to do.
A program like du
has the simplest case. GNU du's normal behavior
is to ignore hardlinks after the first time it sees them, so all it
needs to do is keep track of enough information to recognize a file
with more than one link if it encounters that file again. Normally this is the file's inode number and
device. Modern Unixes use 64-bit inode numbers and generally 64-bit
device numbers, so you're looking at 16 bytes per hardlinked file.
Since you don't want to do a linear search of a potentially very
large array every time you encounter a hardlinked file, you're also
going to need additional space for some sort of random access data
structure. It's probably not worth keeping track of the link count,
especially since it's also a 64-bit integer on at least one popular
platform (Linux); you're probably losing more space than you gain
back by being able to drop entries when you've seen all their links.
However, all of this space usage is relatively trivial by modern
standards.
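As a concrete illustration, here is a minimal sketch in Python of that style of bookkeeping (my own sketch, not GNU du's actual implementation); it remembers a set of (device, inode) pairs for files with a link count above one and skips a file the second time one of its links turns up:

    import os

    def total_size(top):
        # du-style walk: count each hardlinked file's size only once.
        seen = set()          # (st_dev, st_ino) for files with st_nlink > 1
        total = 0
        for dirpath, dirnames, filenames in os.walk(top):
            for name in filenames:
                st = os.lstat(os.path.join(dirpath, name))
                if st.st_nlink > 1:
                    key = (st.st_dev, st.st_ino)
                    if key in seen:
                        continue   # another link to a file already counted
                    seen.add(key)
                total += st.st_size
        return total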
(You can reduce the memory needs in most situations by having a two-level data structure, where you give each separate device number its own inode lookup structure. This reduces the per-file information down to one 64-bit number from two, since you're storing the device number only once instead of once per hardlinked file. Usually you'll only be dealing with a few device numbers so this is a real win. It does complicate the code a bit. I don't know if GNU du uses this approach.)
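The two-level version might look like the following, again as a hypothetical sketch rather than anything GNU du is known to do; the device number becomes a dictionary key, so each hardlinked file costs one stored inode number plus hash table overhead:

    from collections import defaultdict

    # st_dev -> set of st_ino; each device number is stored once per
    # device, not once per hardlinked file.
    seen = defaultdict(set)

    def already_seen(st):
        # Record this (device, inode) pair and report whether we had it already.
        if st.st_ino in seen[st.st_dev]:
            return True
        seen[st.st_dev].add(st.st_ino)
        return False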
A program like rsync
or tar
has a harder problem, because it
also needs to keep track of the full file name for some instance
of the hardlinked file (normally it will keep track of the first
one). Full file names can be long, especially if you have something
annoying like a deeply nested directory structure. Given that
additional memory usage, you now probably do want to store the link
count as well, so that you can delete entries once you no longer need them.
(The good news is that you don't need to look things up by file name, so you can continue to only have a fast random access data structure for the inode number plus the device. And you can still optimize the handling of device numbers, although this is no longer so much of a memory saving on a percentage basis.)
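As an illustration of this bookkeeping (a hypothetical Python sketch, not rsync's or tar's real data structures), you can map (device, inode) to the first path you saw plus a countdown of the links you haven't encountered yet, and discard the entry once the countdown runs out:

    import os

    # (st_dev, st_ino) -> [first path seen, links not yet encountered]
    hardlinks = {}

    def note_file(path):
        # Returns the first-seen path if this is another link to a known
        # file, or None if the file is new (or not hardlinked at all).
        st = os.lstat(path)
        if st.st_nlink <= 1:
            return None
        key = (st.st_dev, st.st_ino)
        entry = hardlinks.get(key)
        if entry is None:
            hardlinks[key] = [path, st.st_nlink - 1]
            return None
        first, remaining = entry
        if remaining <= 1:
            del hardlinks[key]    # that was the last link; drop the entry
        else:
            entry[1] = remaining - 1
        return first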
The worst case for both du
style and rsync
style programs is
when you are working with a directory tree where all files have
hardlinks that go outside the tree. In this case, both programs wind up
permanently keeping track of all files in the tree in RAM; even if you
track link count, you'll never see all of the links for any file so you
can never drop any entries. Such directory trees can happen in a variety
of situations, including space-saving backups that hardlink unchanged
files between backup trees.
For scale, a large root filesystem on one of our compute servers has around 2.5 million files with the text of their full file names amounting to just over 200 Mbytes. On our NFS fileservers, the filesystem with the most files has around 7 million of them and a total full filename text size on the order of 665 Mbytes. On modern machines with multiple gigabytes of RAM that are often mostly unused, programs like du and rsync should be able to handle even the very unlikely worst case situation without blowing up. In more constrained environments, they might start to have issues.
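As a back of the envelope check using those figures, and assuming that everything in the 7 million file filesystem is hardlinked and so has to stay in the table (the per-entry data structure overhead here is a guess, not a measurement):

    files = 7_000_000                 # files in the biggest filesystem above
    name_bytes = 665 * 1024 * 1024    # total full filename text
    per_entry = 8 + 8 + 8             # inode + device + link count, in bytes
    overhead = 48                     # assumed hash table/allocator cost per entry
    print((name_bytes + files * (per_entry + overhead)) / 2**30)   # about 1.1 GiB

That comes out to a bit over a gigabyte, which is no problem on a machine with plenty of otherwise idle RAM and rather less comfortable on a small one.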
(This was sparked by reddit comments on my rsync entry.)