Beware of trying to compare the size of subtrees with du

February 14, 2022

One of the things I like to do to understand space usage is to use du to look at both the aggregate usage of a directory tree and a breakdown of where the space is going (often with the handy -h options to GNU du and sort). This is also something you may wind up doing if you want to compare the disk space usage of two versions of a directory tree and its subtrees (for example, the disk space usage in / for two systems). However, there is a somewhat subtle trap hiding in a comparison of subtree sizes, and that trap is hardlinks.

The trap is very clearly described in the du info documentation:

If two or more hard links point to the same file, only one of the hard links is counted. The FILE argument order affects which links are counted, and changing the argument order may change the numbers and entries that ‘du’ outputs.

(This can be turned off with the -l option, if you remember.)

This is a fair decision on du's part. It wants to give you an honest view of how much space in total is consumed by the top level argument, which means counting hard links pointing to the same file only once. Once it does that, it's easier to report the space usage of hard linked files only in the first subtree it finds them in, and it would be odd if the sum of the sizes of subtrees didn't add up to the top level size.

However, this has some surprising consequences. First, you can get different answers if you do 'du -h fred/barney' and if you do 'du -h fred' and look at the line for fred/barney. If there are some hard links in fred/barney that are for files in other parts of the fred/ tree, the first du will include them but the second du might exclude them from the fred/barney total, because they've already been counted in another subtree.

Second, two versions of the same directory tree may report a different space breakdown between subtrees even if the total space is the same. GNU du doesn't promise to traverse directory trees in any particular order, which means it may encounter hard links in a different order in two versions of a directory tree. This can result in the space of hard linked files that cross between subtrees being attributed to different subtrees in different versions of the directory tree.
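Both consequences can be reproduced with a small Python sketch of du-style accounting. This is not how GNU du is implemented; the name du_like and its reverse parameter are made up for illustration, and for simplicity it sums the st_size of non-directory entries instead of allocated blocks.

```python
import os

def du_like(root, reverse=False):
    """Return {directory: bytes} for root and every directory below it.

    Each hard linked inode is charged once, to the first directory in
    which it is encountered. 'reverse' flips the (sorted) traversal
    order to imitate a tree that happens to be walked differently.
    """
    seen = set()    # (st_dev, st_ino) of multiply linked files already charged
    totals = {}

    def walk(path):
        total = 0
        with os.scandir(path) as it:
            entries = sorted(it, key=lambda e: e.name, reverse=reverse)
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                total += walk(entry.path)
                continue
            st = entry.stat(follow_symlinks=False)
            if st.st_nlink > 1:
                key = (st.st_dev, st.st_ino)
                if key in seen:
                    continue    # already charged to an earlier directory
                seen.add(key)
            total += st.st_size
        totals[path] = total
        return total

    walk(root)
    return totals
```

If a file is hard linked into both fred/alpha and fred/barney, du_like("fred") charges it to fred/alpha (which sorts first) and leaves it out of the fred/barney entry, while du_like("fred/barney") and du_like("fred", reverse=True) both include it in fred/barney. The total for fred is the same either way.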

If the two directory trees are only mostly the same to begin with and you're trying to compare them to pick out the differences, all of this can lead you astray. If you du each tree and then look for space differences in the subtrees to identify where things differ, you can wind up seeing a distorted picture of what's really going on.

If you need to compare space usage all of the way down, I think your best choice is to remember 'du -l'. It will give you a misleading picture about the total, aggregate space usage, but at least you'll have an honest picture of where things differ. If you want to check total aggregate space usage accurately, you can only do 'du -hs' on a single thing at once and you'll have to manually work through the trees piece by piece.
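As a rough illustration of the 'count every link' style of comparison, here is a Python sketch that sizes each directory with every hard link counted in full and then lists the directories whose sizes differ between two trees. The function names are hypothetical, and it uses st_size for non-directory entries rather than the allocated blocks that du reports, so the numbers won't match du's.

```python
import os

def sizes_counting_links(root):
    """Map each directory (relative to root) to its bytes, counting every hard link."""
    totals = {}
    # Bottom-up walk, so a child's total exists before its parent needs it.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        total = 0
        for name in filenames:
            total += os.lstat(os.path.join(dirpath, name)).st_size
        for name in dirnames:
            child = os.path.join(dirpath, name)
            if os.path.islink(child):
                # Symlink to a directory: not descended into, count the link itself.
                total += os.lstat(child).st_size
            else:
                total += totals.get(os.path.relpath(child, root), 0)
        totals[os.path.relpath(dirpath, root)] = total
    return totals

def compare_trees(a, b):
    """Return {relative_dir: (bytes_in_a, bytes_in_b)} for directories that differ."""
    sa, sb = sizes_counting_links(a), sizes_counting_links(b)
    return {d: (sa.get(d), sb.get(d))
            for d in sorted(set(sa) | set(sb))
            if sa.get(d) != sb.get(d)}
```

Because no inode is ever skipped, the per-directory numbers don't depend on traversal order, which is what makes them comparable between trees. The price is the one already mentioned: the top level total overstates the real space used whenever hard links exist.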


Comments on this page:

ncdu is a bit smarter because it traverses everything down, but still, similar problems are there.

This is much the same type of problem as asking how much memory a process is using – though a version with far fewer confounding factors, I guess. (Thankfully!) Looked at that way, the obvious answer is to either track all paths seen per inode or to traverse the tree twice to compile a list of inodes with multiple links on the first pass, to allow reporting space shared by multiple subtrees separately from the individual subtrees. When du was written both approaches would have been prohibitive, but I wonder if that remains the case in this modern gigabytes-of-RAM, flash storage era.
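One possible reading of the 'track what has been seen per inode' idea, as a Python sketch: for every multiply linked inode, remember which immediate subdirectories of the top level contain a link to it, then report space exclusive to each subdirectory separately from space shared between several. The name shared_report is made up, files sitting directly in the top level directory are ignored, and sizes are st_size rather than allocated blocks.

```python
import os
from collections import defaultdict

def shared_report(root):
    """Return (exclusive, shared).

    exclusive: {subdir: bytes} used only by that immediate subdirectory.
    shared: bytes in inodes that have links in more than one subdirectory.
    """
    owners = defaultdict(set)   # (st_dev, st_ino) -> subdirs holding a link
    sizes = {}                  # (st_dev, st_ino) -> size in bytes
    exclusive = {}
    for sub in sorted(os.listdir(root)):
        top = os.path.join(root, sub)
        if os.path.islink(top) or not os.path.isdir(top):
            continue
        exclusive[sub] = 0
        for dirpath, _dirnames, filenames in os.walk(top):
            for name in filenames:
                st = os.lstat(os.path.join(dirpath, name))
                if st.st_nlink == 1:
                    # Only one link, so it can't be shared; no need to track it.
                    exclusive[sub] += st.st_size
                    continue
                key = (st.st_dev, st.st_ino)
                owners[key].add(sub)
                sizes[key] = st.st_size
    shared = 0
    for key, subs in owners.items():
        if len(subs) == 1:
            exclusive[next(iter(subs))] += sizes[key]
        else:
            shared += sizes[key]
    return exclusive, shared
```

Only inodes with more than one link are kept in memory, so the cost grows with the number of hard linked files rather than with the size of the tree.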

Written on 14 February 2022.

Last modified: Mon Feb 14 23:52:56 2022