2010-11-08
A find optimization and a piece of history, all in one
One of the floating pieces of modern Unix lore is that if you are doing
a find that matches against both filenames and other properties of the
file, it's best to put the filename match first. That is, if you want to
find zero-sized object files, the right order is:
find . -name '*.o' -size 0 -print
I called this a piece of modern Unix lore for good reason; this wasn't
necessarily true in the old days (and even today it isn't always true,
depending on the filesystem and how smart your version of find is).
First, let's cover why this can be the faster order. When find is
processing a given directory entry it already has the name, but it
doesn't know the file size; to find out the file size it would have to
stat() the file, which takes an extra system call and possibly an
extra disk read. So if find can make a decision on the directory
entry just by checking its name, it can save a stat().
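As a rough illustration of the ordering (a sketch in Python rather than
anything find itself does; the function name zero_sized_matches is made
up for this example), here is a name-first check over a single directory
that counts how many stat() calls it ends up making:

```python
import fnmatch
import os


def zero_sized_matches(directory, pattern):
    """Return (sorted matching names, number of stat calls) for one directory.

    The name test runs first, so entries whose names do not match the
    pattern are rejected without ever being stat()'d.
    """
    matches = []
    stat_calls = 0
    for name in os.listdir(directory):  # names only, no per-file metadata
        if not fnmatch.fnmatchcase(name, pattern):
            continue  # rejected on the name alone: no stat() needed
        # Only now pay for the extra system call. lstat() does not follow
        # symlinks, which is also find's default behaviour.
        st = os.lstat(os.path.join(directory, name))
        stat_calls += 1
        if st.st_size == 0:
            matches.append(name)
    return sorted(matches), stat_calls
```

In a directory with two '*.o' files and a thousand other files, this
makes two stat calls. Doing the size test first would make one per
entry.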
But wait. In order to properly traverse a directory tree, find needs
to know if a directory entry is a subdirectory or something else, and in
the general case that takes a stat(). This gets us back to being just
as slow, because regardless of the order of its operations, find is
going to have to stat() the name sooner or later just to find out if it
needs to chdir() into it. So how can find still optimize this?
(There are some clever optimizations that find can do under some
circumstances, but we'll skip those for now.)
What happened is that a while back, Unix filesystem developers realized
that it was very common for programs reading directories to need to know
a bit more about directory entries than just their names, especially
their file types (find is the obvious case, but also consider things
like 'ls -F'). Given that the type of an active inode never changes,
it's possible to embed this information straight in the directory entry
and then return this to user level, and that's what developers did; on
some systems, readdir(3) will now return directory entries with an
additional d_type field that has the directory entry's type.
(This required changes to both filesystems, to store the type in the
on-disk directory entries, and the system call API, to get it to user
space. Hence it only works on some filesystems on some versions of
Unix.)
Given d_type, find can completely avoid stat()'ing directory
entries if it only needs to know their name or their type. However,
it has to stat() the directory entry if it needs to know more
information, such as the size.
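Python's os.scandir() is one place where this is visible from user
level. Its documentation says that is_dir() on an entry usually needs no
system call when the OS supplies the file type with the directory entry,
while stat() on an entry does need one on Unix. The following is a
minimal sketch of a find-like walk built on that, not a description of
how any real find is implemented; the function name find_zero_sized is
invented for the example.

```python
import fnmatch
import os


def find_zero_sized(root, pattern):
    """Roughly: find root -name pattern -size 0 -print (root itself is not tested)."""
    found = []
    stack = [root]
    while stack:
        directory = stack.pop()
        with os.scandir(directory) as entries:
            for entry in entries:
                # Traversal decision: is_dir() can usually be answered from
                # the type returned with the directory entry (d_type). If
                # the filesystem does not provide it, Python falls back to
                # a stat call internally.
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                # Name test first; only entries that pass it are stat()'d
                # to learn their size.
                if fnmatch.fnmatchcase(entry.name, pattern):
                    if entry.stat(follow_symlinks=False).st_size == 0:
                        found.append(entry.path)
    return sorted(found)
```

On a filesystem that returns types with directory entries, the only
explicit stat calls here are for entries whose names match the pattern.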
(And if the d_type of directory entries ever gets corrupted, you
can get very odd results.)