A find
optimization and a piece of history, all in one
One of the floating pieces of modern Unix lore is that if you are doing
a find
that matches against both filenames and other properties of the
file, it's best to put the filename match first. That is, if you want to
find zero-sized object files the right order is:
find . -name '*.o' -size 0 -print
I called this a piece of modern Unix lore for good reason; this wasn't
necessarily true in the old days (and even today it isn't always true,
depending on the filesystem and how smart your version of find
is).
First, let's cover why this can be the faster order. When find
is
processing a given directory entry it already has the name, but it
doesn't know the file size; to find out the file size it would have to
stat()
the file, which takes an extra system call and possibly an
extra disk read IO. So if find
can make a decision on the directory
entry just by checking its name, it can save a stat()
.
But wait. In order to properly traverse a directory tree, find
needs
to know if a directory entry is a subdirectory or something else, and in
the general case that takes a stat()
. This gets us back to being just
as slow, because regardless of the order of find
operations find
is
going to have to stat()
the name sooner or later just to find out if it
needs to chdir()
into it. So how can find
still optimize this?
(There are some clever optimizations that find
can do under some
circumstances, but we'll skip those for now.)
What happened is that a while back, Unix filesystem developers realized
that it was very common for programs reading directories to need to know
a bit more about directory entries than just their names, especially
their file types (find
is the obvious case, but also consider things
like 'ls -F
'). Given that the type of an active inode never changes,
it's possible to embed this information straight in the directory entry
and then return this to user level, and that's what developers did; on
some systems, readdir(3)
will now return directory entries with an
additional d_type
field that has the directory entry's type.
(This required changes to both filesystems, to embed the information in the on-disk information, and the system call API, to get it to user space. Hence it only works on some filesystems on some versions of Unix.)
Given d_type
, find
can completely avoid stat()
'ing directory
entries if it only needs to know their name or their type. However,
it has to stat()
the directory entry if it needs to know more
information, such as the size.
(And if the d_type
of directory entries ever gets corrupted, you
can get very odd results.)
|
|