A find optimization and a piece of history, all in one

November 8, 2010

One of the floating pieces of modern Unix lore is that if you are doing a find that matches against both filenames and other properties of the file, it's best to put the filename match first. That is, if you want to find zero-sized object files the right order is:

find . -name '*.o' -size 0 -print

I called this a piece of modern Unix lore for good reason; this wasn't necessarily true in the old days (and even today it isn't always true, depending on the filesystem and how smart your version of find is).

First, let's cover why this can be the faster order. When find is processing a given directory entry it already has the name, but it doesn't know the file size; to find out the file size it would have to stat() the file, which takes an extra system call and possibly an extra disk read IO. So if find can make a decision on the directory entry just by checking its name, it can save a stat().

But wait. In order to properly traverse a directory tree, find needs to know if a directory entry is a subdirectory or something else, and in the general case that takes a stat(). This gets us back to being just as slow, because regardless of the order of find operations find is going to have to stat() the name sooner or later just to find out if it needs to chdir() into it. So how can find still optimize this?

(There are some clever optimizations that find can do under some circumstances, but we'll skip those for now.)

What happened is that a while back, Unix filesystem developers realized that it was very common for programs reading directories to need to know a bit more about directory entries than just their names, especially their file types (find is the obvious case, but also consider things like 'ls -F'). Given that the type of an active inode never changes, it's possible to embed this information straight in the directory entry and then return this to user level, and that's what developers did; on some systems, readdir(3) will now return directory entries with an additional d_type field that has the directory entry's type.

(This required changes to both filesystems, to embed the information in the on-disk information, and the system call API, to get it to user space. Hence it only works on some filesystems on some versions of Unix.)

Given d_type, find can completely avoid stat()'ing directory entries if it only needs to know their name or their type. However, it has to stat() the directory entry if it needs to know more information, such as the size.

(And if the d_type of directory entries ever gets corrupted, you can get very odd results.)

Written on 08 November 2010.
« When Linux's rp_filter might make sense
Directory link counts and a find trick »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Mon Nov 8 23:41:54 2010
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.