Command hashing in Unix shells is probably no longer worth it

July 20, 2023

Recently, I had an experience with Bash (and not for the first time with this sort of thing):

Dear bash: if you have a path for a command cached and trying to execute that file gets ENOENT, perhaps you should, you know, *check $PATH again*?

Then I endorsed getting rid of shell command hashing.

Once upon a time, Unix computers were slow, system calls were slow, Unix kernels only cached simple things, RAM was small relative to the size of directories, $PATH often contained many directories, and your filesystems were on slow spinning rust. In that environment, 4.2 BSD's csh(1) introduced a feature where it would maintain an internal map (a 'hash') from command names like 'make' to the full path it had found for them in your $PATH. rather than search through $PATH each time. This csh feature is command hashing, and it was copied into (some) later shells, including Bash, so that they all cached the file path of commands.

Any time you have a cache, you have to think about cache invalidation. The csh answer to cache invalidation was that it didn't, or more exactly it gave you a builtin (called 'rehash') to flush the cache and it was up to you to do so when you felt like it, or when you hit an error caused by this. This behavior was then generally copied by later shells, although Bash has a non-default option to sometimes automatically flush a bit of the cache (in the 'checkhash' shopt).

Generally, none of those original early 1980s things are the case any more on modern Unix machines. Machines are fast, system calls are fast, $PATH often contains only a few directories, memory is large compared to the size of things like /usr/bin's on-disk directory (even with the growth in /usr/bin's contents), filesystems are often on fast or very fast disks, and kernels got more sophisticated. Specifically, people realized that name lookups happen so often in Unix that it was worth building in kernel cache structures specifically for them, generally including also negative entries ('this name is not in this directory'); this is Linux's dentry cache (also, and negative dentries) and FreeBSD's kernel name cache. Under typical circumstances, name lookup in frequently used directories is now a very fast operation.

As we know from Amdahl's law, optimizing an operation that's already very fast doesn't provide you with much time savings. In the mean time, shells haven't gotten much better about their command hash cache invalidation, which means that every so often they get things actively wrong, either not running a command when they could or running the wrong command given your $PATH's current state.

Written on 20 July 2023.
« HTTP has become the default, universal communication protocol
When we take things out of service, I should write up how they worked »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Thu Jul 20 21:49:37 2023
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.