What the original 4.2 BSD csh hashed (which is not what I thought)

April 20, 2024

Recently, Unix shells keeping track of where they'd found commands came up on the Fediverse again, as it does every so often; for instance, last year I advocated for doing away with the whole thing. As far as I know, (Unix) shell command hashing originated with BSD Unix's csh, which added command hashing and a 'rehash' builtin. However, if you actually read the 4.2 BSD csh(1) manual page, it says something a bit odd (emphasis mine):

rehash: Causes the internal hash table of the contents of the directories in the path variable to be recomputed. This is needed if new commands are added to directories in the path while you are logged in. [...]

The way command hashing typically works in modern shells is that the shell remembers the specific full path to a given command (or sometimes that the command doesn't exist). This is explicitly described in the Bash manual, which says (for example) 'Bash uses a hash table to remember the full pathnames of executable files'. In this case, if you or someone else adds a new command to something in $PATH and you've never run that command before (because it didn't use to exist), you're fine and don't need to rehash; your shell will automatically go looking for a new command in $PATH.
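As a rough illustration of this modern approach, here is a minimal Python sketch of a cache that remembers full paths. The PathCache class and its method names are made up for this example and are not taken from any real shell, and unlike some shells it doesn't remember that a command was missing.

```python
import os

class PathCache:
    """Remember the full path of each command after its first successful lookup."""

    def __init__(self, path_dirs):
        self.path_dirs = list(path_dirs)
        self.cache = {}  # command name -> full path

    def lookup(self, name):
        # A command we've already found is returned without searching again.
        if name in self.cache:
            return self.cache[name]
        # Anything not in the cache triggers a walk of the directories, so a
        # brand new command is found without any explicit rehash.
        for d in self.path_dirs:
            candidate = os.path.join(d, name)
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                self.cache[name] = candidate
                return candidate
        return None
```

Because only successful lookups are stored here, a command added after the cache was created is still found the first time you ask for it.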

It turns out that the 4.2 BSD csh did not hash commands this way. Instead, well, let's quote a comment from sh.exec.c:

Xhash is an array of HSHSIZ chars, which are used to hash execs. If it is allocated, then to tell whether ``name'' is (possibly) present in the i'th component of the variable path, you look at the i'th bit of xhash[hash("name")]. This is setup automatically after .login is executed, and recomputed whenever ``path'' is changed.

To translate that, csh does not 'hash' where commands are found the way modern shells do. Instead of looking up commands and then remembering where it found them, it scans all of the directories on your $PATH and remembers the hash values of the names it saw in each of them. When csh tries to run a command, it gets the hash value of the command name, looks it up in the hash table, and skips all $PATH entries that hash value definitely isn't in. If you run a newly added command, the odds are very low that its name will hash to a hash value that has the right bit set in its hash table entry.
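To make the scheme concrete, here is a small Python sketch of the idea as described in that comment. It is not a port of the csh code: name_hash() is a stand-in hash function I made up, the directory contents are passed in as plain lists instead of being read from disk, and I'm assuming that each table entry is an 8-bit char so that the bit for the i'th path component is picked with i % 8.

```python
HSHSIZ = 511  # number of one-byte table entries, matching the default csh size

def name_hash(name):
    # Hypothetical stand-in hash; csh's real hash function is not reproduced here.
    h = 0
    for ch in name:
        h = (h * 31 + ord(ch)) % HSHSIZ
    return h

def build_table(dir_contents):
    """dir_contents holds one list of command names per path component, in order."""
    xhash = bytearray(HSHSIZ)
    for i, names in enumerate(dir_contents):
        for name in names:
            # Record that something hashing to this value was seen in component i.
            xhash[name_hash(name)] |= 1 << (i % 8)
    return xhash

def candidate_dirs(xhash, name, ndirs):
    """Indexes of the path components that may hold name; all others get skipped."""
    entry = xhash[name_hash(name)]
    return [i for i in range(ndirs) if entry & (1 << (i % 8))]

# Usage: a command added after the table was built is invisible until a rebuild.
path_dirs = [["ls", "cat"], ["cc", "make"]]
table = build_table(path_dirs)
path_dirs[1].append("vi")                            # new command shows up later
stale = candidate_dirs(table, "vi", len(path_dirs))  # old table: nowhere to look
table = build_table(path_dirs)                       # roughly what 'rehash' does
fresh = candidate_dirs(table, "vi", len(path_dirs))  # now component 1 is a candidate
```

In this sketch the stale lookup comes back empty, so every directory would be skipped, while the lookup after the rebuild points at the second directory.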

There can be hash value collisions between different command names and if you have more than 8 $PATH entries, more than one entry can set the same bit, so finding a set bit merely means that potentially the command is there. So this is not as good as remembering exactly where the command is, but on the other hand it takes up a lot less memory; the default csh hash size is 511 bytes. It also means that you definitely want to do 'rehash' when you or someone else modifies any directory on your $PATH, because the odds are very high that any new additions won't be properly recognized.
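The sharing of bits once you have more than 8 $PATH entries can be shown with a few lines of Python. As before this is only a sketch; the function names are invented for illustration and the modulo 8 wraparound is my assumption about how an 8-bit table entry would be indexed.

```python
def bit_for_component(i):
    # Assumption: a table entry is one 8-bit char, so the index wraps modulo 8.
    return 1 << (i % 8)

def possible_components(entry, ndirs):
    """All path components whose bit is set in this table entry."""
    return [i for i in range(ndirs) if entry & bit_for_component(i)]

entry = 0
entry |= bit_for_component(8)           # a name seen only in the 9th path component
hits = possible_components(entry, 10)   # both component 0 and component 8 match
```

Here the first $PATH entry looks like a possible home for the command even though only the ninth one has it, so the shell would have to check both.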

(What 'rehash' does is that it re-runs the code that sets up this hash table, which is also run when $PATH is changed and so on.)

Written on 20 April 2024.
Last modified: Sat Apr 20 23:38:07 2024