Wandering Thoughts archives

2008-05-25

Making a good Unix glue language

The problem with a search for alternatives to the Bourne shell is that the Bourne shell is still one of the best Unix programming languages. Or perhaps it would be clearer to call it a 'Unix glue language', a language for connecting existing programs together.

You might sensibly ask what the problems are with things like Perl and Python as Unix glue languages, and my answer would be that they make it too hard to interact with external programs. This is important, because the way you write small programs fast on Unix is to take advantage of as many existing programs as possible. Perl and Python are great if you want to implement everything yourself, and sometimes this is the right answer, but not so great if you want to hastily bang something together.

There are several things that I think make for good interaction with programs:

  • make it easy to get results from programs in useful forms. The Bourne shell is only close on this, since it doesn't easily let you pick out individual words from multi-word output.

    (This probably implies that the language is nearly typeless, at least for output from programs.)

  • let you run code in the middle of a pipeline of commands. This is the killer feature for fully participating in pipelines, instead of just running them (as Perl and Python do).

    (For bonus points, the code should be able to change the global state of the program instead of just running in a sub-process.)

  • have built-in, easy to use features for modifying output, since a lot of the time programs don't output exactly what you want and you need to transform it a bit (which is what sed and awk get used for a lot of the time, usually in hard to follow ways).

(Of course, a good Unix glue language should still have all the attributes of a good programming language, many of which the Bourne shell lacks. You certainly ought to be able to write normal code without using external programs at all.)
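As a rough illustration of the first two points, here is a minimal sketch of what they cost in Python using only the standard subprocess module. The helper names (words_from, upper_filter) and the external programs used with them are my own choices for the example, not anything from a real glue language. Note that the middle stage buffers all of its input before handing it on; this sidesteps pipe deadlocks but means it is not truly streaming.

```python
import subprocess

def words_from(cmd):
    """Run cmd (an argument list) and return its output split into words."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, check=True, text=True)
    return result.stdout.split()

def upper_filter(producer_cmd, consumer_cmd):
    """Run 'producer | <Python code> | consumer', upper-casing lines in the middle.

    Returns the consumer's output and the number of lines that passed through.
    """
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE, text=True)
    consumer = subprocess.Popen(consumer_cmd, stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE, text=True)
    transformed = []
    with producer.stdout:
        for line in producer.stdout:
            # This code runs in our own process, so it can update our state.
            transformed.append(line.upper())
    producer.wait()
    # Hand everything to the consumer at once and collect what it prints.
    output, _ = consumer.communicate("".join(transformed))
    return output, len(transformed)
```

Something like words_from(["date"]) gives you individual words to pick from, and upper_filter(["ls"], ["sort"]) puts Python code between two programs. It works, but the amount of plumbing involved is a fair picture of why this feels heavier than a shell pipeline.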

UnixGlueLanguage written at 23:46:30

2008-05-11

The history of readdir()

In the old days of V7 Unix, directories weren't quite files but they were close enough that you could open and read() them directly, and they had a simple enough structure that there was no library routine to parse their contents; programs like the V7 ls just did it themselves. A good part of the reason for this was that filenames were short (14 characters max), so directory entries could be fixed-sized objects.
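To show how little there was to parse, here is a minimal sketch of decoding fixed-size V7-style directory entries. It assumes the classic layout of a 2-byte inode number followed by a 14-byte NUL-padded name, in little-endian byte order as on the PDP-11; the function name and the sample data are made up for the example.

```python
import struct

# Assumed V7 layout: 2-byte inode number, then a 14-byte NUL-padded name.
V7_ENTRY = struct.Struct("<H14s")

def parse_v7_directory(raw):
    """Return (inode, name) pairs from raw V7-style directory bytes."""
    entries = []
    for offset in range(0, len(raw) - V7_ENTRY.size + 1, V7_ENTRY.size):
        inode, name = V7_ENTRY.unpack_from(raw, offset)
        if inode == 0:
            continue  # inode 0 marks an unused slot
        # A full 14-character name has no terminating NUL at all.
        entries.append((inode, name.split(b"\0", 1)[0].decode("ascii")))
    return entries

# Hypothetical directory contents, built by hand for illustration.
sample = (V7_ENTRY.pack(2, b".") + V7_ENTRY.pack(2, b"..") +
          V7_ENTRY.pack(0, b"deleted") + V7_ENTRY.pack(17, b"fourteen_chars"))
```

Every entry is exactly 16 bytes, so stepping through the directory is just a fixed-stride loop.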

In 4BSD, Berkeley expanded the maximum length of filenames from 14 characters to much larger, and since most filenames were still short, they opted to save disk space by turning directory entries into variable length objects. This made reading directory entries a sufficiently complicated job that they introduced the readdir() C library function to do it for you; however, under the hood the C library still read() the directory as if it was a file, getting the raw filesystem data. Because it is what's most useful for most programs, readdir() returned one directory entry at a time.
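A minimal sketch of why variable-length entries are more work to read: each record has to say how long it is, and the reader has to follow those lengths. The layout here is modeled loosely on the 4.2BSD directory record (4-byte inode, 2-byte record length, 2-byte name length, then the name, padded to a 4-byte boundary) and assumes little-endian byte order. It ignores details of the real on-disk format, such as records being confined to disk blocks, and pack_entry and parse_entries are names invented for the example.

```python
import struct

# Assumed header: inode number, total record length, name length.
HEADER = struct.Struct("<IHH")

def pack_entry(inode, name):
    """Build one record: header, name, NUL terminator, padded to 4 bytes."""
    reclen = (HEADER.size + len(name) + 1 + 3) & ~3
    body = HEADER.pack(inode, reclen, len(name)) + name
    return body.ljust(reclen, b"\0")

def parse_entries(raw):
    """Yield (inode, name) one entry at a time by following record lengths."""
    offset = 0
    while offset + HEADER.size <= len(raw):
        inode, reclen, namlen = HEADER.unpack_from(raw, offset)
        if reclen < HEADER.size:
            break  # corrupt data; stop rather than loop forever
        start = offset + HEADER.size
        if inode != 0:
            yield inode, raw[start:start + namlen].decode("ascii")
        offset += reclen
```

Handing back one entry per step, as the generator does here, mirrors the one-entry-at-a-time interface that readdir() gives to programs.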

I believe that Sun is responsible for the next step, when they came up with NFS. Sun realized that user-level code knowing the filesystem format of directory entries wasn't really very appropriate for a true network filesystem, so they introduced a new system call, getdirentries(), to get directory entries in a filesystem independent format. Although this was the only way to get directory entries from NFS filesystems, you could still directly read() directories on local filesystems.

(Sun couldn't just make readdir() be a system call because it was already fixed at returning only one entry per call, which is usually considered too inefficient for a system call. As its name suggests, getdirentries() returns a bunch of entries (however many fit into the buffer that you provide).)

Sun's good idea was gradually picked up by other people, including the main BSD line of development that resulted in 4.4 BSD. (Note that some Unixes use the name getdents() for the actual system call, instead of getdirentries(). Amusingly, this now includes Solaris, which doesn't even have a getdirentries() compatibility routine.)

At some point, Linux took the extra step and forbade read() on directories, forcing you to use the system call (or more likely, to use readdir() and let it worry about things). This had the useful result that you could no longer accidentally cat a directory and get all sorts of gibberish spewed on your screen, without requiring cat (and everything else that reads files) to explicitly refuse to touch directories. This feature does not seem to have spread to Solaris or the *BSDs, at least as far as I can see.
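You can see the difference for yourself with a small sketch like the following, which assumes a Unix system. The function names are made up for the example. On Linux I would expect the raw read to be refused with EISDIR; what other systems do depends on the kernel and the filesystem.

```python
import errno
import os

def read_directory_raw(path, size=512):
    """Try to read() a directory as if it were a file.

    Returns the raw bytes if the kernel allows it, or the errno name if not.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.read(fd, size)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        os.close(fd)

def list_directory(path):
    """The portable route: let the library worry about directory formats."""
    with os.scandir(path) as entries:
        return sorted(entry.name for entry in entries)
```

Calling read_directory_raw(".") shows what your kernel does with a raw read(), while list_directory(".") goes through the library (opendir() and readdir() underneath in CPython on Unix, as far as I know).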

(I was inspired to write this by the recent report of fixing a long-standing seekdir() bug.)

ReaddirHistory written at 22:20:54

