The format of strings in early (pre-C) Unix

January 7, 2016

The very earliest version of Unix was written before C was created and even after C's creation the whole system wasn't rewritten in it immediately. Courtesy of the Unix Heritage Society, much of the surviving source code from this era is available online. It doesn't make up a complete source tree for any of the early Research Unixes, but it does let us peek back in time to read code and documentation that was written in that pre-C era.

In light of a recent entry on C strings, I became curious about what the format of strings was in Unix back before C existed. Even in the pre-C era, the kernel and assembly language programs needed strings for some things; for example, system calls like creat() and open() have to take filename arguments in some form, and programs often have constant strings for messages that they'll print out. So I went and looked at early Unix source and documentation, for Research V1 (entirely pre-C), Research V2, and Research V3.

I will skip to the punchline:

Unix strings have been null-terminated from the very beginning of Unix, even before C existed.

Unix did not get null-terminated strings from C. Instead, C got null-terminated strings from Unix (specifically, Research V1 Unix). I don't know where V1 Unix got them from, if anywhere.

There are plenty of traces of this in the surviving Research V1 files. For instance, the V1 creat manpage says:

creat creates a new file or prepares to rewrite an existing file called name; name is the address of a null-terminated string. [...]

The V1 shell also contains uses of null-terminated strings. These are written with an interesting notation:

[...]
   bec 1f / branch if no error
   jsr r5,error / error in file name
       <Input not found\n\0>; .even
   sys exit
[...]
qchdir:
   <chdir\0>
glogin:
   <login\0>
[...]

Not all strings in the shell are null-terminated in this way, probably because it was natural to have their lengths just known in the code. If we need more confirmation, the error function specifically comments that a 0 byte is the end of the 'line' (here a string):

error:
   movb  (r5)+,och / pick up diagnostic character
   beq   1f / 0 is end of line
   mov   $1,r0 / set for tty output
   sys   write; och; 1 / print it
   br    error / continue to get characters
1:
   [... goes on ...]
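
For readers who don't read PDP-11 assembly, here is a rough C sketch of what that loop appears to be doing. The function name put_until_nul and its return value are my own inventions for illustration, not anything from the V1 source; the comments map each step back to the instruction quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical C analogue of the loop in the V1 shell's error routine.
   Writes msg to fd one byte at a time and stops at the first 0 byte.
   Returns the number of bytes written, or -1 if a write fails. */
long put_until_nul(int fd, const char *msg)
{
    long count = 0;

    assert(msg != NULL);
    for (;;) {
        char och = *msg++;            /* movb (r5)+,och: pick up next character */
        if (och == '\0')              /* beq 1f: a 0 byte ends the string */
            break;
        if (write(fd, &och, 1) != 1)  /* sys write; och; 1: print one byte */
            return -1;
        count++;                      /* br error: go around again */
    }
    return count;
}
```

Calling put_until_nul(1, "Input not found\n") would print the message on standard output, much as the assembly does after setting r0 to 1. Nothing in the loop needs to know the length in advance, which is the whole point of the terminator.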

I suspect that one reason this format for strings was adopted was simply that it was easy to express and support in the assembler. Based on the usage here, a string was simply a '<....>' block that supported some escape sequences, including \0 for a null byte; presumably this was basically copied straight into the object file after escape translation. There's no need for either the assembler or the programmer to count up the string length and then get that too into the object code somehow.

(It turns out that the V1 as manpage documents all of this.)
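
To make the 'easy to support in the assembler' point concrete, here is a minimal C sketch of the kind of escape translation I'm imagining. This is my guess at the general shape, not the actual V1 assembler's code. The function name expand_angle_string is hypothetical, only \n and \0 are attested in the source quoted above, and treating any other escaped character as itself is purely an assumption on my part.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: expand the text found between '<' and '>' into
   raw bytes. Only \n and \0 appear in the quoted V1 source; passing any
   other backslash-escaped character through unchanged is an assumption.
   Returns the number of bytes stored, or (size_t)-1 if out is too small. */
size_t expand_angle_string(const char *body, unsigned char *out, size_t cap)
{
    size_t n = 0;

    assert(body != NULL && out != NULL);
    while (*body != '\0') {
        unsigned char c = (unsigned char)*body++;
        if (c == '\\' && *body != '\0') {
            char e = *body++;
            if (e == 'n')
                c = '\n';              /* \n becomes a newline byte */
            else if (e == '0')
                c = 0;                 /* \0 becomes a null byte */
            else
                c = (unsigned char)e;  /* assumed: other escapes are literal */
        }
        if (n == cap)
            return (size_t)-1;
        out[n++] = c;                  /* copied straight out, no length prefix */
    }
    return n;
}
```

Given the body of <chdir\0>, this produces six bytes, the last of which is the terminator that the programmer wrote by hand. The translator never has to emit or track a count, so a terminated string costs the assembler nothing beyond the escape handling.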

PS: it's interesting that although the V1 write system call supports writing many bytes at once, the error code here simply uses the brute force approach of writing one character at a time. Presumably that was just simpler to code.
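
For contrast, a multi-byte version has to find the end of the string before it can make the call. Here is a hedged C sketch of that alternative. The name put_all_at_once is made up for illustration, and it uses the modern write() call, not the V1 calling convention.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical alternative to byte-at-a-time output: scan for the
   terminating 0 first, then hand the whole string to a single write call.
   Returns whatever write returns, which may be a short count or -1. */
long put_all_at_once(int fd, const char *msg)
{
    size_t len;

    assert(msg != NULL);
    len = strlen(msg);                 /* one pass to find the 0 byte */
    return (long)write(fd, msg, len);  /* one system call for all bytes */
}
```

This trades many system calls for an extra pass over the string to count it. In assembly that counting loop is more code than the one-character loop, which fits with the guess that the simple version was just easier to write.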

Update: See the comments for interesting additional information and pointers. Other people have added a bunch of good stuff.


Comments on this page:

By anonymous at 2016-01-08 05:28:12:

null-terminated strings originate from PDP-11 and PDP-10 assembly languages. People researching on Unix used to work on these machines. So Unix strings come from DEC actually.

By cks at 2016-01-09 19:19:41:

The discussion on Hacker News has some interesting and useful information. Several comments that particularly struck me:

noselad's comment, quoting Dennis Ritchie's history of C:

None of BCPL, B, or C supports character data strongly in the language; each treats strings much like vectors of integers and supplements general rules by a few conventions. In BCPL, the first packed byte contains the number of characters in the string; in B, there is no count and strings are terminated by a special character, which B spelled ‘*e’. This change was made partially to avoid the limitation on the length of a string caused by holding the count in an 8- or 9-bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator.

The whole paper is a gold mine of information for the environment that Unix's early history took place in, and thus some of the factors that influenced its choices.

jamesbowman talks about some PDP-11 instruction set features that make null-terminated strings especially efficient. I suspect that these features were also in the PDP-7, which was what the very first Unix was written for.

(Note that, per the paper (and many other sources), Ken Thompson is the primary person who wrote the very earliest Unix versions, not Dennis Ritchie, so it was very likely Ken Thompson who made these pre-C string format decisions. But it sounds like the people at Bell Labs were all working together with each other in a common environment, and likely all influencing each other.)

HillR talks about DEC's PDP-11 and PDP-10 assemblers directly supporting this style of strings, echoing anonymous's comment here. Note that HillR is wrong about BCPL strings; per above, they had explicit lengths (and a length limit). At the very least the DEC assemblers show that null-terminated strings were an idea that was rattling around computing at the time. Since Bell Labs seem to have cross-built at least the PDP-7 version of Unix (from a non-DEC machine), I don't know how much they directly used the DEC assemblers.

Written on 07 January 2016.

Last modified: Thu Jan 7 02:21:26 2016