== The format of strings in early (pre-C) Unix
The very earliest version of Unix was written before C was created,
and even after C's creation the whole system wasn't rewritten in
it immediately. Courtesy of [[the Unix Heritage Society
http://www.tuhs.org/]], much of the surviving source code from this
era is available online. It doesn't make up a complete source tree
for any of the early Research Unixes, but it does let us peek back
in time to read code and documentation that was written in that
pre-C era.
In light of [[a recent entry on C strings
../programming/CNullStringsDefense]], I became curious about what
the format of strings was in Unix back before C existed. Even in
the pre-C era, the kernel and assembly language programs needed
strings for some things; for example, system calls like _creat()_
and _open()_ have to take filename arguments in some form, and
programs often have constant strings for messages that they'll print
out. So I went and looked at early Unix source and documentation,
for Research V1 (entirely pre-C), Research V2, and Research V3.
I will skip to the punchline:
> ~~Unix strings have been null-terminated from the very beginning
> of Unix, even before C existed~~.
Unix did not get null-terminated strings from C. Instead, C got
null-terminated strings from Unix (specifically, Research V1
Unix). I don't know where V1 Unix got them from, if anywhere.
There are plenty of traces of this in [[the surviving Research V1
files http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1]]. For instance,
the [[V1 _creat_ manpage
http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/man/man2/creat.2]] says:
> creat creates a new file or prepares to rewrite an existing file
> called name; name is the address of a null-terminated string. [...]
The [[V1 shell http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/sh.s]]
also contains uses of null-terminated strings. These are written with
an interesting notation:
.pn prewrap on
> [...]
> bec 1f / branch if no error
> jsr r5,error / error in file name
> <Input file\n\0>; .even
> sys exit
> [...]
> qchdir:
> <chdir\0>
> glogin:
> <login\0>
> [...]
Not all strings in the shell are null-terminated in this way,
probably because their lengths were already known at the point
where the code used them. If we need more confirmation, the _error_ function
specifically comments that a _0_ byte is the end of the 'line' (here
a string):
> error:
> movb (r5)+,och / pick up diagnostic character
> beq 1f / 0 is end of line
> mov $1,r0 / set for tty output
> sys write; och; 1 / print it
> br error / continue to get characters
> 1:
> [... goes on ...]
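
For readers who don't read PDP-11 assembly, here is a rough C rendering of the loop in _error_. It is only a sketch: the function name `put_until_nul` and the `fd` parameter are my own inventions (the assembly loads 1 into r0 for the tty before each write), and the return value is added purely so the behavior can be checked.

```c
#include <unistd.h>

/* Hypothetical C equivalent of the V1 shell's error loop: emit one byte
 * per write() call and stop at the first 0 byte. Returns the number of
 * bytes written, or -1 if a write() fails. */
long put_until_nul(int fd, const char *s)
{
    long count = 0;

    while (*s != '\0') {              /* beq 1f: 0 is end of line */
        if (write(fd, s, 1) != 1)     /* sys write; och; 1 */
            return -1;
        s++;                          /* movb (r5)+ advances the pointer */
        count++;
    }
    return count;                     /* the 0 byte itself is never written */
}
```

Calling `put_until_nul(1, "some message\n")` would print the message on standard output. Nothing anywhere records the length; the terminating 0 byte is the only end marker.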
I suspect that one reason this format for strings was adopted was
simply that it was easy to express and support in the assembler.
Based on the usage here, a string was simply a '_<....>_' block
that supported some escape sequences, including _\0_ for a null
byte; presumably this was basically copied straight into the object
file after escape translation. Neither the assembler nor the
programmer has to count up the string's length and then somehow get
that length into the object code as well.
(It turns out that [[the V1 _as_ manpage
http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/man/man1/as.1]]
documents all of this.)
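
To make the 'copied straight into the object file after escape translation' idea concrete, here is a small C sketch of what such a translation step could look like. This is not the V1 assembler's code. The function name is hypothetical, and the escapes handled (_\0_, _\n_, and 'anything else stands for itself') are an assumption for illustration; the _as_ manpage is the authority on the real set.

```c
#include <stddef.h>

/* Hypothetical sketch: translate the text between '<' and '>' into raw
 * bytes. 'out' must have room for at least 'len' bytes. Returns the
 * number of bytes emitted. No length prefix is ever produced; a
 * terminator exists only if the programmer wrote \0. */
size_t translate_string_body(const char *src, size_t len, unsigned char *out)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)src[i];

        if (c == '\\' && i + 1 < len) {
            i++;                                 /* consume the escaped char */
            switch (src[i]) {
            case '0': c = 0;    break;           /* \0 becomes a null byte */
            case 'n': c = '\n'; break;           /* \n becomes a newline */
            default:  c = (unsigned char)src[i]; /* assumed: literal char */
                      break;
            }
        }
        out[n++] = c;                            /* copy byte to the output */
    }
    return n;
}
```

Given the seven characters `chdir\0` as the body, this emits six bytes: the five letters followed by a single 0 byte. The translator never has to know or record how long the string is.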
PS: it's interesting that although [[the V1 _write_ system call
http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/man/man2/write.2]]
supports writing many bytes at once, the _error_ code here simply
does brute-force output, one character at a time. Presumably that
was just simpler to code.
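
As a hedged illustration of the tradeoff, here is a C sketch of the alternative: writing the whole string with a single call. The function name is made up, and this is modern C, not anything from V1. The point is that a single write needs the length up front, so the string has to be scanned for its 0 byte first, which is extra code that the one-character-at-a-time loop doesn't need.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical sketch: output a null-terminated string with one write().
 * Returns what write() returns (bytes written, or -1 on error). */
long put_in_one_write(int fd, const char *s)
{
    size_t len = strlen(s);      /* extra pass to find the 0 byte */

    return (long)write(fd, s, len);
}
```

A careful version would also need to cope with write() accepting fewer bytes than requested, which this sketch ignores.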
~~Update~~: See the comments for interesting additional information
and pointers. Other people have added a bunch of good stuff.