A defense of C's null-terminated strings
In certain circles that I follow, it's lately been popular to frown on C's choice of simple null-terminated strings. Now, there are certainly any number of downsides to C's choice of how to represent strings (including that it makes many string operations not all that efficient, since they need to scan the string to determine its length), but as it happens I think that C made the right choice for the language it was (and is), and especially for the environment it was in. Namely, C is a language relatively close to machine code, especially as it was in the beginning, and it was used on very small machines.
The other real option for representing strings is for them to have an explicit length field at a known spot (usually the start), plus of course their actual data. This is perfectly viable in C (people write libraries to do this), but putting it in the language as the official string representation adds complications, some of them immediately obvious and others somewhat more subtle. So, let's talk about those complications as a way of justifying C's choice.
The obvious issue is how big you make the length field and possibly how you encode it. A fixed-size length field commits you to a maximum string size (if you make it 32 bits you're limited to 4GB strings, for example) and also potentially adds extra overhead for short strings (of which there are many). If you use a variable-sized length encoding, you can somewhat avoid overhead and size limits at the cost of more complications when you need to look at the size or work out how much memory a particular sized string needs. C's null byte terminator imposes a fixed and essentially minimal size penalty regardless of the length of the string.
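To put rough numbers on this, here's a minimal C sketch (the struct name lenstr is mine, not from any real library) of what a fixed 32-bit length prefix costs for a short string compared to C's single terminator byte:

    /* Compare the per-string cost of a 32-bit length prefix versus a
       single '\0' terminator for a short string. */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct lenstr {
        uint32_t len;     /* fixed-size length field: caps strings at 4GB */
        char     data[];  /* string bytes follow; no terminator needed    */
    };

    int main(void) {
        const char *s = "hi";
        /* null-terminated: the bytes plus one '\0' terminator */
        size_t c_cost = strlen(s) + 1;
        /* length-prefixed: the bytes plus the length field itself */
        size_t l_cost = strlen(s) + offsetof(struct lenstr, data);
        printf("null-terminated: %zu bytes, length-prefixed: %zu bytes\n",
               c_cost, l_cost);
        return 0;
    }

For a two-character string the length prefix is three times the overhead of the terminator byte, which is exactly the kind of thing that mattered on the small machines C grew up on.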
The indirect issue is that moving to a length-first string representation essentially forces you to create two new complex types in the language. To see why this is so, let's write some code in a Go-like language with inferred types:
    string s = "abcdef";

    void frob() {
        char *cp
        p := s+1       // refer to 'bcdef'
        cp = &s[2]     // refer to 'cdef'
        if (*cp == 'c') {
            cp++; [...]
        }
        [...]
    }
Clearly s is a normal length plus data string; it takes up six bytes plus however large the length field is. In the simple case it is basically:
    typedef string struct {
        strsize_t  _len
        char       _data[]
    }
But:
- what type is p, and what is its machine representation?
- is the assignment to cp valid or an error? Is cp a simple pointer, or does it have a more complicated representation?
One possible answer is that p is a char * and that a char * is a simple pointer. However, this is going to be very inconvenient, because it means that any time you refer to something inside a string you lose the length of what you're dealing with and need to carry it around separately. You don't want to make p a string itself, because that would require allocating a variable amount of space and copying string data when p is created. What you need is what Go calls a slice type:
    typedef strslice struct {
        strsize_t  _len
        char       *_dptr
    }
This slice type carries with it the (adjusted) string length and a pointer to the data, instead of actually containing the data. From now on I'm going to assume that we have such a thing, because a language without it is pretty inconvenient (in effect you wind up making a slice type yourself, even if you just explicitly pass around a char * plus a length).
(In the traditional C style, it's your responsibility to manage memory lifetime so you don't free the underlying string as long as slices still point to some of it.)
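To make that concrete, here's a minimal sketch of hand-rolling such a slice type in today's C (the names strslice and slice_from are my own, not any standard API); note that the slice does not own the memory it points into, which is exactly the lifetime issue just mentioned:

    #include <stdio.h>
    #include <string.h>

    struct strslice {
        size_t      len;   /* length of the referenced bytes         */
        const char *ptr;   /* points into someone else's string data */
    };

    /* Make a slice referring to the tail of s starting at offset off. */
    static struct strslice slice_from(const char *s, size_t off) {
        struct strslice sl = { strlen(s) - off, s + off };
        return sl;
    }

    int main(void) {
        const char *s = "abcdef";
        struct strslice p = slice_from(s, 1);    /* refers to 'bcdef' */
        printf("%.*s\n", (int)p.len, p.ptr);     /* prints bcdef      */
        return 0;
    }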
You could make this slice type be what char * is, but then you'll want to create a new type that is a bare pointer to char, without a string length attached to it. And you probably want such a type, and you want to be able to get char * pointers from strings so that you can do operations like stepping through strings with maximum efficiency.
Our slice type has a subtle inconvenience: because it fixes the string length, a slice reference to a string does not resize if the underlying string has been extended (in the C tradition, it doesn't notice if the string has been shrunk so that the slice is now past the end, either). Extending a string in place is a not uncommon C idiom, partly because it is more efficient than constantly allocating and freeing memory; it's a pity that our setup doesn't support that as it stands.
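As a concrete illustration, here's a small sketch of the extend-in-place idiom in ordinary C (my own example, not from any particular program); a pointer or slice taken before the realloc can be left stale, which is exactly the problem just described:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        char *buf = malloc(8);
        if (buf == NULL)
            return 1;
        strcpy(buf, "abcdef");

        char *old = buf + 1;            /* a bare-pointer 'slice': "bcdef" */
        printf("before: %s\n", old);

        char *nbuf = realloc(buf, 16);  /* extend the buffer in place if possible */
        if (nbuf == NULL) {
            free(buf);
            return 1;
        }
        buf = nbuf;
        strcat(buf, "ghijkl");          /* the string grows without a copy in our code */

        /* If realloc moved the data, 'old' now dangles; even if it didn't,
           a slice's stored length would still say 5 while the tail of the
           string is now 11 characters long. */
        printf("after:  %s\n", buf);
        free(buf);
        return 0;
    }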
Since slices are a multi-word type, we also have the problem of how to pass them around when making function calls. We either want our neo-C to have efficient struct passing or to always pass basic pointers to slices, with called functions doing explicit copying if they want to keep them. Oh, and clearly we're going to want a neo-C that has good support for copying multi-word types, as it would be terrible usability to have to write:
    slice s2
    memcpy(&s2, &p, sizeof(p))
(I already kind of assumed this when I wrote 'p := s+1'.)
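Modern C shows roughly what that support looks like: small structs can be passed and returned by value, which is about what our hypothetical neo-C would need for slices. A rough sketch (strslice and drop_first are made-up names):

    #include <stdio.h>
    #include <string.h>

    struct strslice {
        size_t      len;
        const char *ptr;
    };

    /* Passed and returned by value: the caller's slice is untouched. */
    static struct strslice drop_first(struct strslice sl) {
        sl.ptr++;
        sl.len--;
        return sl;
    }

    int main(void) {
        const char *s = "abcdef";
        struct strslice p = { strlen(s), s };
        struct strslice q = drop_first(p);       /* refers to 'bcdef' */
        printf("%.*s then %.*s\n", (int)p.len, p.ptr, (int)q.len, q.ptr);
        return 0;
    }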
The third level of issues is that this makes all of the str* routines relatively special magic, because they must reach inside this complex string type to manipulate the individual fields. We will probably also want to make a lot of the str* functions take slices (or pointers to slices) as arguments, which means either implicitly or explicitly creating slices from (pointers to) strings in various places in the code that people will write.
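To see the shape of this, here's a hypothetical sketch of what a strchr-style routine might look like once it takes and returns slices (slice_chr and strslice are names I've made up, not a real API):

    #include <stdio.h>
    #include <string.h>

    struct strslice {
        size_t      len;
        const char *ptr;
    };

    /* Return a slice starting at the first occurrence of c in sl,
       or an empty slice if c does not appear at all. */
    static struct strslice slice_chr(struct strslice sl, char c) {
        for (size_t i = 0; i < sl.len; i++) {
            if (sl.ptr[i] == c) {
                struct strslice found = { sl.len - i, sl.ptr + i };
                return found;
            }
        }
        struct strslice none = { 0, NULL };
        return none;
    }

    int main(void) {
        const char *s = "abcdef";
        struct strslice sl = { strlen(s), s };
        struct strslice r = slice_chr(sl, 'c');
        printf("%.*s\n", (int)r.len, r.ptr);     /* prints cdef */
        return 0;
    }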
We could get around some of these problems by saying that the string type is actually what I've called the slice type here, i.e. all strings contain a pointer to the string data instead of directly embedding it. But then this adds the overhead of a pointer to all strings, even constant strings.
(It doesn't require separate memory allocations, since you can just put the string data immediately after the slice structure and allocate both at once.)
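Here's a sketch of that single-allocation layout in today's C (make_string and the field names are my own invention): the header is followed immediately by the string bytes, so a single malloc covers both:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct string {
        size_t len;      /* length of the string data         */
        char  *data;     /* points just past this header      */
    };

    static struct string *make_string(const char *src) {
        size_t len = strlen(src);
        /* One allocation: the header with the bytes placed right after it. */
        struct string *s = malloc(sizeof *s + len);
        if (s == NULL)
            return NULL;
        s->len = len;
        s->data = (char *)(s + 1);
        memcpy(s->data, src, len);
        return s;
    }

    int main(void) {
        struct string *s = make_string("hello");
        if (s != NULL) {
            printf("%.*s (%zu bytes of data)\n", (int)s->len, s->data, s->len);
            free(s);
        }
        return 0;
    }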
You can make all of this work, and modern languages do (as do those string libraries for C). But I think it's clear that a neo-C with explicit-length strings would have been significantly more complicated from the get-go than the C that we got, especially very early on in C's lifetime. Null-terminated strings are the easiest approach and they require the least magic from the language and the runtime. Explicit-length strings would also have been less memory efficient and more demanding on what were very small machines, ones where every byte might count.
(They are not just more demanding of data memory; they also require more code to do things like maintain and manipulate string and slice lengths. And this code is invisible unless you make programmers write it all explicitly, by eg not having a slice type.)