== A defense of C's null-terminated strings In certain circles that I follow, it's lately been popular to frown on C's choice of simple null-terminated strings. Now, there are certainly any number of downsides to C's choice of how to represent strings (including that it makes any number of string operations not all that efficient, since they need to scan the string to determine its length), but as it happens I think that C made the right choice for the language it was (and is), and especially for the environment it was in. Namely, C is a relatively close to machine code language, especially in the beginning, and it was used on very small machines. The other real option for representing strings is for them to have an explicit length field at a known spot (usually the start), plus of course their actual data. This is perfectly viable in C ([[people write libraries to do this /Software/bstring]]), but putting it in the language as the official string representation adds complications, some of them immediately obvious and others somewhat more subtle. So, let's talk about those complications as a way of justifying C's choice. The obvious issue is how big you make the length field and possibly how you encode it. A fixed size length field commits you to a maximum string size (if you make it 32 bits you're limited to 4GB strings, for example) and also potentially adds extra overhead for short strings (of which there are many). If you use a variable sized length encoding, you can somewhat avoid overhead and size limits at the cost of more complications when you need to look at the size or work out how much memory a particular sized string needs. C's null byte terminator imposes a fixed and essentially minimal size penalty regardless of the length of the string. The indirect issue is that moving to a length-first string representation essentially forces you to create two new complex types in the language. To see why this is so, let's write some code in a Go-like language with inferred types: .pn prewrap on > string s = "abcdef"; > > void frob() { > char *cp > > p := s+1 // refer to 'bcdef' > cp = &s[2] // refer to 'cdef' > if (*cp == 'c') { > cp++; > [...] > } > [...] > } Clearly _s_ is a normal length plus data string; it takes up six bytes plus however large the length field is. In the simple case it is basically: > typedef string struct { > strsize_t _len > char _data[] > } But: * what type is _p_, and what is its machine representation? * is the assignment to _cp_ valid or an error? Is _cp_ a simple pointer, or does it have a more complicated representation? One possible answer is that _p_ is a ((char *)) and that a ((char *)) is a simple pointer. However, this is going to be very inconvenient because it means that any time you refer to something inside a string you lose the length of what you're dealing with and need to carry it around separately. You don't want to make _p_ a _string_ itself, because that would require allocating a variable amount of space and copying string data when _p_ is created. What you need is what Go calls a slice type: > typedef strslice struct { > strsize_t _len > char *_dptr > } This slice type carries with it the (adjusted) string length and a pointer to the data, instead of actually containing the data. From now on I'm going to assume that we have such a thing, because a language without it is pretty inconvenient (in effect you wind up making a slice type yourself, even if you just explicitly pass around a ((char *)) plus a length). (In the traditional C style, it's your responsibility to manage memory lifetime so you don't free the underlying string as long as slices still point to some of it.) You could make this slice type be what ((char *)) is, but then you'll want to create a new type that is a bare pointer to _char_, without a string length attached to it. And you probably want such a type, and you want to be able to get ((char *)) pointers from strings so that you can do operations like stepping through strings with maximum efficiency. Our _p_ type has a subtle inconvenience: because it fixes the string length, a _p_ reference to a string does not resize if the underlying string has been extended (in the C tradition it doesn't notice if the string has been shrunk so the _p_ is now past the end, either). Extending a string in place is a not uncommon C idiom, partly because it is more efficient than constantly allocating and freeing memory; it's a pity that our setup doesn't support that as it stands. Since _p_ is a multi-word type, we also have the problem of how to pass it around when making function calls. We either want our neo-C to have efficient _struct_ passing or to always pass basic pointers to _p_'s, with called functions doing explicit copying if they want to keep them. Oh, and clearly we're going to want a neo-C that has good support for copying multi-word types, as it would be terrible usability to have to write: > slice s2 > memcpy(&s2, &p, sizeof(p)) (I already kind of assumed this when I wrote '((p := s+1))'.) The third level of issues is that this makes all of the ((str*)) routines relatively special magic, because they must reach inside this complex _string_ type to manipulate the individual fields. We will probably also want to make a lot of the ((str*)) functions take slices (or pointers to slices) as arguments, which means either implicitly or explicitly creating slices from (pointers to) strings in various places in the code that people will write. We could get around some of these problems by saying that the _string_ type is actually what I've called the slice type here, ie all strings contain a pointer to the string data instead of directly embedding it. But then this adds the overhead of a pointer to all strings, even constant strings. (It doesn't require separate memory allocations, since you can just put the string data immediately after the slice structure and allocate both at once.) You can make all of this work and modern languages do (as do those string libraries for C). But I think it's clear that a neo-C with explicit-length strings would have been significantly more complicated from the get go than the C that we got, especially very early on in C's lifetime. Null terminated strings are the easiest approach and they require the least magic from the language and the runtime. Explicit length strings would also have been less memory efficient and more demanding on what were very small machines, ones where every byte might count. (They are not just more demanding in data memory, they require more code to do things like maintain and manipulate string and slice lengths. And this code is invisible, unless you make programmers code it all explicitly by eg not having a slice type.)