A defense of C's null-terminated strings

December 31, 2015

In certain circles that I follow, it's lately been popular to frown on C's choice of simple null-terminated strings. Now, there are certainly any number of downsides to C's choice of how to represent strings (including that it makes any number of string operations not all that efficient, since they need to scan the string to determine its length), but as it happens I think that C made the right choice for the language it was (and is), and especially for the environment it was in. Namely, C is a language that sits relatively close to machine code, especially as it was in the beginning, and it was used on very small machines.

The other real option for representing strings is for them to have an explicit length field at a known spot (usually the start), plus of course their actual data. This is perfectly viable in C (people write libraries to do this), but putting it in the language as the official string representation adds complications, some of them immediately obvious and others somewhat more subtle. So, let's talk about those complications as a way of justifying C's choice.

The obvious issue is how big you make the length field and possibly how you encode it. A fixed size length field commits you to a maximum string size (if you make it 32 bits you're limited to 4GB strings, for example) and also potentially adds extra overhead for short strings (of which there are many). If you use a variable sized length encoding, you can somewhat avoid overhead and size limits at the cost of more complications when you need to look at the size or work out how much memory a particular sized string needs. C's null byte terminator imposes a fixed and essentially minimal size penalty regardless of the length of the string.

The indirect issue is that moving to a length-first string representation essentially forces you to create two new complex types in the language. To see why this is so, let's write some code in a Go-like language with inferred types:

string s = "abcdef";
void frob() {
   char *cp;

   p := s+1    // refer to 'bcdef'
   cp = &s[2]  // refer to 'cdef'
   if (*cp == 'c') {
      ...
   }
}

Clearly s is a normal length plus data string; it takes up six bytes plus however large the length field is. In the simple case it is basically:

typedef struct {
   strsize_t _len;
   char      _data[];
} string;

This raises two immediate questions:

  • what type is p, and what is its machine representation?
  • is the assignment to cp valid or an error? Is cp a simple pointer, or does it have a more complicated representation?

One possible answer is that p is a char * and that a char * is a simple pointer. However, this is going to be very inconvenient because it means that any time you refer to something inside a string you lose the length of what you're dealing with and need to carry it around separately. You don't want to make p a string itself, because that would require allocating a variable amount of space and copying string data when p is created. What you need is what Go calls a slice type:

typedef struct {
   strsize_t  _len;
   char      *_dptr;
} strslice;

This slice type carries with it the (adjusted) string length and a pointer to the data, instead of actually containing the data. From now on I'm going to assume that we have such a thing, because a language without it is pretty inconvenient (in effect you wind up making a slice type yourself, even if you just explicitly pass around a char * plus a length).

(In the traditional C style, it's your responsibility to manage memory lifetime so you don't free the underlying string as long as slices still point to some of it.)

You could make this slice type be what char * is, but then you'll want to create a new type that is a bare pointer to char, without a string length attached to it. And you probably want such a type, and you want to be able to get char * pointers from strings so that you can do operations like stepping through strings with maximum efficiency.

Our p type has a subtle inconvenience: because it fixes the string length, a p reference to a string does not resize if the underlying string has been extended (in the C tradition it doesn't notice if the string has been shrunk so the p is now past the end, either). Extending a string in place is a not uncommon C idiom, partly because it is more efficient than constantly allocating and freeing memory; it's a pity that our setup doesn't support that as it stands.

Since p is a multi-word type, we also have the problem of how to pass it around when making function calls. We either want our neo-C to have efficient struct passing or to always pass basic pointers to p's, with called functions doing explicit copying if they want to keep them. Oh, and clearly we're going to want a neo-C that has good support for copying multi-word types, as it would be terrible for usability to have to write:

strslice s2;
memcpy(&s2, &p, sizeof(p));

(I already kind of assumed this when I wrote 'p := s+1'.)

The third level of issues is that this makes all of the str* routines relatively special magic, because they must reach inside this complex string type to manipulate the individual fields. We will probably also want to make a lot of the str* functions take slices (or pointers to slices) as arguments, which means either implicitly or explicitly creating slices from (pointers to) strings in various places in the code that people will write.

We could get around some of these problems by saying that the string type is actually what I've called the slice type here, ie all strings contain a pointer to the string data instead of directly embedding it. But then this adds the overhead of a pointer to all strings, even constant strings.

(It doesn't require separate memory allocations, since you can just put the string data immediately after the slice structure and allocate both at once.)

You can make all of this work, and modern languages do (as do those string libraries for C). But I think it's clear that a neo-C with explicit-length strings would have been significantly more complicated from the get-go than the C that we got, especially very early on in C's lifetime. Null-terminated strings are the easiest approach and they require the least magic from the language and the runtime. Explicit-length strings would also have been less memory efficient and more demanding on what were very small machines, ones where every byte might count.

(They are not just more demanding in data memory, they require more code to do things like maintain and manipulate string and slice lengths. And this code is invisible, unless you make programmers code it all explicitly by eg not having a slice type.)

Comments on this page:

By Alan at 2016-01-01 11:31:56:

"C should have done more work up front" is basically the argument for the "MIT Approach". It's similar to Tony Hoare's "billion dollar mistake", the idea of a NULL reference (pointer).

Once you've de-prioritized safety, I think you're right. Nul termination is:

1. Simplest. Only one string type, serving double-duty as an iterator - in some ways like LISP-style lists. Stored as a single pointer. Wait, the language already has a pointer type... and we're done. ("There is no string type in C").

2. Compact (in RAM).

I interpret the complaint as "why are we still using a stupid language that totally de-prioritizes safety? Can we use a better language for this project?"

I think you could add a general pointer+length type, for when checking correctness becomes more significant. When you write application software in it, perhaps :). D is the first example of array slices that I can think of, so this might be slightly a-historical.

It would be a bigger language, but it could still be C.

The problem that comes to my mind is bounds-checking overhead... the bulk of the generated code probably bloats up with all the implicit operations. You don't have an optimizing compiler yet, it's supposed to be portable assembly. If the programmer wants optimization, they have to do it themselves.

Then to succeed at a similar level to "worse is better", you're waiting for portable optimizing compilers. (GCC in 1990? Also note GNU is explicitly written to take advantage of larger memory sizes).

While we're there, why not non-null references and a Nullable wrapper (Option/Maybe ADT) as well? I think those ideas are more clearly a-historical. However the implementation might be easier. C++ is currently trying for all of this and more (C++ "Core Guidelines" draft).

You have to wait somewhere around '95-'05, before it became impossible to deny just how dangerous the language is without them. Would have been real nice if we had a drop-in solution for it.

"The obvious issue is how big you make the length field and possibly how you encode it. A fixed size length field commits you to a maximum string size (if you make it 32 bits you're limited to 4GB strings, for example)"

This is a pretty minor part of a post I otherwise mostly agree with, but I don't think this particular one is a real issue. C strings already commit to a maximum string size by virtue of being addressed through a fixed-size char pointer. So if you just make the length field the same size as whatever a char pointer is on your platform, you haven't added a further limit.

By cks at 2016-01-02 15:59:42:

You can certainly avoid having any real limit by making the length field sufficiently big, but then this comes at the cost of wasting space when you have small strings. If you commit to a 64-bit length field on a 64-bit machine, strings of 255 characters or less have 7 bytes of overhead, under-64K strings have six bytes of overhead, and so on. And most of your strings are likely to be short, so you're paying a relatively high overhead for the rarely used possibility of really long strings.

On modern machines this may not be a big issue. But C was originally written on machines with 64 KB of data space (and often less RAM), and there I suspect that a one-byte overhead on many or almost all strings would have made people grumpy.

By cks at 2016-01-02 16:03:16:

Alan: my assumption is that this neo-C would not do bounds checks on string operations any more than C does bounds checks on operations on fixed size arrays. The explicit length field would be used purely to make strlen() fast (along with anything that depends on it) and to allow null bytes to occur in strings.

Considering that Forth is an earlier language intended and designed to run on more constrained hardware than C, I feel it can be used as a counterexample.

In ANSI Forth, s" is generally the preferred word to create strings. It stores the string itself and a cell for the length, both of which are returned on the stack for later use. A cell must be at least 16 bits large. s" is a parsing word and can't be used to define nontextual data.

A conforming program is not allowed to store into or modify such strings.

Now, why it is that C can't afford this cost, yet Forth can, is a mystery I will try to explain.

Forth, unlike C, is untyped, and the distinction between valid representations of data for different operations can be blurred. This is brought up because Forth does not actually define a general array type. You can allocate data and use pointers to it, but the rest is up to the programmer. C, having a dizzying number of types simply for numbers, could have very easily defined different types for string lengths, but did not.

Perhaps Dennis Ritchie just wasn't interested in time-efficiency.

This could also be for a variety of other reasons, such as C's type system causing array decay or a C procedure's inability to return multiple values.

Regardless, the fact remains that C is very strange in the realm of efficient and lower-level programming languages, which find the space for the length of their arrays.
