Wandering Thoughts archives

2015-12-31

A defense of C's null-terminated strings

In certain circles that I follow, it's lately been popular to frown on C's choice of simple null-terminated strings. Now, there are certainly any number of downsides to C's choice of how to represent strings (including that it makes any number of string operations not all that efficient, since they need to scan the string to determine its length), but as it happens I think that C made the right choice for the language it was (and is), and especially for the environment it was in. Namely, C is a relatively close to machine code language, especially in the beginning, and it was used on very small machines.

The other real option for representing strings is for them to have an explicit length field at a known spot (usually the start), plus of course their actual data. This is perfectly viable in C (people write libraries to do this), but putting it in the language as the official string representation adds complications, some of them immediately obvious and others somewhat more subtle. So, let's talk about those complications as a way of justifying C's choice.

The obvious issue is how big you make the length field and possibly how you encode it. A fixed size length field commits you to a maximum string size (if you make it 32 bits you're limited to 4GB strings, for example) and also potentially adds extra overhead for short strings (of which there are many). If you use a variable sized length encoding, you can somewhat avoid overhead and size limits at the cost of more complications when you need to look at the size or work out how much memory a particular sized string needs. C's null byte terminator imposes a fixed and essentially minimal size penalty regardless of the length of the string.

The indirect issue is that moving to a length-first string representation essentially forces you to create two new complex types in the language. To see why this is so, let's write some code in a Go-like language with inferred types:

string s = "abcdef";
void frob() {
   char *cp

   p := s+1    // refer to 'bcdef'
   cp = &s[2]  // refer to 'cdef'
   if (*cp == 'c') {
      cp++;
      [...]
   }
   [...]
}

Clearly s is a normal length plus data string; it takes up six bytes plus however large the length field is. In the simple case it is basically:

typedef string struct {
   strsize_t _len
   char      _data[]
}

But:

  • what type is p, and what is its machine representation?
  • is the assignment to cp valid or an error? Is cp a simple pointer, or does it have a more complicated representation?

One possible answer is that p is a char * and that a char * is a simple pointer. However, this is going to be very inconvenient because it means that any time you refer to something inside a string you lose the length of what you're dealing with and need to carry it around separately. You don't want to make p a string itself, because that would require allocating a variable amount of space and copying string data when p is created. What you need is what Go calls a slice type:

typedef strslice struct {
   strsize_t  _len
   char      *_dptr
}

This slice type carries with it the (adjusted) string length and a pointer to the data, instead of actually containing the data. From now on I'm going to assume that we have such a thing, because a language without it is pretty inconvenient (in effect you wind up making a slice type yourself, even if you just explicitly pass around a char * plus a length).

(In the traditional C style, it's your responsibility to manage memory lifetime so you don't free the underlying string as long as slices still point to some of it.)

You could make this slice type be what char * is, but then you'll want to create a new type that is a bare pointer to char, without a string length attached to it. And you probably want such a type, and you want to be able to get char * pointers from strings so that you can do operations like stepping through strings with maximum efficiency.

Our p type has a subtle inconvenience: because it fixes the string length, a p reference to a string does not resize if the underlying string has been extended (in the C tradition it doesn't notice if the string has been shrunk so the p is now past the end, either). Extending a string in place is a not uncommon C idiom, partly because it is more efficient than constantly allocating and freeing memory; it's a pity that our setup doesn't support that as it stands.

Since p is a multi-word type, we also have the problem of how to pass it around when making function calls. We either want our neo-C to have efficient struct passing or to always pass basic pointers to p's, with called functions doing explicit copying if they want to keep them. Oh, and clearly we're going to want a neo-C that has good support for copying multi-word types, as it would be terrible usability to have to write:

slice s2
memcpy(&s2, &p, sizeof(p))

(I already kind of assumed this when I wrote 'p := s+1'.)

The third level of issues is that this makes all of the str* routines relatively special magic, because they must reach inside this complex string type to manipulate the individual fields. We will probably also want to make a lot of the str* functions take slices (or pointers to slices) as arguments, which means either implicitly or explicitly creating slices from (pointers to) strings in various places in the code that people will write.

We could get around some of these problems by saying that the string type is actually what I've called the slice type here, ie all strings contain a pointer to the string data instead of directly embedding it. But then this adds the overhead of a pointer to all strings, even constant strings.

(It doesn't require separate memory allocations, since you can just put the string data immediately after the slice structure and allocate both at once.)

You can make all of this work and modern languages do (as do those string libraries for C). But I think it's clear that a neo-C with explicit-length strings would have been significantly more complicated from the get go than the C that we got, especially very early on in C's lifetime. Null terminated strings are the easiest approach and they require the least magic from the language and the runtime. Explicit length strings would also have been less memory efficient and more demanding on what were very small machines, ones where every byte might count.

(They are not just more demanding in data memory, they require more code to do things like maintain and manipulate string and slice lengths. And this code is invisible, unless you make programmers code it all explicitly by eg not having a slice type.)

programming/CNullStringsDefense written at 23:38:26; Add Comment

How I've wound up taking my notes

If you're going to take and keep notes, you need some way to actually do this. I'm not going to claim that my system is at all universal; it's simply what has worked so far for me. I'm a brute force Unix kind of person, so my system is built on simple basic Unix things.

My basic form of note taking is plain ASCII in Unix files. I don't version control them (although maybe I should), but instead I treat them as logs where I only append new information to the end rather than rewriting existing sections (although sometimes I'll add an update note directly in place next to some information I later found out was incorrect). The honest reason why I take this append only approach is that it's easier to write, but I can justify with it as creating a useful record of my thinking at the time.

(The exception to this is when I'm writing and testing things like build instructions or migration checklists. If my output is going to be documentation for other people, it obviously has to get rewritten in place so the end result is coherent.)

Some but not all of the time I will date new additions to files (in simple '2015-12-30' form). This helps me keep track of when I did something and also how long it's been since I worked on something. Although I often used to just summarize the commands I was using and the output I was getting, I have tried lately to literally copy and paste both commands and output in. I've found that this is handy for being lazy when repeating things; I can just copy-and-paste commands from the file into a terminal window.

I use file names that make sense to me, although not necessarily to anyone else. Typical file names are things like bsdtcp-restores and cs8-oldmail-weirdness. Often I'll put a summary of the project or issue at the top of a file, so that if (or when) I look at it much later I can remember what it was about. For lab notebook stuff I tend to put the date of the initial incident in the filename, but I'm kind of inconsistent in this.

I've found it useful to segregate my notes files into more or less three directories. One is for (active) projects, one is for general notes on various things, and one is for lab notebook stuff done during (semi-)crisis situations. In all of those directories I have subdirectories for files that are complete, or over, or now obsolete for various reasons. All of these old files remain valuable so I keep them, but I try to keep the top level directories only having current things (especially for the projects directory). I sometimes rename files when I move them into subdirectories because I realize that my initial file name is not a good one for future reference (often it turns out to be too generic).

I don't currently keep any of these files under version control. Maybe I should, but at the moment it feels like overkill given that I never want to delete things and I don't really have a situation where I want 'revert to (or look at) previous version of a file'. Many things are implicitly versioned just by me having multiple files and starting new ones for new situations, even if I copy things from an older file.

(For example, the test plan for upgrading our mail server from Ubuntu 10.04 to 12.04 is in a different file than the test plan for upgrading it from 12.04 to 14.04, even though I created the latter from the former.)

As to where these files all live: their master location is in my home directory on our fileservers. On the rare occasion that I need to refer to or work on one of them when our fileservers or our Linux servers are going to be down, I rsync the relevant file to my office workstation and work on it there, then rsync it back afterwards. These are fortunately not the sort of notes that I'd be looking at if our entire infrastructure fell down. Putting them in my fileserver home directory means that they're automatically available on all of our Unix servers and they get backed up via our backup system and so on.

As for the editor I use, well, vi is my sysadmin editor. But the choice of editor doesn't really matter here (and sometimes I use others).

PS: I'm lucky enough that none of my notes files need to be kept so secret that they need to be encrypted. I don't know what I'd do if I needed that for some of my notes, and given that encryption is generally a pain I hope that I never have to find out.

sysadmin/HowITakeNotes written at 01:25:19; Add Comment


Page tools: See As Normal.
Search:
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.