Wandering Thoughts archives

2015-12-31

A defense of C's null-terminated strings

In certain circles that I follow, it's lately been popular to frown on C's choice of simple null-terminated strings. Now, there are certainly any number of downsides to C's choice of how to represent strings (including that it makes many string operations not all that efficient, since they need to scan the string to determine its length), but as it happens I think that C made the right choice for the language it was (and is), and especially for the environment it was in. Namely, C is a language relatively close to machine code, especially as it originally was, and it was used on very small machines.

The other real option for representing strings is for them to have an explicit length field at a known spot (usually the start), plus of course their actual data. This is perfectly viable in C (people write libraries to do this), but putting it in the language as the official string representation adds complications, some of them immediately obvious and others somewhat more subtle. So, let's talk about those complications as a way of justifying C's choice.

The obvious issue is how big you make the length field and possibly how you encode it. A fixed-size length field commits you to a maximum string size (if you make it 32 bits you're limited to 4GB strings, for example) and also potentially adds extra overhead for short strings (of which there are many). If you use a variable-sized length encoding, you can somewhat avoid the overhead and size limits, at the cost of more complications when you need to look at the size or work out how much memory a particular sized string needs. C's null byte terminator imposes a fixed and essentially minimal size penalty regardless of the length of the string.

The indirect issue is that moving to a length-first string representation essentially forces you to create two new complex types in the language. To see why this is so, let's write some code in a Go-like language with inferred types:

string s = "abcdef";
void frob() {
   char *cp;

   p := s+1    // refer to 'bcdef'
   cp = &s[2]  // refer to 'cdef'
   if (*cp == 'c') {
      cp++;
      [...]
   }
   [...]
}

Clearly s is a normal length-plus-data string; it takes up six bytes plus however large the length field is. In the simple case it is basically:

typedef string struct {
   strsize_t _len
   char      _data[]
}

But:

  • what type is p, and what is its machine representation?
  • is the assignment to cp valid or an error? Is cp a simple pointer, or does it have a more complicated representation?

One possible answer is that p is a char * and that a char * is a simple pointer. However, this is going to be very inconvenient because it means that any time you refer to something inside a string you lose the length of what you're dealing with and need to carry it around separately. You don't want to make p a string itself, because that would require allocating a variable amount of space and copying string data when p is created. What you need is what Go calls a slice type:

typedef strslice struct {
   strsize_t  _len
   char      *_dptr
}

This slice type carries with it the (adjusted) string length and a pointer to the data, instead of actually containing the data. From now on I'm going to assume that we have such a thing, because a language without it is pretty inconvenient (in effect you wind up making a slice type yourself, even if you just explicitly pass around a char * plus a length).

(In the traditional C style, it's your responsibility to manage memory lifetime so you don't free the underlying string as long as slices still point to some of it.)

You could make this slice type be what char * is, but then you'll want to create a new type that is a bare pointer to char, without a string length attached to it. And you probably want such a type, and you want to be able to get char * pointers from strings so that you can do operations like stepping through strings with maximum efficiency.

Our p slice type has a subtle inconvenience: because it fixes the string length when it is created, a p reference to a string does not resize if the underlying string has been extended (in the C tradition, it doesn't notice if the string has been shrunk so that the p is now past the end, either). Extending a string in place is a not uncommon C idiom, partly because it is more efficient than constantly allocating and freeing memory; it's a pity that our setup doesn't support that as it stands.

Since p is a multi-word type, we also have the problem of how to pass it around when making function calls. We either want our neo-C to have efficient struct passing or to always pass basic pointers to p's, with called functions doing explicit copying if they want to keep them. Oh, and clearly we're going to want a neo-C that has good support for copying multi-word types, as it would be terrible for usability to have to write:

slice s2
memcpy(&s2, &p, sizeof(p))

(I already kind of assumed this when I wrote 'p := s+1'.)

The third level of issues is that this makes all of the str* routines relatively special magic, because they must reach inside this complex string type to manipulate the individual fields. We will probably also want to make a lot of the str* functions take slices (or pointers to slices) as arguments, which means either implicitly or explicitly creating slices from (pointers to) strings in various places in the code that people will write.

We could get around some of these problems by saying that the string type is actually what I've called the slice type here, ie all strings contain a pointer to the string data instead of directly embedding it. But then this adds the overhead of a pointer to all strings, even constant strings.

(It doesn't require separate memory allocations, since you can just put the string data immediately after the slice structure and allocate both at once.)

You can make all of this work and modern languages do (as do those string libraries for C). But I think it's clear that a neo-C with explicit-length strings would have been significantly more complicated from the get-go than the C that we got, especially very early on in C's lifetime. Null-terminated strings are the easiest approach and they require the least magic from the language and the runtime. Explicit-length strings would also have been less memory efficient and more demanding on what were very small machines, ones where every byte might count.

(They are not just more demanding in data memory, they require more code to do things like maintain and manipulate string and slice lengths. And this code is invisible, unless you make programmers code it all explicitly by eg not having a slice type.)

CNullStringsDefense written at 23:38:26

2015-12-25

I like Magit, the GNU Emacs package for Git

As part of slowly improving my GNU Emacs environment, I recently decided to experiment with Magit, a very well-regarded Emacs package for working with Git. I'm not attempting to adopt it as my primary interface to Git (I'm not a 'do everything from inside Emacs' person); instead, I have relatively specific situations where I was hoping magit would be the best way to work with Git. Although I haven't used it extensively yet, I can definitely say that Magit has been great in the areas where I was hoping that it would be.

For me, the big Magit feature and selling point is selectively adding things to the Git index (that is, staging them before committing them). Ordinary command line Git has 'git add -p', but this is a relatively awkward interface that I have had blow up in my face before. Magit offers a very powerful interface for selectively staging changes; you can easily stage not just individual chunks of the diff but even individual lines and collections of them. This is a very powerful way of straightening out a tangle of changes into logical changesets, even if they don't come out in separate chunks.

Part of why Magit makes this relatively easy is that it gives you a good live display of the staged and unstaged diffs as you are working away. This means you get immediate feedback if something looks wrong (or incomplete), and you can also reverse a decision right away. With 'git add -p' I found that I didn't really get this, since it's pretty much a linear process through all the chunks of diffs. In fact being non-linear is another advantage of Magit here; you can see the full set of diffs and skip around it to spot bits that you want to commit first.

Magit also has a relatively nice interface for just making commits. You get an Emacs buffer to edit the commit message (with standard general Emacs features like spell checking) and it shows you the staged diff that will be committed at the same time, which is handy. I can pretty much get all of this at the command line with multiple windows, but if I'm already in Emacs for editing files I often might as well commit from Emacs as well.

I have much less exposure to other Magit features, at least some of which I'm not terribly interested in (at least right now; I may change my mind later). How you invoke Magit commands initially struck me as a bit weird and confusing, but I'm getting used to it with time. People who regularly use Magit will likely find it completely natural.

So, in summary: Magit rocks for selectively staging changes, so much so that I think it's worth turning to GNU Emacs with Magit for this if you need it, even if you regularly edit in something else. If you regularly use GNU Emacs anyways I think the other parts of Magit are useful enough to learn at least the basics of looking at diffs and making straightforward commits.

(Although I spend a lot of time in vi, GNU Emacs remains my favorite editor for working on code in most languages for reasons beyond the scope of this entry.)

MagitPraise written at 01:38:11

2015-12-14

Some things that force Go to call the C library for name resolution on Linux

Traditionally on Unix systems there is no official standard for how to do various sorts of network name lookups, or rather the official standard is 'call the functions in the system C library'. There is generally no actual specification for how name lookup works at the level that would permit you to create an independent implementation (although there is generally documentation on how to configure it). This presents a problem for people who are creating non-C languages: they must either arrange to call the C library (through its limited interfaces for this) or write their own versions, which may not resolve things exactly like the C library does and so can give inconsistent behavior between C programs and programs written in the other language.

Go kind of takes both approaches. As covered in the net package's documentation, Go can both call the C library routines (using cgo) and do its own name lookups in pure Go. Go normally tries to use the pure Go approach as much as possible because it's considered better (partly because cgo calls can be relatively expensive). In theory the pure Go approach should give you the same results as the cgo approach; in practice, the two can behave somewhat differently in some situations, sometimes because of oversights.

(Although the net package's documentation talks only about DNS related lookups, this also affects how at least net.LookupPort() works.)

Go normally attempts to be pretty hyper-intelligent about whether or not it can use its pure Go lookup functions. It makes this decision in part by reading through your /etc/resolv.conf and /etc/nsswitch.conf to see if you're using anything that it doesn't think it can handle. This raises the question of what things in either of these files can accidentally force Go to use cgo calls to the C library instead of its own more efficient (and more consistent across systems) pure Go version. For /etc/resolv.conf, Go understands all of the common things, but anything else will force it to use cgo, including any mistakes you may have lurking in there. For /etc/nsswitch.conf, Go looks at the 'hosts' line, and a few complications can be common on modern Linuxes:

  • if your hosts includes myhostname, only lookups of names with dots in them can be done in pure Go. Because of an implementation quirk, this currently means that net.LookupPort() is forced to use the C library.

    (Some other things are also forced to use the C library, but arguably they should in this situation because they involve hostnames.)

  • if your hosts includes mymachines, all lookups go to the C library. This is probably common on modern systemd-based Linux distributions.

If you're using Go programs and you don't use containers or don't need the magic functionality of mymachines, you may want to strip it out of your nsswitch.conf. If you're like me, you may even be surprised to find it there in the first place. You may not want myhostname either, especially if your host has IP aliases that are most definitely not included in what a name to IP lookup for its hostname should return.

Note that contrary to what you might think, net.LookupPort() (and things that call it to get ports, like net.ResolveTCPAddr()) does not look at the services line in /etc/nsswitch.conf, only the hosts line. And of course the pure Go port lookup only looks at /etc/services (and may not parse it exactly like the C library does). At the moment a missing or unreadable /etc/services seems not to force a fallback to the C library but instead uses a tiny built in default version.

(This probably doesn't affect anyone. I can't imagine that there are very many people who use NIS or otherwise do something funny for services lookups and not hosts lookups.)

You can see what sort of lookup is used for specific queries by setting the GODEBUG environment variable to a verbosity of 2 or more, ie 'GODEBUG=netdns=2'. The resulting report may look something like this:

go package net: dynamic selection of DNS resolver
go package net: hostLookupOrder() = cgo
go package net: hostLookupOrder(smtp.cs) = files,dns

This is on a Linux machine where I set hosts to include myhostname but not mymachines. The first hostLookupOrder() is for looking up the port number for smtp; here the presence of myhostname forced it to resort to cgo. A blank argument to hostLookupOrder() is used by several sorts of lookups, including net.LookupAddr() and net.LookupCNAME().

(This is of course an implementation detail and may change at some point.)

GoNetLookupsCgoAndLinux written at 22:39:43

2015-12-09

Goroutines, network IO streams, and the resulting synchronization problem

These days, I use my Go take on a netcat-like program for all of my casual network (TCP) service testing. One of the things that makes it convenient is that it supports TLS, so I can just do things like 'call tls imap.cs imaps' instead of having to dabble in yet another netcat-like program. However, this TLS support has an important and unfortunate limit right now, which is that it only supports protocols where you do TLS from the start. A certain number of protocols support a during-connection upgrade to TLS, generally via some form of what gets called STARTTLS; the one most relevant to me is SMTP.

I would like to support STARTTLS in call, but this exposes a structural issue. Call is a netcopy program and these are naturally asynchronous in that copying from standard input to the network is (and often should be) completely decoupled from copying from the network to standard output. Call implements this in the straightforward Go way by using two goroutines, one for each copy. This works quite well normally (and is the only real way to support arbitrary protocols), but the moment you add STARTTLS you have a problem.

The problem is that STARTTLS intrinsically requires the input and the output to synchronize (and really you need them to merge). The moment you send STARTTLS, the previously independent to-network and from-network streams must be synchronized and yoked together as TLS exchanges messages with the server at the other end of the connection. In a program based around poll() or select(), this is easy; you can trivially shift from asynchronous IO streams back to synchronous ones while you set up TLS, and then go back to asynchronous streams over the TLS connection. But in a goroutine based system, it's far from clear how to achieve this (especially efficiently), for two reasons.

The first problem is simply establishing synchronization between the sending and receiving sides in some reasonably elegant way. One approach is some sort of explicit variable for 'starting TLS' that is written by the sending side and checked at every IO by the receiving side. In theory another approach is to switch to a model with more goroutines, with one for every IO source and sink and a central manager of some sort. The explicit variable seems not very Go-y, while the central manager raises the issue of how to keep slow IO output to either the network or stdout from blocking the other direction.

The second problem is that mere synchronization is not enough, because there is no way to wake up on incoming network IO without consuming some of it. This matters because when we send our STARTTLS, we want to immediately switch network input handling from its normal mode to 'bring up TLS' mode, without reading any of the pending network input. If we cannot switch right away, we will wake up having .Read() some amount of the STARTTLS response, which must then be handled somehow.

(Hopefully all STARTTLS-enabled protocols have a server response in plaintext that comes before any TLS protocol messages. If the first thing we may get back from the network in response to our STARTTLS is part of the server TLS handshake, I'd have to somehow relay that initial network data to tls.Client() as (fake) network input. The potential headaches there are big.)

I don't have any answers here and I don't know what the best solution is; I just have my usual annoyance that Go forces asynchronous (network) IO into a purely goroutine based architecture.

(The clear pragmatic answer is to not even try to support STARTTLS in call and to leave that to other tools, at least until I wind up with a burning need for it. But that kind of annoys me.)

GoStreamSynchronizationProblem written at 23:19:00



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.