2015-12-31
A defense of C's null-terminated strings
In certain circles that I follow, it's lately been popular to frown on C's choice of simple null-terminated strings. Now, there are certainly any number of downsides to C's choice of how to represent strings (including that it makes many string operations not all that efficient, since they need to scan the string to determine its length), but as it happens I think that C made the right choice for the language it was (and is), and especially for the environment it was in. Namely, C is a language that sits relatively close to machine code, especially in the beginning, and it was used on very small machines.
The other real option for representing strings is for them to have an explicit length field at a known spot (usually the start), plus of course their actual data. This is perfectly viable in C (people write libraries to do this), but putting it in the language as the official string representation adds complications, some of them immediately obvious and others somewhat more subtle. So, let's talk about those complications as a way of justifying C's choice.
The obvious issue is how big you make the length field and possibly how you encode it. A fixed size length field commits you to a maximum string size (if you make it 32 bits you're limited to 4GB strings, for example) and also potentially adds extra overhead for short strings (of which there are many). If you use a variable sized length encoding, you can somewhat avoid overhead and size limits at the cost of more complications when you need to look at the size or work out how much memory a particular sized string needs. C's null byte terminator imposes a fixed and essentially minimal size penalty regardless of the length of the string.
The indirect issue is that moving to a length-first string representation essentially forces you to create two new complex types in the language. To see why this is so, let's write some code in a Go-like language with inferred types:
    string s = "abcdef";

    void frob() {
        char *cp

        p := s+1        // refer to 'bcdef'
        cp = &s[2]      // refer to 'cdef'
        if (*cp == 'c') {
            cp++;
            [...]
        }
        [...]
    }
Clearly s is a normal length plus data string; it takes up six
bytes plus however large the length field is. In the simple case
it is basically:
    typedef string struct {
        strsize_t _len
        char _data[]
    }
But:
- what type is p, and what is its machine representation?
- is the assignment to cp valid or an error? Is cp a simple pointer, or does it have a more complicated representation?
One possible answer is that p is a char * and that a
char * is a simple pointer. However, this is going to be very
inconvenient because it means that any time you refer to something
inside a string you lose the length of what you're dealing with and
need to carry it around separately. You don't want to make p a
string itself, because that would require allocating a variable
amount of space and copying string data when p is created. What
you need is what Go calls a slice type:
    typedef strslice struct {
        strsize_t _len
        char *_dptr
    }
This slice type carries with it the (adjusted) string length and a
pointer to the data, instead of actually containing the data. From
now on I'm going to assume that we have such a thing, because a
language without it is pretty inconvenient (in effect you wind up
making a slice type yourself, even if you just explicitly pass around
a char * plus a length).
(In the traditional C style, it's your responsibility to manage memory lifetime so you don't free the underlying string as long as slices still point to some of it.)
You could make this slice type be what char * is, but then
you'll want to create a new type that is a bare pointer to char,
without a string length attached to it. And you probably want such
a type, and you want to be able to get char * pointers from
strings so that you can do operations like stepping through strings
with maximum efficiency.
Our p type has a subtle inconvenience: because it fixes the string
length, a p reference to a string does not resize if the underlying
string has been extended (in the C tradition it doesn't notice if
the string has been shrunk so the p is now past the end, either).
Extending a string in place is a not uncommon C idiom, partly because
it is more efficient than constantly allocating and freeing memory;
it's a pity that our setup doesn't support that as it stands.
Since p is a multi-word type, we also have the problem of how to
pass it around when making function calls. We either want our neo-C
to have efficient struct passing or to always pass basic pointers
to p's, with called functions doing explicit copying if they want
to keep them. Oh, and clearly we're going to want a neo-C that has
good support for copying multi-word types, as it would be terrible
usability to have to write:
    slice s2
    memcpy(&s2, &p, sizeof(p))
(I already kind of assumed this when I wrote 'p := s+1'.)
The third level of issues is that this makes all of the str*
routines relatively special magic, because they must reach inside
this complex string type to manipulate the individual fields. We
will probably also want to make a lot of the str* functions
take slices (or pointers to slices) as arguments, which means either
implicitly or explicitly creating slices from (pointers to) strings
in various places in the code that people will write.
We could get around some of these problems by saying that the
string type is actually what I've called the slice type here, ie
all strings contain a pointer to the string data instead of directly
embedding it. But then this adds the overhead of a pointer to all
strings, even constant strings.
(It doesn't require separate memory allocations, since you can just put the string data immediately after the slice structure and allocate both at once.)
You can make all of this work and modern languages do (as do those string libraries for C). But I think it's clear that a neo-C with explicit-length strings would have been significantly more complicated from the get-go than the C that we got, especially very early on in C's lifetime. Null-terminated strings are the easiest approach and they require the least magic from the language and the runtime. Explicit length strings would also have been less memory efficient and more demanding on what were very small machines, ones where every byte might count.
(They are not just more demanding in data memory, they require more code to do things like maintain and manipulate string and slice lengths. And this code is invisible, unless you make programmers code it all explicitly by eg not having a slice type.)
2015-12-25
I like Magit, the GNU Emacs package for Git
As part of slowly improving my GNU Emacs environment, I recently decided to experiment with magit, a very well regarded Emacs package for working with Git. I'm not attempting to adopt it as my primary interface to Git (I'm not a 'do everything from inside Emacs' person); instead, I have relatively specific situations where I was hoping magit would be the best way to work with Git. Although I haven't used it extensively yet, I can definitely say that Magit has been great in the areas where I was hoping that it would be.
For me, the big Magit feature and selling point is selectively
adding things to the Git index (that is, staging them before
committing them). Ordinary command line Git has 'git add -p', but
this is a relatively awkward interface that I have had blow up in
my face before. Magit offers a very powerful interface for selectively
staging changes; you can easily stage not just individual chunks
of the diff but even individual lines and collections of them. This
is a very powerful way of straightening out a tangle of changes
into logical changesets, even if they don't come out in separate
chunks.
Part of why Magit makes this relatively easy is that it gives you
a good live display of the staged and unstaged diffs as you are
working away. This means you get immediate feedback if something
looks wrong (or incomplete), and you can also reverse a decision
right away. With 'git add -p' I found that I didn't really get
this, since it's pretty much a linear process through all the chunks
of diffs. In fact being non-linear is another advantage of Magit
here; you can see the full set of diffs and skip around it to spot
bits that you want to commit first.
Magit also has a relatively nice interface for just making commits. You get an Emacs buffer to edit the commit message (with standard general Emacs features like spell checking) and it shows you the staged diff that will be committed at the same time, which is handy. I can pretty much get all of this at the command line with multiple windows, but if I'm already in Emacs for editing files I often might as well commit from Emacs as well.
I have much less exposure to other Magit features, at least some of which I'm not terribly interested in (at least right now, I may change my mind later). How you invoke Magit commands initially struck me as a bit weird and confusing, but I'm getting used to it with time. People who regularly use Magit will likely find it completely natural.
So, in summary: Magit rocks for selectively staging changes, so much so that I think it's worth turning to GNU Emacs with Magit for this if you need it, even if you regularly edit in something else. If you regularly use GNU Emacs anyways I think the other parts of Magit are useful enough to learn at least the basics of looking at diffs and making straightforward commits.
(Although I spend a lot of time in vi, GNU Emacs remains my favorite editor for working on code in most languages for reasons beyond the scope of this entry.)
2015-12-14
Some things that force Go to call the C library for name resolution on Linux
Traditionally on Unix systems there is no official standard for how to do various sorts of network name lookups, or rather the official standard is 'call the functions in the system C library'. There is generally no actual specification for how name lookup works at the level that would permit you to create an independent implementation (although there is generally documentation on how to configure it). This presents a problem for people who are creating non-C languages; they must either arrange to call the C library (through the C library's limited interfaces for this) or write their own versions that may not resolve things exactly like the C library does, so that you get inconsistent behavior between C programs and programs written in the other language.
Go kind of takes both approaches. As covered in the net package's documentation, Go can both call the C library routines (using cgo) and do its own name lookups in pure Go. Go normally tries to use the pure Go approach as much as possible because it's considered better (partly because cgo calls can be relatively expensive). In theory the pure Go approach should give you the same results as the cgo approach; in practice, the two can behave somewhat differently in some situations, sometimes because of oversights.
(Although the net package's documentation talks only about DNS
related lookups, this also affects how at least net.LookupPort()
works.)
Go normally attempts to be pretty hyper-intelligent about whether or
not it can use its pure Go lookup functions. It makes this decision
in part by reading through your /etc/resolv.conf and /etc/nsswitch.conf
to see if you're using anything that it doesn't think it can handle.
This raises the question of what things in either of these files
can accidentally force Go to use cgo calls to the C library, instead
of its own more efficient (and more consistent across systems) pure
Go version. For /etc/resolv.conf, Go understands all of the common
things but anything else will force it to cgo, including any mistakes
you may have lurking in there. For /etc/nsswitch.conf, Go looks at
the 'hosts' line and a few complications can be common on modern
Linuxes:
- if your hosts includes myhostname, only lookups of names with dots in them can be done in pure Go. Because of an implementation quirk, this currently means that net.LookupPort() is forced to use the C library. (Some other things are also forced to use the C library, but arguably they should be in this situation because they involve hostnames.)
- if your hosts includes mymachines, all lookups go to the C library. This is probably common on modern systemd-based Linux distributions.
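For reference, a hosts line with neither module, which Go's pure resolver can handle entirely on its own, might look something like this (adjust for your own system's actual needs):

```
hosts: files dns
```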
If you're using Go programs and you don't use containers or don't
need the magic functionality of mymachines, you may want to
strip it out of your nsswitch.conf. If you're like me, you may
even be surprised to find it there in the first place. You may not
want myhostname either, especially if your host has IP aliases
that are most definitely not included in what a name to IP lookup
for its hostname should return.
Note that contrary to what you might think, net.LookupPort() (and
things that call it to get ports, like net.ResolveTCPAddr()) does
not look at the services line in /etc/nsswitch.conf, only the
hosts line. And of course the pure Go port lookup only looks at
/etc/services (and may not parse it exactly like the C library
does). At the moment a missing or unreadable /etc/services seems
not to force a fallback to the C library but instead uses a tiny
built in default version.
(This probably doesn't affect anyone. I can't imagine that there
are very many people who use NIS or otherwise do something funny
for services lookups and not hosts lookups.)
You can see what sort of lookup is used for specific queries by
setting the GODEBUG environment variable to a verbosity of 2 or
more, ie 'GODEBUG=netdns=2'. The resulting report may look something
like this:
    go package net: dynamic selection of DNS resolver
    go package net: hostLookupOrder() = cgo
    go package net: hostLookupOrder(smtp.cs) = files,dns
This is on a Linux machine where I set hosts to include myhostname
but not mymachines. The first hostLookupOrder() is for looking
up the port number for smtp; here the presence of myhostname
forced it to resort to cgo. A blank argument to hostLookupOrder()
is used by several sorts of lookups, including net.LookupAddr()
and net.LookupCNAME().
(This is of course an implementation detail and may change at some point.)
2015-12-09
Goroutines, network IO streams, and the resulting synchronization problem
These days, I use my Go take on a netcat-like program for all of my casual network
(TCP) service testing. One of the things that makes it convenient
is that it supports TLS, so I can just do things like 'call tls
imap.cs imaps' instead of having to dabble in yet another netcat-like
program. However, this TLS support has an important and unfortunate
limit right now, which is that it only supports protocols where you
do TLS from the start. A certain number of protocols support a
during-connection upgrade to TLS, generally via some form of what
gets called STARTTLS;
the one most relevant to me is SMTP.
I would like to support STARTTLS in call, but this exposes a structural issue. Call is a netcopy program and these are naturally asynchronous in that copying from standard input to the network is (and often should be) completely decoupled from copying from the network to standard output. Call implements this in the straightforward Go way by using two goroutines, one for each copy. This works quite well normally (and is the only real way to support arbitrary protocols), but the moment you add STARTTLS you have a problem.
The problem is that STARTTLS intrinsically requires the input and
the output to synchronize (and really you need them to merge). The
moment you send STARTTLS, the previously independent to-network and
from-network streams must be synchronized and yoked together as TLS
exchanges messages with the server at the other end of the connection.
In a program based around poll() or select(), this is easy; you
can trivially shift from asynchronous IO streams back to synchronous
ones while you set up TLS, and then go back to asynchronous streams
over the TLS connection. But in a goroutine based system, it's far
from clear how to achieve this (especially efficiently), for two
reasons.
The first problem is simply establishing synchronization between the sending and receiving sides in some reasonably elegant way. One approach is some sort of explicit variable for 'starting TLS' that is written by the sending side and checked at every IO by the receiving side. In theory another approach is to switch to a model with more goroutines, with one for every IO source and sink and a central manager of some sort. The explicit variable seems not very Go-y, while the central manager raises the issue of how to keep slow IO output to either the network or stdout from blocking the other direction.
The second problem is that mere synchronization is not enough,
because there is no way to wake up on incoming network IO without
consuming some of it. This matters because when we send our STARTTLS,
we want to immediately switch network input handling from its
normal mode to 'bring up TLS' mode, without reading any of the
pending network input. If we cannot switch right away, we will wake
up having .Read() some amount of the STARTTLS response, which
must then be handled somehow.
(Hopefully all STARTTLS-enabled protocols have a server response
in plaintext that comes before any TLS protocol messages. If the
first thing we may get back from the network in response to our
STARTTLS is part of the server TLS handshake, I'd have to somehow
relay that initial network data to tls.Client() as (fake) network
input. The potential headaches there are big.)
I don't have any answers here and I don't know what the best solution is; I just have my usual annoyance that Go forces asynchronous (network) IO into a purely goroutine based architecture.
(The clear pragmatic answer is to not even try to support STARTTLS in call and to leave that to other tools, at least until I wind up with a burning need for it. But that kind of annoys me.)