Wandering Thoughts archives

2014-09-30

Don't split up error messages in your source code

Every so often, developers come up with really clever ways to frustrate system administrators and other people who want to go look at their code to diagnose problems. The one that I ran into today looks like this:

if (rval != IDM_STATUS_SUCCESS) {
	cmn_err(CE_NOTE, "iscsi connection(%u) unable to "
	    "connect to target %s", icp->conn_oid,
	    icp->conn_sess->sess_name);
	idm_conn_rele(icp->conn_ic);
}

In the name of keeping the source lines under 80 characters wide, the developer here has split the error message into two parts, using modern C's constant string concatenation to have the compiler put them back together.

Perhaps it is not obvious why this is at least really annoying. Suppose that you start with the following error message in your logs:

iscsi connection(60) unable to connect to target <tgtname>

You (the bystander, who is not a developer) would like find the code that produces this error message, so that you can understand the surrounding context. If this error message was on one line in the code, it would be very easy to search for; even if you need to wild-card some stuff with grep, the core string 'unable to connect to target' ought to be both relatively unique and easy to find. But because the message has been split onto multiple source lines, it's not; your initial search will fail. In fact a lot of substrings will fail to find the correct source of this message (eg 'unable to connect'). You're left to search for various substrings of the message, hoping both that they are unique enough that you are not going to be drowned in hits and that you have correctly guessed how the developer decided to split things up or parameterize their message.

(I don't blame developers for parameterizing their messages, but it does make searching for them in the code much harder. Clearly some parts of this message are generated on the fly, but are 'connect' or 'target' among them instead of being constant part of the message? You don't know and have to guess. 'Unable to <X> to <Y> <Z>' is not necessarily an irrational message format string, or you equally might guess 'unable to <X> to target <Z>'.)

The developers doing this are not making life impossible for people, of course. But they are making it harder and I wish they wouldn't. It is worth long lines to be able to find things in source code with common tools.

(Messages aren't the only example of this, of course, just the one that got to me today.)

DontBreakUpMessages written at 00:33:12; Add Comment

2014-09-22

Go is mostly easy to cross-compile (with notes)

One of the things I like about Go is that it's generally very easy to cross-compile from one OS to another; for instance, I routinely build (64-bit) Solaris binaries from my 64-bit Linux instead of having to maintain a Solaris or OmniOS Go compilation environment (and with it all of the associated things I'd need to get my source code there, like a version of git and so on). However when I dug into the full story in order to write this entry, I discovered that there are some gaps and important details.

So let's start with basic cross compilation, which is the easy and usual bit. This is well covered by eg Dave Cheny's introduction. The really fast version looks like this (I'm going to assume a 64-bit Linux host):

cd /some/where
hg clone https://code.google.com/p/go
cd go/src
./all.bash
export GO386=sse2
GOOS=linux GOARCH=386 ./make.bash --no-clean
GOOS=solaris GOARCH=amd64 ./make.bash --no-clean
GOOS=freebsd GOARCH=386 ./make.bash --no-clean

(See Installing Go from source for a full discussion of these environment variables.)

With this done, we can build some program for multiple architectures (and deploy the result to them with just eg scp):

cd $HOME/src/call
go build -o call
GOARCH=386 go build -o call32
GOOS=solaris GOARCH=amd64 go build -o call.solaris

(Add additional architectures to taste.)

This generally works. I've done it for quite some time with good success; I don't think I've ever had such a cross-compiled binary not work right, including binaries that do network things. But, as they say, there is a fly in the ointment and these cross-compiled binaries are not quite equivalent to true natively compiled Go binaries.

Go cross-compilation has one potentially important limit: on some platforms, Linux included, true native Go binaries that use some packages are dynamically linked into the C runtime shared library and some associated shared libraries through Cgo (see also). On Linux I believe that this is necessary to use the true native implementation of anything that uses NSS; this includes hostname lookup, username and UID lookup, and group lookups. I further believe that this is because the native versions of this use dynamically loaded C shared libraries that are loaded by the internals of GNU libc.

Unfortunately, Cgo does not cross-compile (even if you happen to have a working C cross compiler environment on your host, as far as I know). So if you cross-compile Go programs to such targets, the binaries run but they have to emulate the native approach and the result is not guaranteed to give you identical results. Sometimes it won't work at all; for example os/user is unimplemented if you cross-compile to Linux (and all username or UID lookups will fail).

(One discussion of this is in Alan Shreve's article, which was a very useful source for writing this entry.)

Initially I thought this was no big deal for me but it turns out that it potentially is, because compiling for 32-bit Linux on 64-bit Linux is still cross-compiling (as is going the other way, from 32-bit host to 64-bit target). If you build your Go environment on, say, a 64-bit Ubuntu machine and cross-compile binaries for your 32-bit Ubuntu machines, you're affected by this. The sign of this happening is that ldd will report that you have a static executable instead of a dynamic one. For example, on 64-bit Linux:

; ldd call32 call64
call32:
        not a dynamic executable
call64:
        linux-vdso.so.1 =>  (0x00007ffff2957000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f3be5111000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f3be4d53000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f3be537a000)

If you have both 64-bit and 32-bit Linux and you want to build true native binaries on both, at least as far as the standard packages go, you have to follow the approach from Alan Shreve's article. For me, this goes like the following (assuming that you want your 64-bit Linux machine to be the native version, which you may not):

  1. erase everything from $GOROOT/bin and $GOROOT/pkg
  2. run 'cd src; ./make.bash' on the 32-bit machine
  3. rename $GOROOT/pkg/linux_386 to some other name to preserve it
  4. build everything on your 64-bit machine, including the 32-bit cross-compile environment.
  5. delete the newly created $GOROOT/pkg/linux_386 directory hierarchy and restore the native-built version you saved in step 3.

If you're building from source using the exact same version from the Mercurial repository it appears that you can extend this to copying the pkg/$GOOS_$GOARCH directory between systems. I've tested copying both 32-bit Linux and 64-bit Solaris and it worked for me (and the resulting binaries ran correctly in quick testing). This means that you need to build Go itself on various systems but you can get away with doing all of your compilation and cross-compilation only on the most convenient system for you.

(I suspect but don't know that if you have any Cgo-using packages you can copy $GOPATH/pkg/$GOOS_$GOARCH around from system to system to get functioning native versions of necessary packages. Try it and see.)

Even with this road bump the pragmatic bottom line is that Go cross-compilation is easy, useful, and is probably going to work for your Go programs. It's certainly easy enough that you should give it a try just to see if it works for you.

GoCrossCompileNotes written at 23:32:46; Add Comment

2014-09-21

One reason why Go can have methods on nil pointers

I was recently reading an article on why Go can call methods on nil pointers (via) and wound up feeling that it was incomplete. It's hard to talk about 'the' singular reason that Go can do this, because a lot of design decisions went into the mix, but I think that one underappreciated reason this happens is because Go doesn't have inheritance.

In a typical language with inheritance, you can both override methods on child classes and pass a pointer to a child class instance to a function that expects a pointer to the parent class instance (and the function can then call methods on that pointer). This combination implies that the actual machine code in the function cannot simply make a static call to the appropriate parent class method function; instead it must somehow go through some sort of dynamic dispatch process so that it calls the child's method function instead of the parent's when passed what is actually a pointer to a child instance.

In non-nil pointers to objects, you have a natural place to put such a vtable (or rather a pointer to it) because the object has actual storage associated with it. But a nil pointer has no storage associated with it and so you can't naturally do this. That means given a nil pointer, how do you find the correct vtable? After all it might be a nil pointer of the child class that should call child class methods.

Because Go has no inheritance this problem does not come up. If your function takes a pointer to a concrete type and you call t.Method(), the compiler statically knows which exact function you're calling; it doesn't need to do any sort of dynamic lookup. Thus it can easily make this call even when given a nil. In effect the compiler gets to rewrite a call to t.Method() to something like ttype_Method(t).

But wait, you may say. What about interfaces? These have exactly the dynamic dispatch problem I was just talking about. The answer is that Go actually represents interface values that are pointers as two pointers; one which is the actual value and another points to (among other things) a vtable for the interface (which is populated based on the concrete type). Because Go statically knows that it is dealing with an interface instead of a concrete type, the compiler builds code that calls indirectly through this vtable.

(As I found out, this can lead to a situation where what you think is a nil pointer is not actually a nil pointer as Go sees it because it has been wrapped up as an interface value.)

Of course you could do this two-pointer trick with concrete types too if you wanted to, but it would have the unfortunate effect of adding an extra word to the size of all pointers. Most languages are not interested in paying that cost just to enable nil pointers to have methods.

(Go doesn't have inheritance for other reasons; it's probably just a happy coincidence that it enables nil pointers to have methods.)

PS: it follows that if you want to add inheritance to Go for some reason, you need to figure out how to solve this nil pointer with methods problem (likely in a way that doesn't double the size of all pointers). Call this an illustration of how language features can be surprisingly intertwined with each other.

GoNilMethodsWhy written at 01:38:12; Add Comment

2014-09-13

What can go wrong with polling for writability on blocking sockets

Yesterday I wrote about how our performance problem with amandad were caused by amandad doing IO multiplexing wrong by only polling for whether it could read from its input file descriptors and assuming it could always write to its network sockets. But let's ask a question: suppose that amandad was also polling for writability on those network sockets. Would it work fine?

The answer is no, not without even more code changes, because amandad's network sockets aren't set to be non-blocking. The problem here is what it really means when poll() reports that something is ready for write (or for that matter, for read). Let me put it this way:

That poll() says a file descriptor is ready for writes doesn't mean that you can write an arbitrary amount of data to it without blocking.

When I put it this way, of course it can't. Can I write a gigabyte to a network socket or a pipe without blocking? Pretty much any kernel is going to say 'hell no'. Network sockets and pipes can never instantly absorb arbitrary amounts of data; there's always a limit somewhere. What poll()'s readiness indicator more or less means is that you can now write some data without blocking. How much data is uncertain.

The importance of non-blocking sockets is due to an API decision that Unix has made. Given that you can't write an arbitrary amount of data to a socket or a pipe without blocking, Unix has decided that by default when you write 'too much' you get blocked instead of getting a short write return (where you try to write N bytes and get told you wrote less than that). In order to not get blocked if you try a too large write you must explicitly set your file descriptor to non-blocking mode; at this point you will either get a short write or just an error (if you're trying to write and there is no room at all).

(This is a sensible API decision for reasons beyond the scope of this entry. And yes, it's not symmetric with reading from sockets and pipes.)

So if amandad just polled for writability but changed nothing else in its behavior, it would almost certainly still wind up blocking on writes to network sockets as it tried to stuff too much down them. At most it would wind up blocked somewhat less often because it would at least send some data immediately every time it tried to write to the network.

(The pernicious side of this particular bug is whether it bites you in any visible way depends on how much network IO you try to do how fast. If you send to the network (or to pipes) at a sufficiently slow rate, perhaps because your source of data is slow, you won't stall visibly on writes because there's always the capacity for how much data you're sending. Only when your send rates start overwhelming the receiver will you actively block in writes.)

Sidebar: The value of serendipity (even if I was wrong)

Yesterday I mentioned that my realization about the core cause of our amandad problem was sparked by remembering an apparently unrelated thing. As it happens, it was my memory of reading Rusty Russell's POLLOUT doesn't mean write(2) won't block: Part II that started me on this whole chain. A few rusty neurons woke up and said 'wait, poll() and then long write() waits? I was reading about that...' and off I went, even if my initial idea turned out to be wrong about the real cause. Had I not been reading Rusty Russell's blog I probably would have missed noticing the anomaly and as a result wasted a bunch of time at some point trying to figure out what the core problem was.

The write() issue is clearly in the air because Ewen McNeill also pointed it out in a comment on yesterday's entry. This is a good thing; the odd write behavior deserves to be better known so that it doesn't bite people.

PollBlockingWritesBad written at 00:47:49; Add Comment

2014-09-12

How not to do IO multiplexing, as illustrated by Amanda

Every so often I have a belated slow-motion realization about what's probably wrong with an otherwise mysterious problem. Sometimes this is even sparked by remembering an apparently unrelated thing I read in passing. As it happens, that happened the other day.

Let's rewind to this entry, where I wrote about what I'd discovered while looking into our slow Amanda backups. Specifically this was a dump of amandad handling multiple streams of backups at once, which we determined is the source of our slowness. In that entry I wrote in passing:

[Amandad is] also spending much more of its IO wait time writing the data rather than waiting for there to be more input, although the picture here is misleading because it's also making pollsys() calls and I wasn't tracking the time spent waiting in those [...]

This should have set off big alarm bells. If amandad is using poll(), why is it spending any appreciable amount of time waiting for writes to complete? After all the whole purpose of poll() et al is to only be woken when you can do work, so you should spend minimal time blocked in the actual IO functions. The unfortunate answer is that Amanda is doing IO multiplexing wrong, in that I believe it's only using poll() to check for readability on its input FDs, not writability on its output FDs. Instead it tacitly assumes that whenever it has data to read it can immediately write all of this data out with no fuss, muss, or delays.

(Just checking for writability on the network connections wouldn't be quite enough, of course, but that's another entry.)

The problem is that this doesn't necessarily work. You can easily have situations where one TCP stream will accept much more data than another one, or where all (or just most) TCP streams will accept only a modest amount of data at a time; this is especially likely when the TCP streams are connected to different processes on the remote end. If one remote process stalls its TCP stream can stop accepting much more data, at which point a large write to the stream may stall in turn, which stalls all amandad activity even if it could succeed, which stalls upstream activity that is trying to send data to amandad. What amandad's handling of multiplexing does is to put all of the data streams flowing through it at the mercy of whatever is the slowest write stream at any given point in time. If anything blocks everything can block, and sooner or later we seem to wind up in a situation where anything can and does block. The result is a stuttering stop-start process full of stalls that reduces the overall data flow significantly.

In short, what you don't want in good IO multiplexing is a situation where IO on stream A has to wait because you're stalled doing IO on stream B, and that is just what amandad has arranged here.

Correct multiplexing is complex (even in the case of a single flow) but the core of it is never overcommitting yourself. What amandad should be doing is only writing as much output at a time as any particular connection can take, buffering some data internally, and passing any back pressure up to the things feeding it data (by stopping reading its inputs when it cannot write output and it has too much buffered). This would ensure that the multiplexing does no harm in that it can always proceed if any of the streams can proceed.

(Splitting apart the multiplexing into separate processes (or threads) does the same thing; because each one is only handling a single stream, that stream blocking doesn't block any other stream.)

PS: good IO multiplexing also needs to be fair, ie if if multiple streams or file descriptors can all do IO at the current time then all of them do some IO, instead of a constant stream of ready IO on one stream starving IO for other streams. Generally the easiest way to do this is to have your code always process all ready file descriptors before returning to poll(). This is also the more CPU-efficient way to do it.

IOMultiplexingDoneWrong written at 01:10:43; Add Comment


Page tools: See As Normal.
Search:
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.