Wandering Thoughts

2017-06-16

Go interfaces and automatically generated functions

I recently read Golang Internals Part 1: Autogenerated functions (and how to get rid of them) (via, and also), which recounts how Minio noticed an autogenerated function in their stack traces that was making an on-stack copy of a structure before calling a method, and worked out how to eliminate this function. Unfortunately, Minio's diagnosis of why this autogenerated function exists at all is not correct (although their solution is the right one). This matters partly because the reason why this autogenerated function exists exposes a real issue you may want to think about in your Go API design.

Let's start at the basics, which in this case is Go methods. Methods have a receiver, and this receiver can either be a value or a pointer. Your choice here of whether your methods have value receivers or pointer receivers matters for your API; see, for example, this article (via). Types also have a method set, which is simply all of the methods that they have. However, there is a special rule for the method sets of pointer types:

The method set of the corresponding pointer type *T is the set of all methods declared with receiver *T or T (that is, it also contains the method set of T).

(The corollary of this is that every regular type T implicitly creates a pointer type *T with all of T's methods, even if you never mention *T in your code or explicitly define any methods for it.)

It's easy to see how this works. Given a *T, you can always call a method on T by simply dereferencing your *T to get a T value, which means that it's trivial to write out a bunch of *T methods that just call the corresponding T methods:

func (p *T) Something(...) (...) {
  v := *p
  return v.Something(...)
}

Rather than require you to go through the effort of hand-writing all of these methods for all of your *T types, Go auto-generates them for you as necessary. This is exactly the autogenerated function that Minio saw in their stack traces; the underlying real method was cmd.retryStorage.ListDir() (which has a value receiver) and the autogenerated function was cmd.(*retryStorage).ListDir() (which has a pointer receiver, and which did the same dereferencing as our Something example).

But, you might ask, where does the *retryStorage pointer come from? The answer is that it comes from using interface types and values instead of concrete types and values. Here are the relevant bits of the cleanupDir() function that was one step up in Minio's stack trace:

func cleanupDir(storage StorageAPI, volume, dirPath string) error {
  [...]
     entries, err := storage.ListDir(volume, entryPath)
  [...]
}

We're making a ListDir() method call on storage, which is of type StorageAPI. This is an interface type, and therefore storage is an interface value. As Russ Cox has covered in his famous article Go Data Structures: Interfaces, interface values are effectively two-pointer structures:

Interface values are represented as a two-word pair giving a pointer to information about the type stored in the interface and a pointer to the associated data.

When we create a StorageAPI interface value from an underlying retryStorage object, the interface value contains a pointer to the object, not the object itself. When we call a function that takes such an interface value as one of its arguments, we wind up passing it a *retryStorage pointer (among other things). As a result, when we call cleanupDir(), we're effectively creating a situation in the code like this:

type magicI struct {
  tab *_typeDef
  ptr *retryStorage
}

func cleanupDir(storage magicI, ...) error {
  [...]
    // we're trying to call (*retryStorage).ListDir()
    // since what we have is a pointer, not a value.
    entries, err := storage.ptr.ListDir(...)
  [...]
}

Since there is no explicit pointer receiver method (*retryStorage).ListDir() but there is a value receiver method retryStorage.ListDir(), Go calls the autogenerated (*retryStorage).ListDir() method for us (well, for Minio).

This points out an important general rule: calling value receiver methods through interfaces always creates extra copies of your values. Interface values are fundamentally pointers, while your value receiver methods require values; ergo every call requires Go to create a new copy of the value, call your method with it, and then throw the value away. There is no way to avoid this as long as you use value receiver methods and call them through interface values; it's a fundamental requirement of Go.
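
One way to actually see the copy is with a method that tries to modify its receiver. Here's a minimal, self-contained sketch (the Counter, valCounter, and ptrCounter names are made up for illustration, not anything from Minio's code); the value receiver increments a throwaway copy on every call through the interface, while the pointer receiver modifies the real thing:

package main

import "fmt"

type Counter interface {
  Inc()
  Value() int
}

// valCounter has value receiver methods; every call through Counter
// operates on a fresh copy of the struct.
type valCounter struct{ n int }

func (c valCounter) Inc()       { c.n++ } // increments a copy that is then thrown away
func (c valCounter) Value() int { return c.n }

// ptrCounter has pointer receiver methods; calls through Counter go
// straight to the underlying struct.
type ptrCounter struct{ n int }

func (c *ptrCounter) Inc()       { c.n++ }
func (c *ptrCounter) Value() int { return c.n }

func main() {
  var a Counter = valCounter{}  // the interface value holds a copy of the struct
  var b Counter = &ptrCounter{} // the interface value holds a pointer
  a.Inc()
  b.Inc()
  fmt.Println(a.Value(), b.Value()) // prints "0 1"
}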

The conclusion for API design is clear but not necessarily elegant. If your type's methods will always or almost always be called through interface values, you might want to consider using pointer receiver methods instead of value receiver methods even if it's a bit unnatural. Using pointer receiver methods avoids both making a new copy of the value and doing an additional call through the autogenerated conversion method; you go straight to your actual method with no overhead. For obvious reasons, the larger your values are (in terms of the storage they require), the more this matters, because Go has to copy more and more bytes around to create that throwaway value for the method call.

(Of course if you have large types you probably don't want value receiver methods in the first place, regardless of whether or not they wind up being called through interface values. Value receiver methods are best for values that only take up modest amounts of storage, or at least that can be copied around that way.)

Sidebar: How Go passes arguments to functions at the assembly level

In some languages and runtime environments, if you call a function that takes a sufficiently large value as an argument (for example, a large structure), the argument is secretly passed by providing the called function with a pointer to data stored elsewhere instead of writing however many bytes into the stack. Large return values may similarly be returned indirectly (often into a caller-prepared area). At least today, Go is not such a language. All arguments are passed completely on the stack, even if they are large.

This means that Go must always dereference *T pointers into on-stack copies of the value in order to call value receiver T methods. Those T methods fundamentally require their arguments to be on the stack, nowhere else, and this includes the receiver itself (which is passed as a more or less hidden first argument, and things get complicated here).

GoInterfacesAutogenFuncs written at 23:48:43; Add Comment

2017-06-06

A humbling experience of misreading some simple (Go) code

Every so often, I get to have a humbling experience, sometimes in public and sometimes in private. Recently I was reading Go Range Loop Internals (via) and hit its link to this Damian Gryski (@dgryski) tweet:

Today's #golang gotcha: the two-value range over an array does a copy. Avoid by ranging over the pointer instead.

play.golang.org/p/4b181zkB1O

I ran the code on the playground, followed it along, and hit a 'what?' moment where I felt I had a mystery where I didn't understand why Go was doing something. Here is the code:

func IndexValueArrayPtr() {
  a := [...]int{1, 2, 3, 4, 5, 6, 7, 8}

  for i, v := range &a {
    a[3] = 100
    if i == 3 {
      fmt.Println("IndexValueArrayPtr", i, v)
    }
  }
}

Usefully, I have notes about my confusion, and I will put them here verbatim:

why is the IndexValueArrayPtr result '3 100'? v should be copied before a[3] is modified, and v is type 'int', not a pointer.

This is a case of me reading the code that I thought was there instead of the code that was actually present, because I thought the code was there to make a somewhat different point. What I had overlooked in IndexValueArrayPtr (and in fact in all three functions) is that a[3] is set on every pass through the loop, not just when i == 3.

Misreading the code this way makes no difference to the other two examples (you can see this yourself with this variant), but it's crucial to how IndexValueArrayPtr behaves. If the a[3] assignment was inside the if, my notes would be completely true; v would have copied the old value of a[3] before the assignment and this would print '3 4'. But since the assignment happens on every pass of the loop, a[3] has already been assigned to be 100 by the time the loop gets to the fourth element and makes v a copy of it.
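
To make that concrete, here is the variant I was effectively reading (a sketch of my own, not code from the playground link); dropped into the same program, it prints '3 4':

func IndexValueArrayPtrVariant() {
  a := [...]int{1, 2, 3, 4, 5, 6, 7, 8}

  for i, v := range &a {
    if i == 3 {
      // v already copied the old a[3] at the top of this iteration,
      // so the assignment below comes too late to affect it.
      a[3] = 100
      fmt.Println("IndexValueArrayPtrVariant", i, v) // prints '... 3 4'
    }
  }
}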

(I think I misread the code this way partly because setting a[3] only once is more efficient and minimal, and as noted the other two functions still illustrate their particular issues when you do it that way.)

Reading an imaginary, 'idealized' version of the code instead of the real one is not a new thing and it's not unique to me, of course. When you do it on real code in a situation where you're trying to find a bug, it can lead to a completely frustrating time where you literally can't see what's in front of your eyes and then when you can you wonder how you could possibly have missed it for so long.

(I suspect that this is a situation where rubber duck debugging helps. When you have to actually say things out loud, you hopefully get another chance to have your brain notice that what you want to say doesn't actually correspond to reality.)

PS: The reason I have notes on my confusion is that I was planning to turn explaining this 'mystery' into a blog entry. Then, well, I worked out the mystery, so now I've gotten to write a somewhat different sort of blog entry on it.

GoMisreadingSomeCode written at 23:59:29; Add Comment

2017-05-31

Why one git fetch default configuration bit is probably okay

I've recently been reading the git fetch manpage reasonably carefully as part of trying to understand what I'm doing with limited fetches. If you do this, you'll run across an interesting piece of information about the <refspec> argument, including in its form as the fetch = setting for remotes. The basic syntax is '<src>:<dst>', and the standard version that is created by any git clone gives you:

fetch = +refs/heads/*:refs/remotes/origin/*

You might wonder about that + at the start, and I certainly did. Well, it's special magic. To quote the documentation:

The remote ref that matches <src> is fetched, and if <dst> is not empty string, the local ref that matches it is fast-forwarded using <src>. If the optional plus + is used, the local ref is updated even if it does not result in a fast-forward update.

(Emphasis mine.)

When I read this my eyebrows went up, because it sounded dangerous. There are certainly lots of complicated processes around 'git pull' if it detects that it can't fast-forward what it's just fetched, so allowing non-fast-forward fetches (and by default, at that) certainly sounded like something I might want to turn off. So I tried to think carefully about what's going on here, and as a result I now believe that this configuration is mostly harmless and probably what you want.

The big thing is that this is not about what happens with your local branch, eg master or rel-1.8. This is about your repo's copy of the remote branch, for example origin/master or origin/rel-1.8. And it is not even about the branch as such, because branches are really 'refs', symbolic references to specific commits. git fetch maintains refs (here under refs/remotes/origin) for every branch that you're copying from the remote, and one of the things that it does when you fetch updates is update these refs. This lets the rest of Git use them to do things like merge or fast-forward the remote's updates into the local branch that tracks them.

So git fetch's documentation is talking about what it does to these remote-branch refs if the branch on the remote has been rebased or rewound so that it is no longer a more recent version of what you have from your last update of the remote. With the + included in the <refspec>, git fetch always updates your repo's ref for the remote branch to match whatever the remote has; basically it overwrites whatever ref you used to have with the new ref from the remote. After a fetch, your origin/master or origin/rel-1.8 will always be the same as the remote's, even if the remote rebased, rewound, or did other weird things. You can then go on to fix up your local branch in a variety of ways.

(To be technical your origin/master will be the same as origin's master, but you get the idea here.)
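
As an example, here's a hedged sketch of one way you might fix up your local branch after a fetch like this has force-updated origin/master under a rebasing upstream (ordinary commands, and certainly not the only approach):

git fetch origin                 # origin/master now matches the rewritten upstream
git log master..origin/master    # look at what the rewritten upstream has
git rebase origin/master master  # replay your local commits on top of it

# or, to simply mirror the upstream and abandon your local changes:
# git checkout master && git reset --hard origin/master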

This makes the + a reasonable default, because it means that 'git fetch' will reliably mirror even a remote that is rebasing and otherwise having its history rewritten and its branches changed around. Without the +, 'git fetch' might transfer the new and revised commits and trees from your remote but it wouldn't give you any convenient reference for them for you to look at them, integrate them, or just reset your local remote-tracking branch to their new state.

(Without the '+', 'git fetch' won't update your repo's remote-branch refs. I don't know if it writes the new ref information anywhere, perhaps to .git/FETCH_HEAD, or if it just throws it away, possibly after printing out commit hashes.)

Sidebar: When I can imagine not using a '+'

One thing that using a '+' does is sort of allow a remote to effectively delete past history out of your local repo, something that's not normally possible in a DVCS and potentially not desirable. It doesn't do this directly, but it starts an indirect process of it and it certainly makes the old history somewhat annoying to get at.

Git doesn't let a remote directly delete commits, trees, and objects. But unreferenced items in your repo are slowly garbage-collected after a while and when you update your remote-branch refs after a non-ff fetch, the old commits that the pre-fetch refs pointed to start becoming more and more unreachable. I believe they live on in the reflog for a while, but you have to know that they're missing and to look.

If you want to be absolutely sure that you notice any funny business going on in an upstream remote that is not supposed to modify its public history this way, not using '+' will probably help. I'm not sure if it's the easiest way to do this, though, because I don't know what 'git fetch' does when it detects a non-ff fetch like this.

(Hopefully git fetch complains loudly instead of failing silently.)

GitFetchMagicPlus written at 00:44:13; Add Comment

2017-05-29

Configuring Git worktrees to limit what's fetched on pulls

Yesterday I wrote about my practical problem with git worktrees, which is to limit what is fetched from the remote when I do 'git pull' in one (as opposed to the main repo). I also included a sidebar with a theory on how to do this with some Git configuration madness. In a spirit of crazed experimentation I've now put this theory into practice and it appears to actually work. Unfortunately the way I know how to do this requires some hand editing of your .git/config, rather than using commands like 'git remote' to do this for you. However, I don't fully understand what I'm doing here (and that's one reason I'm putting in lots of notes to myself).

Here's my process:

  1. Create a new worktree as normal, based from the origin branch you want:

    git worktree add -b release-branch.go1.8 ../v1.8 origin/release-branch.go1.8
    

    Because we used -b, this will also create a local branch, release-branch.go1.8, that tracks origin's release-branch.go1.8 branch.

    If you already have a release-branch.go1.8 branch (perhaps you've checked it out in your main repo at some point or previously created a worktree for it), this is just:

    git worktree add ../v1.8 release-branch.go1.8
    

  2. Create a new remote for your upstream repo to fetch just this upstream branch:

    git remote add -t release-branch.go1.8 origin-v1.8 https://go.googlesource.com/go
    

    Because we set it up to track only a specific remote branch, 'git fetch' for this remote will only fetch updates for the remote's release-branch.go1.8 branch, even though it has the same URL as our regular origin remote (which will normally fetch all branches).

  3. Edit .git/config to change the fetch = line for origin-v1.8 to fetch the branch into refs/remotes/origin/release-branch.go1.8, which is the fetch destination for your origin remote. That is:

    fetch = +refs/heads/release-branch.go1.8:refs/remotes/origin/release-branch.go1.8
    

    By fetching into refs/remotes/origin like this, my understanding is that we avoid doing duplicate fetches. Whether we do 'git fetch' in our worktree or in the master repo, we'll be updating the same remote branch reference and so we'll only fetch updates for this (remote) branch once. I believe that if you don't do this, 'git pull' or 'git fetch' in the worktree will always report the new updates; you'll never 'lose' an update for the branch by doing a 'git pull' in the master. However, I think you may wind up doing extra transfers.

    (This can be done with git config but I'd rather edit .git/config by hand.)

  4. Edit .git/config again to change the 'remote =' line for your release-branch.go1.8 branch to be origin-v1.8 instead of origin.

    By forcing the remote for the branch, we activate git fetch's restriction on what remote branches will be fetched when we do a 'git pull' or 'git fetch' in a tree with that branch checked out (here, our worktree, but it could be the master repo).

    If you prefer, you can set this with 'git config' instead of by hand editing:

    git config branch.release-branch.go1.8.remote origin-v1.8
    

We can see that this works by comparing 'git fetch -v --dry-run' in the worktree and in the master repo. In the worktree, it will report just an attempt to update origin/release-branch.go1.8. In the master repo, it will (normally) report an attempt to update everything.
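
For the record, after steps 2 through 4 the relevant fragments of .git/config wind up looking something like this (a sketch using this example's names and URL; the merge = line is what Git set up when the branch was created):

[remote "origin-v1.8"]
    url = https://go.googlesource.com/go
    fetch = +refs/heads/release-branch.go1.8:refs/remotes/origin/release-branch.go1.8
[branch "release-branch.go1.8"]
    remote = origin-v1.8
    merge = refs/heads/release-branch.go1.8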

Because everything is attached to our branch configuration for the (local) release-branch.go1.8 branch, not the worktree, this will survive removing and then re-creating the worktree. This may be a feature, or it may be a drawback, since it means that if you delete the worktree and check out release-branch.go1.8 in the master repo, 'git pull' there will now only update it (and not master and other branches as well). We can change back to the normal state of things by updating the remote for the branch back to the normal origin remote:

git config branch.release-branch.go1.8.remote origin

(In general you can flip the state of the branch back and forth as you want. I don't think Git gets confused, although you may.)

GitWorktreeLimitedPulling written at 22:42:43; Add Comment

2017-05-28

My thoughts on git worktrees for me (and some notes on things I tried)

I recently discovered git worktrees and did some experimentation with using them for stuff that I do. The short summary of my experience so far is that while I can see the appeal for certain sorts of usage cases, I don't think git worktrees are a good fit for my situation, and I'm probably going to use completely independent repositories in the future.

My usage case was building my own copies of multiple versions of some project, starting with Go. Especially in the case of a language compiler and its standard library, it's reasonably useful to have the latest development version plus a stable version or two; for example, it gives me an easy way to test if something I'm working on will build on older released versions or if I've let a dependency on some recent bit of the standard library creep in. The initial process of creating a worktree for, say, Go 1.8 is reasonably straightforward:

cd /some/where/go
git worktree add -b release-branch.go1.8 ../v1.8 origin/release-branch.go1.8

What proved tricky for me is updating this v1.8 tree when the Go people update Go 1.8, as they do periodically. My normal way of staying up to date on what changes are happening in the main line of Go is to do 'git pull' in my master repo directory and note the lines that get printed out about fetched updates, eg:

remote: Finding sources: 100% (64/64)
remote: Total 64 (delta 23), reused 64 (delta 23)
Unpacking objects: 100% (64/64), done.
From https://go.googlesource.com/go
   ffab6ab877..d64c49098c  master     -> origin/master

And then I use 'git log ffab6ab877..d64c49098c' to see what changed. The problem with worktrees is that this information is printed by 'git fetch', and normally 'git fetch' updates all branches, both the mainline and, say, a release branch you're following. So I actively don't want to run 'git pull' or 'git fetch' in the worktree directory, because otherwise I will have to remember to stop and look at the mainline updates it's just fetched and reported to me.

What I wound up doing was running 'git pull' in my main go tree and if there was an update to origin/release-branch.go1.8 reported, I'd go to my 'v1.8' directory and do 'git merge --ff-only'. This mostly worked (it blew up on me once for reasons I don't understand), but it means that dealing with a worktree is different than dealing with a normal Git repo directory (including an independently cloned repo). Since 'git pull' and other Git commands work 'normally' in a worktree, I have to explicitly remember that I created something as a worktree (or check to see if .git is a directory to know, since 'git status' doesn't helpfully tell you one way or the other).
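
Concretely, the update dance looked something like this (a sketch using the layout from above; I've spelled out the merge target explicitly, where a bare 'git merge --ff-only' would rely on the branch's upstream configuration):

cd /some/where/go
git pull                  # updates the origin/* refs, including origin/release-branch.go1.8
cd ../v1.8
git merge --ff-only origin/release-branch.go1.8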

(In my current moderate level of Git knowledge and experience, I'm going to avoid writing about the good usage cases I think I see for worktrees. Anyway, one of them is documented in the git-worktree manpage; I note that their scenario uses a worktree for a one-shot branch that's never updated from upstream.)

As mentioned, if I want to see if a particular Git repo is a worktree or not I need to do 'ls -ld .git'. If it's a file, I have a worktree; if it's a directory, with how I currently use Git, it's a full repo. 'git worktree list' will list the main repo and worktrees, but it doesn't annotate things with a 'you are here' marker. Obviously if I used worktrees enough I could write a status command to tell me, but then if I was doing that I could probably write a bunch of commands to do what I want in general.

Sidebar: Excessively clever Git configuration hacking (maybe)

Bearing in mind that I may not understand Git as well as I think I do, as far as I can see which branches 'git fetch' fetches is determined by the configuration of the branch's remote, not by the branch's own configuration. There appear to be two options for fiddling with things here.

The 'obvious' option is to create a second remote (call it, say, 'v1.8-origin') with the same url as origin but a fetch setting that only fetches the particular branch:

fetch = refs/heads/release-branch.go1.8:refs/remotes/origin/release-branch.go1.8

Then I'd switch the remote for the release-branch.go1.8 branch to this new remote.

Git-fetch also has a feature where you can have a per-branch configuration in $GIT_DIR/branches/<branch>; this can be used to name the upstream 'head' (branch) that will be fetched into the local branch. It appears that creating such a file should do the trick, but I can't find people writing about this on the Internet (just many copies of the git-fetch manpage), so I'm wary of assuming that I understand what's going to happen here. Plus, it's apparently a deprecated legacy approach.

(If I understand all of this correctly, either approach would preserve 'git pull' in the main repo (which is on the master branch) always fetching all branches from upstream.)

GitWorktreeThoughts written at 23:08:19; Add Comment

2017-05-12

Where bootstrapping Go with a modern version of Go has gotten faster

Since Go 1.5, building Go from source requires an existing 'bootstrap' Go compiler. For at least a while, the fastest previous Go version to use for this was Go 1.4, the last version written in C and also the version that generally compiled Go source code the fastest. When I wrote up my process of building Go from source, I discovered that using Go 1.7.5 or Go 1.8.1 was actually now a bit faster than using Go 1.4. I mentioned this on Twitter and because the general slowdown in how fast Go compiles code has been one of Dave Cheney's favorite issues, I tagged him in my Tweet. Dave Cheney found that result surprising, so I decided to dig more into the details by adding some crude instrumentation to the process of building Go from source.

Building Go from source has four steps, and this is how I understand them:

  1. ##### Building Go bootstrap tool.

    This builds cmd/dist. It uses your bootstrap version of Go.

  2. ##### Building Go toolchain using <bootstrap go>.

    This builds a bunch of 'bootstrap/*' stuff with cmd/dist, again using your bootstrap Go. My understanding is that this is a minimal Go compiler, assembler, and linker that omits various things in order to guarantee that it can be compiled under Go 1.4.

  3. ##### Building go_bootstrap for host, linux/amd64.

    I believe that this builds the go tool itself and various associated bits and pieces using the bootstrap/* compiler and so on built in step 2. In particular, this does not appear to rebuild the step 2 compiler with itself.

    (There is code to do this in cmd/dist, but it is deliberately disabled.)

  4. ##### Building packages and commands for linux/amd64.

    This builds and rebuilds everything; the full Go compiler, toolchain, go program and its sub-programs, and the entire standard library. I believe it uses the go program from step 3 but the compiler, assembler, and linker from step 2.

If I'm understanding this correctly, this means that as late as step 4 you're still building Go code using a compiler compiled by your initial bootstrap compiler, such as Go 1.4. However, you're using the current Go compiler from stage 3 onwards, not the bootstrap compiler itself; the stage 2 code is the last thing compiled by your bootstrap compiler (and so the last place its compilation speed matters).

So now to timings. I tested building an almost-current version of Go tip (it identifies itself as '+e5bb5e3') using three different bootstrap Go versions: Go 1.4, Go 1.8.1, and Go tip (+482da51). I timed things on a quite powerful server with 96 GB of RAM, Xeon E5-2680 CPUs, and 32 (hyperthreaded) cores. On this server, using Go tip gives a make.bash time of about 24 seconds total, using Go 1.8.1 a time of about 28.5 seconds total, and Go 1.4 a total time of almost 40 seconds. But a more interesting question is where the time is going and which bootstrap compiler wins where:

  • For stage 1, Go 1.4 is still the fastest and Go 1.8.1 the slowest of the three. However this stage takes only a tiny amount of time.

  • For stage 2, Go tip is fastest, followed by Go 1.4, then Go 1.8.1. Go 1.4 uses by far the lowest 'user' time, so the other Go versions are covering up speed issues by using more CPUs.

  • For stage 3, Go tip is slightly faster than Go 1.8.1, and Go 1.4 is clearly third.

  • For stage 4, Go tip and Go 1.8.1 are tied and Go 1.4 is way behind, taking about twice as long (23 seconds versus 11.5 seconds).

My best guess at what is causing Go 1.4 to be slower here is that it simply produces less optimized code than Go 1.8.1 and Go tip. As far as I can see, even the stage 4 compilation is still done using a Go compiler, assembler, and linker that were compiled with the bootstrap compiler, so if the bootstrap compiler produces slow code, they will run slower (despite all three bootstrap compilers compiling the same Go code). This is most visible in stage 4, because stage 4 (re)builds by far the most Go code. Go 1.4's compilation speed no longer helps here because we're not compiling with Go 1.4 itself; we're compiling with the 1.4-built but current (and thus generally slower) Go compiler toolchain.

(I think this explains why stage 3 and stage 4 are so close between Go 1.8.1 and Go tip; there probably is far less difference in code optimization between the two than between either and Go 1.4.)

Based on this, I would expect Go build times to be most clearly improved by a more recent bootstrap compiler on platforms with relatively bad code optimization in Go 1.4. My impression is that ARM may be one such platform.

If you're wondering why Go tip is so much faster than Go 1.8.1 on stage 2, the answer is probably the recently landed changes for Go issue #15756, 'cmd/compile: parallelize compilation'. As of this commit, concurrent backend compilation is enabled by default in the Go tip. Some quick testing suggests that this is responsible for almost all of the speed advantage of Go tip over Go 1.8.1.

(If you want to test this, note that stage 3 and stage 4 will normally use this too, at least if you're testing by building a Go git version after this commit landed. I don't know of an easy way to disable concurrent compilation only in the bootstrap compiler.)

Sidebar: Typical real and user times, in seconds

Here is a little table of typical wall clock ('real') and user mode times, as reported by time, for building with various different bootstrap compilers. In each table cell, the real time is first, then the user time (which is almost always larger).

bootstrap:    Go 1.4         Go 1.8.1       Go tip
stage 1        0.7 /  0.6     1.2 /  1.3     0.8 /  1.4
stage 2        6.6 /  9.8     9.1 / 19.4     4.8 / 19.2
stage 3        7.9 / 15.2     6.8 / 16.1     6.4 / 15.4
stage 4       24.4 / 75.9    11.2 / 84.8    11.6 / 84.5

(The stage 4 numbers between Go 1.8.1 and Go tip are too close to call from run to run. Possibly the stage 3 numbers are as well and I'm basically fooling myself to see a difference.)

Disclaimer: These numbers are not gathered with anything approaching statistical rigor, because I don't have that much energy and make.bash (and cmd/dist) don't make it particularly easy for an outsider to get this sort of data.

For my own memory, if nothing else, all builds were done with everything in /tmp, which is a RAID-0 stripe of two 500 GB Seagate Constellation ST9500620NS SATA drives. With 96 GB, I expect that basically all static data was in kernel disk buffers in RAM all the time, but some things may have been written to disk.

GoBuildWhereTimeGoes written at 02:39:24; Add Comment

2017-05-10

Building the Go compiler from source from scratch (on Unix)

Unlike some languages which are a real tedious pain to build from source, Go is both easy and interesting to build from source, even (and especially) for the latest development version. Building from source can be especially convenient if you want your own personal copy of a current version of Go (or the very latest version) on a system where you don't have the permissions required to install system packages or write to /usr/local. I've seen various recipes for building Go this way, but here is the one I now recommend that you use, with some commentary on why I'm doing it this way.

First off, to build Go you need a working C compiler environment and a reasonably current version of git. Arranging for these is beyond the scope of these instructions; I'm just going to assume that you can build programs in general. Building current versions of Go also requires a working Go compiler, so the from-scratch process of building Go from source needs another working Go compiler. The easiest and currently best source of this second Go compiler is a prebuilt package from the Go people.

My process goes like this:

  1. Make a bootstrap area that you'll use for the bootstrap Go compiler, and fetch the latest prebuilt Go 1.8 package from the official Go downloads area:

    mkdir bootstrap
    cd bootstrap
    wget https://.../<whatever>.tar.gz
    tar -xf <whatever>.tar.gz
    

    You specifically want Go 1.8 (1.8.1 as I write this) because Go compile times took a nose dive from Go 1.5 onwards (the first version of the compiler that was written in Go instead of C) and only recently recovered. It used to be clearly slower to bootstrap Go with versions of Go from 1.5 onwards, but it's now actually slightly faster to do so with Go 1.8.1 instead of with Go 1.4, at least on 64-bit Linux x86.

    (I wound up testing this as part of writing this entry and surprised myself. I used to use Go 1.4 as the bootstrap compiler; I'm now switching to Go 1.8. A quick test suggests that Go 1.7 is also slightly faster than Go 1.4 for this, but Go 1.8 is faster than Go 1.7 so you might as well use it.)

    If your system already has a system version of Go 1.8, you can use that. If the latest version of Go is more recent than Go 1.8 (on your system or released by the Go people or both), it might be better for this. Go 1.9 is probably going to compile Go programs faster than Go 1.8, but predicting the future beyond it is hard.

  2. Get a Git clone of the current master repository:

    cd /some/where
    git clone https://go.googlesource.com/go go
    

  3. Create a little script to build your master version of Go using the version of Go in the bootstrap area; this script lives in go/src. I call my script make-all.bash, and a simple version looks like this:

    #!/bin/bash
    GOROOT_BOOTSTRAP=/some/where/bootstrap/go
    export GOROOT_BOOTSTRAP
    ./all.bash
    

    You can do this by hand but it gets to be a pain to remember the correct setting for $GOROOT_BOOTSTRAP and scripts capture knowledge.

    If you're using a system version of Go instead of your own bootstrap version, the $GOROOT_BOOTSTRAP setting you want is:

    GOROOT_BOOTSTRAP=$(/usr/bin/go env GOROOT)
    

    Or perhaps /usr/local/bin/go, or even /usr/local/go/bin/go.

  4. Build the latest version of Go with this script:

    cd go/src
    ./make-all.bash
    

    You can now add /some/where/go/bin to your path, or symlink the programs there into $HOME/bin if you prefer.

    (As with most compilers, Go does a two-stage build; first it builds itself with your bootstrap Go, and then it rebuilds itself with itself.)

When you want to (re)build the latest version of Go, you simply 'git pull' to update the master tree and then repeat step four.
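
In other words, the routine update cycle is just something like:

cd /some/where/go
git pull
cd src
./make-all.bash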

Future versions of Go will make all of this somewhat easier because they'll permit you to download prebuilt binaries but put them anywhere you want without hassles. Today, it requires somewhat awkward gyrations to download one of the distribution packages but not put it in /usr/local/go, which creates more than one reason to build your own version of Go from source.

Sidebar: Building specific versions of Go

Since the development tree sometimes breaks or has things in it that you don't actually want to use, you may also want to keep around your own copy of, say, the latest officially released Go version, which is Go 1.8.x as I write this. You can do this as a Git worktree derived from your master go repository:

cd /some/where/go
git worktree add -b release-branch.go1.8 ../v1.8 origin/release-branch.go1.8
cd ../v1.8/src
cp ../go/src/make-all.bash .
./make-all.bash

('git branch -r' in your go repo will be useful here. I believe this tree can be updated when the Go people release new updates for Go 1.8, although I'm not completely sure of the best Git way to do it.)

This is different from the binary release that you downloaded to /some/where/bootstrap/go, because it doesn't require any weird steps to use. You can just add /some/where/v1.8/bin at the start of your $PATH and then everything just works, unlike the bootstrap copy, which requires you to set $GOROOT to use it.
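
(For what it's worth, using that bootstrap copy directly requires something like the following, because the unpacked tarball isn't in the /usr/local/go location the binary release expects to live in:

GOROOT=/some/where/bootstrap/go
export GOROOT
PATH=/some/where/bootstrap/go/bin:$PATH
export PATH

which is exactly the sort of hassle you avoid with your own built-from-source copy.)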

By the way, yes, once you build your own version of Go 1.8, you can use it as the bootstrap compiler for the latest development version of Go.

(Even more recursive setups are possible. My version of Go 1.8 that I'm now using as my bootstrap Go compiler was actually bootstrapped with the latest Go development version, because why not.)

GoBuildFromSource written at 03:17:33; Add Comment

2017-04-30

Some more feelings on nondeterministic garbage collection

A while back I wrote an entry about the problem with nondeterministic garbage collection, more or less as part of my views at the time on PyPy. In that entry I was fairly down on nondeterministic GC. I still feel more or less that way about PyPy's garbage collection. Yet at the same time I use and like Go (and I did back then), which very definitely has nondeterministic garbage collection, and I don't find it to be a problem or something that annoys me. When I was revisiting this recently, I found myself wondering what the difference is. Is it just that I like Go enough that I'm unconsciously forgiving it this?

I don't think it's that simple. Instead I think it comes down to what I could call the culture of the language but instead is better described as 'how people write code in practice'. CPython has always had a deterministic garbage collector with prompt garbage collection, and as a result people wrote plenty of code that assumes that behavior and will do various degrees of unfortunate things if it's run in an environment, like PyPy, that violates that assumption. In practice Python programmers have developed and routinely use plenty of idioms that more or less assume deterministic GC; this code may be 'incorrect' in some sense, but it's also common and normal.

(It is correct code for CPython in practice, in that it works and is efficient to write and so on.)

By contrast, Go had nondeterministic GC from the beginning and people have been coding with that in mind from the start. One partial consequence of this is that Go APIs are often carefully designed so that you can mostly avoid allocations if you want to go to the effort, with caller-supplied reusable buffers and so on. Writing such code is even pretty natural and obvious in Go, in a way that it isn't in Python. I'm pretty sure that Go's features, APIs, and coding style have all been shaped by it having nondeterministic GC, in ways that hasn't happened for Python because CPython had deterministic GC.
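
The standard library's io.Reader is the classic shape of this; the caller supplies a buffer that gets reused across calls instead of each call allocating and returning a fresh slice. Here's a minimal sketch of the usual consumption loop (readAll and handleChunk are names I made up for illustration; it assumes 'import "io"'):

// readAll consumes everything from r using a single reused buffer.
func readAll(r io.Reader, handleChunk func([]byte)) error {
  buf := make([]byte, 32*1024)
  for {
    n, err := r.Read(buf) // buf is reused on every iteration
    if n > 0 {
      handleChunk(buf[:n]) // stand-in for whatever you do with the data
    }
    if err == io.EOF {
      return nil
    }
    if err != nil {
      return err
    }
  }
}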

I also suspect that nondeterministic GC simply works better in a language that's explicitly designed to create less memory and object churn. Go has any number of language and compiler features that are partly designed to reduce memory pressure, things like unboxed array members, unboxed variables in general, and escape analysis (to enable cheap stack allocation of values).

(Static typing helps here too, but that's something that has reasons well beyond reducing memory pressure.)

PS: I don't have any directly comparable programs, but in operation this Go program seems to have about the same memory usage as this Python program, based on RSS. They aren't seeing the same load and don't quite do the same thing, but they're as close as I can get unless I get very energetic and rewrite DWiki in Go.

NondeterministicGCII written at 23:09:44; Add Comment

2017-04-27

Understanding Git's model versus understanding its magic

In a comment on my entry on coming to a better understanding of what git rebase does, Ricky suggested I might find Understanding Git Conceptually to be of interest. This provides me with an opportunity to talk about what I think my problem with mastering Git is.

It's worth quoting Charles Duan here:

The conclusion I draw from this is that you can only really use Git if you understand how Git works. Merely memorizing which commands you should run at what times will work in the short run, but it’s only a matter of time before you get stuck or, worse, break something.

I actually feel that I have a relatively good grasp of the technical underpinnings of Git, what many people would call 'how Git works'. To wave my hands a bit, Git is a content addressable store that is used to create snapshots of trees, which are then threaded together in a sequence with commits, and so on and so forth. This lets me nod and go 'of course' about any number of apparently paradoxical things, such as git repositories with multiple initial commits. I don't particularly have this understanding because I worked for it; instead, I mostly have it because I happened to be standing around in the right place at the right time to see Git in its early days.

(There are bits of git whose technicalities I understand less well, like the index. I have probably read a description of the guts of the index at least a few times, but I couldn't tell you off the top of my head how even a simple version of the index works at a mechanical level. It turns out to be covered in this StackOverflow answer; the short version is that the index is a composite of a directory file and a bunch of normal object blobs.)

But in practice Git layers a great deal of magic on top of this technical model of its inner workings. Branches are references to commits (ie, heads) and git advances the reference when you make commits under the right circumstances; simple. Except that some branches have 'upstreams' and are 'remote tracking branches' and so on. All of these pieces of magic are not intrinsic to the technical model (partly because the technical model is a strictly local one), but they are very important for working with Git in many real situations.

It is this magic that I haven't mastered and internalized. For example, I understand what 'git fetch' does to your repository, and I can see why you would want it to update certain branch references so you can find the newly imported commits. But I have to think about why 'git fetch' will update certain branches and not others, and I don't know off the top of my head the settings that control this or how you change them.

It's possible that Git has general patterns in this sort of magic, the way it has general patterns at its technical level. If it does, I have not yet understood enough of the magic to have noticed the general patterns. My personal suspicion is that general patterns do not necessarily exist at this layer, because the commands and operations I think of as part of this layer are actually things that have accreted into Git over time and were written by different people.

(At one point Git had a split between 'porcelain' and 'plumbing', where porcelain was the convenient user interface and was at least partially developed by different people than the core 'plumbing'. And bits of porcelain were developed by different people who had their own mental models for how their particular operation should behave, with git rebase's lack of an option for the branch name of the result being an example.)

In a way my understanding of Git's internals has probably held me back with Git, because it's helped encourage me to have a lackadaisical attitude about learning Git in general. The result is that I make little surgical strikes on manpages and problems, and once I feel I've solved them well enough I go away again. In this I've been mirroring one of the two ways that I approach new programming languages. I've likely reached the point in Git where I should switch over to thoroughly slogging through some parts of it; one weakness that's become obvious in writing this entry is basically everything to do with remote repositories.

GitCoreVersusMagic written at 01:10:48; Add Comment

2017-04-26

Coming to a better understanding of what git rebase does

Although I've used it reasonably regularly, git rebase has so far been a little bit magical to me, as you may be able to tell from my extensive explanation to myself of using it to rebase changes on top of an upstream rebase. In my grand tradition, I'm going to write down what I hope is a better understanding of what it does and how its arguments interact with that.

What git rebase does is take a series of commits, replay them on top of some new commit, and then give the resulting top commit a name so that you can use it. When you use the three argument form with --onto, you are fully specifying all of these. Take this command:

git rebase --onto muennich/master old-muennich master

--onto names the new commit everything will be put onto (usually it's a branch, as it is here), the series of commits that will be replayed is old-muennich..master, and the new name is also master. You don't get a choice about the new name; git rebase always makes your new rebase into your branch, discarding the old value of the branch.

(As far as I can tell there's no technical reason why git rebase couldn't let you specify the branch name of the result; it's just not in the conceptual model the authors have of how it should work. If you need this, you need to manually create a new branch beforehand.)
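
For example, here's a hedged sketch of keeping the old master intact while putting the rebased result on a new branch (reusing the names from the command above; 'rebased-master' is my own invention):

git branch rebased-master master
git rebase --onto muennich/master old-muennich rebased-master

Here git rebase checks out rebased-master, replays old-muennich..rebased-master on top of muennich/master, and moves rebased-master to the result; master itself stays where it was.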

The minimal version has no arguments:

git rebase

This only works on branches with an upstream. It replays your commits from the current branch on top of the current (ie new) upstream, and it determines the range of commits to rebase roughly by finding the closest common ancestor of your commits and the upstream:

A -> B -> C -> D               [origin/master]
      \-> local-1 -> local-2   [master]

In this bad plain text diagram, the upstream added C and D while you have local-1 and local-2. The common point is B, and so B..master describes the commits that will be put on top of origin/master and then your master branch will be switched to them (well, the new version of them).

A rebase is conceptually a push to cherry-pick's pull. In cherry picking, you start on the new clean branch and pull in changes from elsewhere. In rebasing, you start on your 'dirty' local branch and push its changes on top of some other (clean) branch. You then keep the name of your local branch but not its old origin point.
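
To make the symmetry concrete, a rough cherry-pick version of the plain rebase from the diagram above might look like this (a sketch; real git rebase also moves master to the result for you and handles a bunch of details this glosses over):

git checkout -b master-rebased origin/master   # start a clean branch at the new upstream tip (D)
git cherry-pick origin/master..master          # replay local-1 and local-2 on top of it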

If you use the one or two argument form of git rebase, you're explicitly telling rebase what to consider the 'upstream' for both determining the common ancestor commit and for what to put your changes on top of. If I'm understanding this correctly, the following commands are both equivalent to a plain 'git rebase' on your master branch:

git rebase origin/master
git rebase origin/master master

Based on the diagrams in the git-rebase manpage, it looks like the one and two argument forms are most useful for cases where you have multiple local branches and want to shuffle around the relationship between them.

In general the git-rebase manpage has helpful examples combined with extensive ASCII diagrams. If I periodically read it carefully whenever I'm confused, it will probably all sink in eventually.

(Of course, the git manual page that I actually should read carefully several times until it all sinks in and sticks is the one on specifying revisions and ranges for Git. I sort of know what a number of the different forms mean, but in practice it's one part folklore to one part actual knowledge.)

GitRebaseUnderstanding written at 02:01:28; Add Comment
