Wandering Thoughts archives

2015-09-07

Getting gocode based autocompletion working for Go in GNU Emacs

The existing guides and documentation for this are terrible and incomplete for someone who is not already experienced with GNU Emacs Lisp packages (which describes me), so here is what worked for me. I'm going to assume that your $GOPATH is $HOME/go and that $GOPATH/bin is on your $PATH.

  • get gocode itself:
    go get github.com/nsf/gocode
    

    To work in GNU Emacs, gocode needs an auto-completion package; it recommends auto-complete, so that's what I decided to use. If you have that already you're done, but I didn't. At this point you might be tempted to go to the auto-complete website and try to follow directions from there, but you actually don't want to do this because there's an easier way to install it.

  • The easiest way to install auto-complete and its prerequisites is through MELPA, which is an additional package repo for Emacs Lisp packages on top of the default ELPA. To enable MELPA, you need to add a stanza to your .emacs following its getting started guide, generally:

    (require 'package)
    (add-to-list 'package-archives
      '("melpa" . "https://melpa.org/packages/"))
    (package-initialize)
    

  • make sure you have a $HOME/.emacs.d directory. You probably do.

  • (re)start Emacs and run M-x list-packages. Navigate to auto-complete and get it installed. If you're running GNU Emacs in X, you can just click on its name and then on the [Install] button; if you're running in a terminal window, navigating to each thing and then hitting Return on it does the same. This will install auto-complete and its prerequisite package popup, the latter of which is not mentioned on the auto-complete site.

    It's possible to install auto-complete manually, directly from the site or more accurately from the github repo's release page. Do it from MELPA instead; it's easier and less annoying. Even if you install auto-complete manually, you'll still have to use MELPA to install popup itself.

  • Set up the .emacs stanza for gocode:

    (add-to-list 'load-path "~/go/src/github.com/nsf/gocode/emacs")
    (require 'go-autocomplete)
    (require 'auto-complete-config)
    (ac-config-default)
    

    This deliberately uses the go-autocomplete.el from gocode's Go package (and uses it in place), instead of one you might get through eg MELPA. I like this because it means that if (and when) I update gocode, I automatically get the correct and latest version of its Emacs Lisp as well.

Restarting GNU Emacs should then get you autocompletion when writing Go code. You may or may not like how it works and want to keep it; I haven't made up my mind yet. Its usage in X appears to be pretty intuitive but I haven't fully sorted out how it works in text mode (the major way seems to be hitting TAB to cycle through possible auto-completions it offers you).

(Plenty of people seem to like it, though, and I decided I wanted to play with the feature since I've never used a smart IDE-like environment before.)

See also Package management in Emacs: The Good, the Bad, and the Ugly and Emacs: How to Install Packages Using ELPA, MELPA, Marmalade. There are probably other resources too; my Emacs inexperience is showing here.

(As usual, I've written this because if I ever need it again I'll hate myself for not having written it down, especially since the directions here are the result of a whole bunch of missteps and earlier inferior attempts. The whole messy situation led to a Twitter rant.)

GoGocodeEmacsAutocomplete written at 21:34:18

2015-09-03

How I've decided to coordinate multiple git repos for a single project

I'm increasingly using git for my own projects (partly because I keep putting them on Github), and this has brought up a problem. On the one hand, I like linear VCS histories (even if they're lies); I don't plan on having branches be visible in the history of my own repos unless it's clearly necessary. On the other hand, I routinely have multiple copies of my repos spread across multiple machines. In theory I always keep all repos synchronized with each other before I start working in one and make commits. In practice, well, not necessarily, and the moment I screw that up a straightforward git pull/push workflow to propagate changes around creates merges.

My current solution goes like this. First, I elect one repo as the primary repo; this is the repo which I use to push changes to Github, for example. To avoid merge commits ever appearing in it, I set it to only allow fast-forward merges when I do 'git pull', with:

git config pull.ff only

This ensures that if the primary repo and a secondary repo wind up with different changes, a pull from the secondary into the primary will fail instead of throwing me into creating a merge commit that I don't want. To avoid creating merge commits when I pull the primary into secondaries, all other repos are set to rebase on pulls following my standard recipe. This is exactly what I want; if I pull new changes from the primary into a secondary, any changes in the secondary are rebased on top of the primary's stuff and linear history is preserved. I can then turn around and pull the secondary's additional changes back into the primary as a fast-forward.
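
For the record, the 'rebase on pulls' setting in each secondary repo is the single piece of configuration that my standard recipe amounts to (it's covered in more detail in a later entry here):

git config pull.rebase true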

If I use 'git push' to move commits from one repo to another I'm already safe by default, because git push normally refuses to do anything except fast-forward updates of the remote. If it complains, the secondary repo involved needs a rebase. I can either do the rebase with 'git pull' in the secondary repo, or in the primary repo I can push to the remote tracking branch in the secondary with 'git push <machine>:<directory> master:origin/master' and then do a 'git rebase' on the secondary.

(Using a push from the primary usually means that my ssh activity flows the right way. And if I'm pushing frequently I should configure a remote for the secondary or something. I'm not quite hep on git repo remotes and remote tracking branches just yet, though, so that's going to take a bit of fumbling around when I get to it.)

GitMultiRepoWorkflow written at 00:55:03

2015-08-31

CGo's Go string functions explained

As plenty of its documentation will tell you, cgo provides four functions to convert between Go and C types by making copies of the data. They are tersely explained in the CGo documentation; too tersely, in my opinion, because the documentation only covers certain things by implication and omits two very important glaring cautions. Because I made some mistakes here I'm going to write out a longer explanation.

The four functions are:

func C.CString(string) *C.char
func C.GoString(*C.char) string
func C.GoStringN(*C.char, C.int) string
func C.GoBytes(unsafe.Pointer, C.int) []byte

C.CString() is the equivalent of C's strdup() and copies your Go string to a C char * that you can pass to C functions, just as documented. The one annoying thing is that because of how Go and CGo types are defined, calling C.free will require a cast:

// Note that C.free requires <stdlib.h> in the cgo preamble.
cs := C.CString("a string")
C.free(unsafe.Pointer(cs))

Note that Go strings may contain embedded 0 bytes and C strings may not. If your Go string contains one and you call C.CString(), C code will see your string truncated at that 0 byte. This is often not a concern, but sometimes text isn't guaranteed to be free of null bytes.
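
As a tiny illustration (a made-up example, not from any real API):

cs := C.CString("abc\x00def")
// C code will see a 3-byte string, "abc"; the "def" part is silently lost.
C.free(unsafe.Pointer(cs))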

C.GoString() is also the equivalent of strdup(), but for going the other way, from C strings to Go strings. You use it on struct fields and other things that are declared as C char *'s, aka *C.char in Go, and (as we'll see) pretty much nothing else.

C.GoStringN() is the equivalent of C's memmove(), not of any normal C string function. It copies the entire length of the C buffer into a Go string, and it pays no attention to null bytes; more exactly, it copies them too. If you have a struct field that is declared as, say, 'char field[64]' and you call C.GoStringN(&field, 64), the Go string you get will always be 64 bytes long and will probably have a bunch of 0 bytes at the end.

(In my opinion this is a bug in cgo's documentation. It claims that GoStringN takes a C string as the argument, but it manifestly does not, as C strings are null-terminated and GoStringN does not stop at null bytes.)

C.GoBytes() is a version of C.GoStringN() that returns a []byte instead of a string. Since it doesn't claim to be taking a C string as the argument, it's clearer that it is simply a memory copy of the entire buffer.

If you are copying something that is not actually a null terminated C string but is instead a memory buffer with a size, C.GoStringN() is exactly what you want; it avoids the traditional C problem of dealing with 'strings' that aren't actually C strings. However, none of these functions are what you want if you are dealing with size-limited C strings in the form of struct fields declared as 'char field[N]'.

The traditional semantics of a fixed size string field in a struct (a field declared as 'char field[N]' and described as holding a string) are that the string is null terminated if and only if there is room, ie if the string is at most N-1 characters long. If the string is exactly N characters long, it is not null terminated. This is a fruitful source of bugs even in C code and is not a good API, but it is an API that we are generally stuck with. Any time you see such a field and the documentation does not expressly tell you that the field contents are always null terminated, you have to assume that you have this sort of API.

Neither C.GoString() nor C.GoStringN() deals correctly with these fields. Using GoStringN() is the less wrong option; it will merely leave you with N-byte Go strings with plenty of trailing 0 bytes (which you may not notice for some time if you usually just print those fields out; yes, I've done this). Using the tempting GoString() is actively dangerous, because it internally does a strlen() on the argument; if the field lacks a terminating null byte, the strlen() will run away into memory beyond it. If you're lucky you will just wind up with some amount of trailing garbage in your Go string. If you're unlucky, your Go program will take a segmentation fault as strlen() hits unmapped memory.

(In general, trailing garbage in strings is the traditional sign that you have an unterminated C string somewhere.)

What you actually want is the Go equivalent of C's strndup(), which guarantees to copy no more than N bytes of memory but will stop before then if it finds a null byte. Here is my version of it, with no guarantees:

func strndup(cs *C.char, len int) string {
   // Copy all len bytes, then look for a terminating 0 byte.
   s := C.GoStringN(cs, C.int(len))
   i := strings.IndexByte(s, 0)
   if i == -1 {
      // No 0 byte at all; the 'string' fills the entire field.
      return s
   }
   // There is a 0 byte, so cs is a properly terminated C string and
   // C.GoString() will copy only the bytes before it.
   return C.GoString(cs)
}

This code does some extra work in order to minimize extra memory usage due to how Go strings can hold memory. You may want to take the alternate approach of returning a slice of the GoStringN() string. Really sophisticated code might decide which of the two options to use based on the difference between i and len.
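
For comparison, the slicing version I mentioned would look something like this (my own sketch; the strndupSlice name is made up, and note that the sliced string shares its backing bytes with the full len-byte copy, which is the extra memory that the version above avoids holding on to):

func strndupSlice(cs *C.char, len int) string {
   s := C.GoStringN(cs, C.int(len))
   if i := strings.IndexByte(s, 0); i != -1 {
      // The slice shares memory with s, so all len bytes stay allocated.
      return s[:i]
   }
   return s
}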

Update: Ian Lance Taylor showed me the better version:

// This requires <string.h> in the cgo preamble for C.strnlen.
func strndup(cs *C.char, len int) string {
   return C.GoStringN(cs, C.int(C.strnlen(cs, C.size_t(len))))
}

Yes, that's a lot of casts. That's the combination of Go and CGo typing for you.

GoCGoStringFunctions written at 23:49:44

Turning, well copying blobs of memory into Go structures

As before, suppose (not entirely hypothetically) that you're writing a package to connect Go up to something that will provide it with blobs of memory that are actually C structs; these might be mmap()'d files, information from a library, or whatever. Once you have a compatible Go struct, you still have to get the data from a C struct (or raw memory) to the Go struct.

One way to do this is to manually write your own struct copy function that does it field by field (eg 'io.Field = ks_io.field' for each field). As with defining the Go structs by hand, this is tedious and potentially error prone. You can do it and you'll probably have to if the C struct contains unions or other hard to deal with things, but we'd like an easier approach. Fortunately there are two good ones for two different cases. In both cases we will wind up copying the C struct or the raw memory to a Go struct variable that is an exact equivalent of the C struct (or at least we hope it is).

The easy case is when we're dealing with a fixed struct that we have a known Go type for. Assuming that we have a C void * pointer to the original memory area called ks.ks_data, we can adopt the C programmer approach and write:

var io IO
io = *((*IO)(ks.ks_data))
return &io

This casts ks.ks_data to a pointer to an IO struct and then dereferences it to copy the struct itself into the Go variable we made for this. Depending on the C type of ks_data, you may need to use the hammer of unsafe.Pointer() here:

io = *((*IO)(unsafe.Pointer(ks.ks_data)))

At this point, some people will be tempted to skip the copying and just return the 'casted-to-*IO' ks.ks_data pointer. You don't want to do this, because if you return a Go pointer to C data, you're coupling Go and C memory management lifetimes. The C memory must not be freed or reused for something else for as long as Go retains at least one pointer to it, and there is no way for you to find out when the last Go reference goes away so that you can free the C memory. It's much simpler to treat 'C memory' as completely disjoint from 'Go memory'; any time you want to move some information across the boundary, you must copy it. With copying we know we can free ks.ks_data safely the moment the copy is done and the Go runtime will handle the lifetime of the io variable for us.

The more difficult case is when we don't know what structs we're dealing with; we're providing the access package, but it's the callers who actually know what the structs are. This situation might come up in a package for accessing kernel stats, where drivers or other kernel systems can export custom stats structs. Our access package can provide specific support for known structs, but we need an escape hatch for when the caller knows that some specific kernel system is providing a 'struct whatever' and wants to retrieve that (probably into an identical Go struct created through cgo).

The C programmer approach to this problem is memmove(). You can write memmove() in Go with sufficiently perverse use of the unsafe package, but you don't want to. Instead we can use the reflect package to create a generic version of the specific 'cast and copy' code we used above. How to do this wasn't obvious to me until I did a significant amount of flailing around with the package, so I'm going to go through the logic of what we're doing in detail.

We'll start with our call signature:

func (k *KStat) CopyTo(ptri interface{}) error { ... }

CopyTo takes a pointer to a Go struct and copies our C memory in ks.ks_data into the struct. I'm going to omit the reflect-based code to check ptri to make sure it's actually a pointer to a suitable struct in the interests of space, but you shouldn't in real code. Also, there are a whole raft of qualifications you're going to want to impose on what types of fields that struct can contain if you want to at least pretend that your package is somewhat memory safe.
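
(For what it's worth, the basic part of that omitted check might look something like this; this is a sketch of my own rather than the real package code, and it assumes the errors package is imported.)

if v := reflect.ValueOf(ptri); v.Kind() != reflect.Ptr || v.IsNil() || v.Elem().Kind() != reflect.Struct {
   return errors.New("CopyTo: need a non-nil pointer to a struct")
}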

To actually do the copy, we first need to turn this ptri interface value into a reflect.Value that is the destination struct itself:

ptr := reflect.ValueOf(ptri)
dst := ptr.Elem()

We now need to cast ks.ks_data to a Value with the type 'pointer to dst's type'. This is most easily done by creating a new pointer of the right type with the address taken from ks.ks_data:

src := reflect.NewAt(dst.Type(), unsafe.Pointer(ks.ks_data))

This is the equivalent of 'src := ((*IO)(ks.ks_data))' in the type-specific version. Reflect.NewAt is there for doing just this; its purpose is to create pointers for 'type X at address Y', which is exactly the operation we need.

Having created this pointer, we then dereference it to copy the data into dst:

dst.Set(reflect.Indirect(src))

This is the equivalent of 'io = *src' in the type-specific version. We're done.

In my testing, this approach is surprisingly robust; it deals even with structs that I didn't expect it to handle (such as ones with unexported fields). But you probably don't want to count on that; it's safest to give CopyTo() straightforward structs with only exported fields.

On the whole I'm both happy and pleasantly surprised by how easy it turned out to be to use the reflect package here; I expected it to require a much more involved and bureaucratic process. Getting to this final form involved a lot of missteps and unnecessarily complicated approaches, but the final form itself is about as minimal as I could expect. A lot of this is due to the existence of reflect.NewAt(), but there's also that Value.Set() works fine even on complex and nested types.

(Note that while you could use the reflect-based version even for the first, fixed struct type case, my understanding is that the reflect package has not insignificant overheads. By contrast the hard coded fixed struct type code is about as minimal and low overhead as you can get; it should normally compile down to basically a memory copy.)

Sidebar: preserving Go memory safety here

I'm not fully confident that I have this right, but I think that to preserve memory safety in the face of this memory copying you must ensure that the target struct type does not contain any embedded pointers, either explicit ones or ones implicitly embedded into types like maps, chans, interfaces, strings, slices, and so on. Fixed-size arrays are safe because in Go those are just fixed size blocks of memory.

If you copy a C struct containing pointers into a Go struct containing pointers, what you're doing is the equivalent of directly returning the 'casted-to-*IO' ks.ks_data pointer. You've allowed the creation of a Go object that points to C memory and you now have the same C and Go memory lifetime issues. And if some of the pointers are invalid or point to garbage memory, not only is normal Go code at risk of bad things but it's possible that the Go garbage collector will wind up trying to dereference them and take a fault.

(This makes it impossible to easily copy certain sorts of C structures into Go structures. Fortunately such structures rarely appear in this sort of C API because they often raise awkward memory lifetime issues even in C.)

GoMemoryToStructures written at 02:50:28

2015-08-30

Getting C-compatible structs in Go with and for cgo

Suppose, not entirely hypothetically, that you're writing a package to connect Go up to something that will provide it blobs of memory that are C structs. These structs might be the results of making system calls or they might be just informational things that a library provides you. In either case you'd like to pass these structs on to users of your package so they can do things with them. Within your package you can use the cgo provided C.<whatever> types directly. But this is a bit annoying (they don't have native Go types for things like integers, which makes interacting with regular Go code a mess of casts) and it doesn't help other code that imports your package. So you need native Go structs, somehow.

One way is to manually define your own Go version of the C struct. This has two drawbacks; it's tedious (and potentially error-prone), and it doesn't guarantee that you'll wind up with exactly the same memory layout that C has (the latter is often but not always important). Fortunately there is a better approach, and that is to use cgo's -godefs functionality to more or less automatically generate struct declarations for you. The result isn't always perfect but it will probably get you most of the way.

The starting point for -godefs is a cgo Go source file that declares some Go types as being some C types. For example:

// +build ignore

package kstat

// #include <kstat.h>
import "C"

type IO C.kstat_io_t
type Sysinfo C.sysinfo_t

const Sizeof_IO = C.sizeof_kstat_io_t
const Sizeof_SI = C.sizeof_sysinfo_t

(The consts are useful for paranoid people so you can later cross-check the unsafe.Sizeof() of your Go types against the size of the C types.)

If you run 'go tool cgo -godefs <file>.go', it will print out to standard output a bunch of standard Go type definitions with exported fields and everything. You can then save this into a file and use it. If you think the C types may change, you should leave the generated file alone so you won't have a bunch of pain if you have to regenerate it; if the C types are basically fixed, you can annotate the generated output with eg godoc comments. Cgo worries about matching types and it will also insert padding where it existed in the original C struct.

(I don't know what it does if the original C struct is impossible to reconstruct in Go, for instance if Go requires padding where C doesn't. Hopefully it complains. This hope is one reason you may want to check those sizeofs afterwards.)
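
(If you do want that cross-check, it can be as small as something like the following somewhere in the package or its tests; this is a sketch that assumes the Sizeof_* constants get carried through into the -godefs output and that unsafe is imported.)

func init() {
   if unsafe.Sizeof(IO{}) != Sizeof_IO {
      panic("kstat: Go IO type does not match the size of C's kstat_io_t")
   }
}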

The big -godefs limitation is the same limitation as cgo has in general: it has no real support for C unions, since Go doesn't have them. If your C struct has unions, you're on your own to figure out how to deal with them; I believe cgo translates them as appropriately sized uint8 arrays, which is not too useful if you actually want to access the contents.

There are two wrinkles here. Suppose you have one struct type that embeds another struct type:

struct cpu_stat {
   struct cpu_sysinfo  cpu_sysinfo;
   struct cpu_syswait  cpu_syswait;
   struct vminfo       cpu_vminfo;
};

Here you have to give cgo some help, by creating Go level versions of the embedded struct types before the main struct type:

type Sysinfo C.struct_cpu_sysinfo
type Syswait C.struct_cpu_syswait
type Vminfo  C.struct_cpu_vminfo

type CpuStat C.struct_cpu_stat

Cgo will then be able to generate a proper Go struct with embedded Go structs in CpuStat. If you don't do this, you get a CpuStat struct type that has incomplete type information; the 'Sysinfo' et al fields in it will refer to types called _Ctype_... that aren't defined anywhere.

(By the way, I do mean 'Sysinfo' here, not 'Cpu_sysinfo'. Cgo is smart enough to take that sort of commonly seen prefix off of struct field names. I don't know what its algorithm is for doing this, but it's at least useful.)

The second wrinkle is embedded anonymous structs:

struct mntinfo_kstat {
   ....
   struct {
      uint32_t srtt;
      uint32_t deviate;
   } m_timers[4];
   ....
};

Unfortunately cgo can't deal with these at all. This is issue 5253, and you have two options. The first is that at the moment, the proposed CL fix still applies to src/cmd/cgo/gcc.go and works (for me). If you don't want to build your own Go toolchain (or if the CL no longer applies and works), the other solution is to create a new C header file that has a variant of the overall struct that de-anonymizes the embedded struct by creating a named type for it:

struct m_timer {
   uint32_t srtt;
   uint32_t deviate;
};

struct mntinfo_kstat_cgo {
   ....
   struct m_timer m_timers[4];
   ....
};

Then in your Go file:

...
// #include "myhacked.h"
...

type MTimer C.struct_m_timer
type Mntinfo C.struct_mntinfo_kstat_cgo

Unless you made a mistake, the two C structs should have the same sizes and layouts and thus be totally compatible with each other. Now you can use -godefs on your version, remembering to make an explicit Go type for m_timer due to the first wrinkle. If you feel bold (and you don't think you'll need to regenerate things), you can then reverse this process in the generated Go file, re-anonymizing the MTimer type into the overall struct (since Go supports that perfectly well). Since you're not changing the actual contents, just where types are declared, the result should be layout-identical to the original.

PS: the file that's input to -godefs is set to not be built by the normal 'go build' process because it is only used for this godefs generation. If it gets included in the build, you'll get complaints about multiple definitions of your (Go) types. The corollary to this is that you don't need to have this file and any supporting .h files in the same directory as your regular .go files for the package. You can put them in a subdirectory, or keep them somewhere entirely separate.

(I think the only thing the package line does in the godefs .go file is set the package name that cgo will print in the output.)

GoCGoCompatibleStructs written at 03:37:54

2015-08-17

Why languages like 'declare before use' for variables and functions

I've been reading my way through Lisp as the Maxwell's equations of software and ran into this 'problems for the author' note:

As a general point about programming language design it seems like it would often be helpful to be able to define procedures in terms of other procedures which have not yet been defined. Which languages make this possible, and which do not? What advantages does it bring for a programming language to be able to do this? Are there any disadvantages?

(I'm going to take 'defined' here as actually meaning 'declared'.)

To people with certain backgrounds (myself included), this question has a fairly straightforward set of answers. So here's my version of why many languages require you to declare things before you use them. We'll come at it from the other side, by asking what your language can't do if it allows you to use things before declaring them.

(As a digression, we're going to assume that we have what I'll call an unambiguous language, one where you don't need to know what things are declared as in order to know what a bit of code actually means. Not all languages are unambiguous; for example C is not (also). If you have an ambiguous language, it absolutely requires 'declare before use' because you can't understand things otherwise.)

To start off, you lose the ability to report a bunch of errors at the time you're looking at a piece of code. Consider:

lvar = ....
res = thang(a, b, lver, 0)

In basically all languages, we can't report the lver for lvar typo (we have to assume that lver is an unknown global variable), we don't know if thang is being called with the right number of arguments, and we don't even know if thang is a function instead of, say, a global variable. Or if it even exists; maybe it's a typo for thing. We can only find these things out when all valid identifiers must have been declared; in fully dynamic languages like Lisp and Python, that's 'at the moment where we reach this line of code during execution'. In other languages we might be able to emit error messages only at the end of compiling the source file, or even when we try to build the final program and find missing or wrong-typed symbols.

In languages with typed variables and arguments, we don't know if the arguments to thang() are the right types and if thang() returns a type that is compatible with res. Again we'll only be able to tell when we have all identifiers available. If we want to do this checking before runtime, the compiler (or linker) will have to keep track of the information involved for all of these pending checks so that it can check things and report errors once thang() is defined.

Some typed languages have features for what is called 'implicit typing', where you don't have to explicitly declare the types of some things if the language can deduce them from context. We've been assuming that res is pre-declared as some type, but in an implicit typing language you could write something like:

res := thang(a, b, lver, 0)
res = res + 20

At this point, if thang() is undeclared, the type of res is also unknown. This will ripple through to any code that uses res, for example the following line here; is that line valid, or is res perhaps a complex structure that can in no way have 20 added to it? We can't tell until later, perhaps much later.

In a language with typed variables and implicit conversions between some types, we don't know what type conversions we might need in either the call (to convert some of the arguments) or the return (to convert thang()'s result into res's type). Note that in particular we may not know what type the constant 0 is. Even languages without implicit type conversions often treat constants as being implicitly converted into whatever concrete numeric type they need to be in any particular context. In other words, thang()'s last argument might be a float, a double, a 64-bit unsigned integer, a 32-bit signed integer, or whatever, and the language will convert the 0 to it. But it can only know what conversion to do once thang() is declared and the types of its arguments are known.

This means that a language with any implicit conversions at all (even for constants like 0) can't actually generate machine code for this section until thang() is declared even under the best of circumstances. However, life is usually much worse for code generation than this. For a start, most modern architectures pass and return floating point values in different ways than integer values, and they may pass and return more complex values in a third way. Since we don't know what type thang() returns (and we may not know what types the arguments are either, cf lver), we basically can't generate any concrete machine code for this function call at the time we parse it even without implicit conversions. The best we can do is generate something extremely abstract with lots of blanks to be filled in later and then sit on it until we know more about thang(), lver, and so on.

(And implicit typing for res will probably force a ripple effect of abstraction on code generation for the rest of the function, if it doesn't prevent it entirely.)

This 'extremely abstract' code generation is in fact what things like Python bytecode are. Unless the bytecode generator can prove certain things about the source code it's processing, what you get is quite generic and thus slow (because it must defer a lot of these decisions to runtime, along with checks like 'do we have the right number of arguments').

So far we've been talking about thang() as a simple function call. But there are a bunch of more complicated cases, like:

res = obj.method(a, b, lver, 0)
res2 = obj1 + obj2

Here we have method calls and operator overloading. If obj, obj1, and/or obj2 are undeclared or untyped at this point, we don't know if these operations are valid (the actual obj might not have a method() method) or what concrete code to generate. We need to generate either abstract code with blanks to be filled in later or code that will do all of the work at runtime via some sort of introspection (or both, cf Python bytecode).

All of this prepares us to answer the question about what sort of languages require 'declare before use': languages that want to do good error reporting or (immediately) compile to machine code or both without large amounts of heartburn. As a pragmatic matter, most statically typed languages require declare before use because it's simpler; such languages either want to generate high quality machine code or at least have up-front assurances about type correctness, so they basically fall into one or both of those categories.

(You can technically have a statically typed language with up-front assurances about type correctness but without declare before use; the compiler just has to do a lot more work and it may well wind up emitting a pile of errors at the end of compilation when it can say for sure that lver isn't defined and you're calling thang() with the wrong number and type of arguments and so on. In practice language designers basically don't do that to compiler writers.)

Conversely, dynamic languages without static typing generally don't require declare before use. Often the language is so dynamic that there is no point. Carefully checking the call to thang() at the time we encounter it in the source code is not entirely useful if the thang function can be completely redefined (or deleted) by the time that code gets run, which is the case in languages like Lisp and Python.

(In fact, given that thang can be redefined by the time the code is executed we can't even really error out if the arguments are wrong at the time when we first see the code. Such a thing would be perfectly legal Python, for example, although you really shouldn't do that.)

WhyDeclareBeforeUse written at 01:03:28

2015-08-04

A lesson to myself: commit my local changes in little bits

For quixotic reasons, I recently updated my own local version of dmenu to the upstream version, which had moved on since I last did this (most importantly, it gained support for Xft fonts). Well, the upstream version plus my collection of bugfixes and improvements. In the process of doing this I have (re)learned a valuable lesson about how I want to organize my local changes to upstream software.

My modifications to dmenu predate my recent decision to commit local changes instead of just carrying them uncommitted on top of the repo. So the first thing I did was to just commit them all in a single all-in-one changeset, then fetch upstream and rebase. This had rebase conflicts, of course, so I resolved them and built the result. This didn't entirely work; some of my modifications clearly hadn't taken. Rather than try to patch the current state of my modifications, I decided to punt and do it the right way; starting with a clean copy of the current upstream, I carefully separated out each of my modifications and added them as separate changes and commits. This worked and wasn't much effort (although there was a certain amount of tedium).

Now, a certain amount of the improvement here is simply that I was porting all of my changes into the current codebase instead of trying to do a rebase merge. This is always going to give you a better chance to evaluate and test things. But that actually kind of points to a problem; because I had my changes in a single giant commit, everything was tangled together and I couldn't see quite clearly enough to do the rebase merge right. Making each change independently made things much clearer and easier to follow, and I suspect that that would have been true even in a merge. The result is also easier for me to read in the future, since each change is now something I can inspect separately.

All of this is obvious to people who've been dealing with VCSes and local modifications, of course. And in theory I knew it too, because I've read all of those homilies to good organization of your changes. I just hadn't stubbed my toe on doing it the 'wrong' way until now (partly because I hadn't been committing changes at all until recently).

(Of course, this is another excellent reason to commit local changes instead of carrying them uncommitted. Uncommitted local changes are basically intrinsically all mingled together.)

Having come to my senses here, I have a few more programs with local hacks that I need to do some change surgery on.

(I've put my version of dmenu up on github, where you can see and cherry pick separate changes if desired. I expect to rebase this periodically, when upstream updates and I notice and care. As before, I have no plans to try to push even my bugfixes to the official release, but interested parties are welcome to try to push them upstream.)

Sidebar: 'git add -p' and this situation

In theory I could have initially committed my big ball of local changes as separate commits with 'git add -p'. In practice this would have required disentangling all of the changes from each other, which would have required understanding code I hadn't touched for two years or so. I was too impatient at the start to do that; I hoped that 'commit and rebase' would be good enough. When it wasn't, restarting from scratch was easier because it let me test each modification separately as I made it.

Based on this, my personal view is that I'm only going to use 'git add -p' when I've recently worked on the code and I'm confident that I can accurately split changes up without needing to test the split commits to make sure each is correct on its own.

CommitLittleChanges written at 02:47:15

2015-07-29

My workflow for testing Github pull requests

Every so often a Github-based project I'm following has a pending pull request that might solve a bug or otherwise deal with something I care about, and it needs some testing by people like me. The simple case is when I am not carrying any local changes; it is adequately covered by part of Github's Checking out pull requests locally (skip to the bit where they talk about 'git fetch'). A more elaborate version is:

git fetch origin pull/<ID>/head:origin/pr/<ID>
git checkout pr/<ID>

That creates a proper remote branch and then a local branch that tracks it, so I can add any local changes to the PR that I turn out to need and then keep track of them relative to the upstream pull request. If the upstream PR is rebased, well, I assume I get to delete my remote branch and then re-fetch it and probably do other magic. I'll cross that bridge when I reach it.

The not so simple case is when I am carrying local changes on top of the upstream master. In the fully elaborate case I actually have two repos, the first being a pure upstream tracker and the second being a 'build' repo that pulls from the first repo and carries my local changes. I need to apply some of my local changes on top of the pull request while skipping others (in this case, because some of them are workarounds for the problem the pull request is supposed to solve), and I want to do all of this work on a branch so that I can cleanly revert back to 'all of my changes on top of the real upstream master'.

The workflow I've cobbled together for this is:

  • Add the Github master repo if I haven't already done so:
    git remote add github https://github.com/zfsonlinux/zfs.git

  • Edit .git/config to add a new 'fetch =' line so that we can also fetch pull requests from the github remote, where they will get mapped to the remote branches github/pr/NNN. This will look like:
    [remote "github"]
       fetch = +refs/pull/*/head:refs/remotes/github/pr/*
       [...]

    (This comes from here.)

  • Pull down all of the pull requests with 'git fetch github'.

    I think an alternative to configuring and fetching all pull requests is the limited version I did in the simple case (changing origin to github in both occurrences), but I haven't tested this. At the point that I have to do this complicated dance I'm in a 'swatting things with a hammer' mode, so pulling down all PRs seems perfectly fine. I may regret this later.

  • Create a branch from master that will be where I build and test the pull request (plus my local changes):
    git checkout -b pr-NNN

    It's vitally important that this branch start from master and thus already contain my local changes.

  • Do an interactive rebase relative to the upstream pull request:
    git rebase -i github/pr/NNN

    This incorporates the pull request's changes 'below' my local changes to master, and with -i I can drop conflicting or unneeded local changes. Effectively it is much like what happens when you do a regular 'git pull --rebase' on master; the changes in github/pr/NNN are being treated as upstream changes and we're rebasing my local changes on top of them.

  • Set the upstream of the pr-NNN branch to the actual Github pull request branch:
    git branch -u github/pr/NNN

    This makes 'git status' report things like 'Your branch is ahead of ... by X commits', where X is the number of local commits I've added.

If the pull request is refreshed, my current guess is that I will have to fully discard my local pr-NNN branch and restart from fetching the new PR and branching off master. I'll undoubtedly find out at some point.

Initially I thought I should be able to use a sufficiently clever invocation of 'git rebase' to copy some of my local commits from master on to a new branch that was based on the Github pull request. With work I could get the rebasing to work right; however, it always wound up with me on (and changing) the master branch, which is not what I wanted. Based on this very helpful page on what 'git rebase' is really doing, what I want is apparently impossible without explicitly making a new branch first (and that new branch must already include my local changes so they're what gets rebased, which is why we have to branch from master).

This is probably not the optimal way to do this, but having hacked my way through today's git adventure game I'm going to stop now. Feel free to tell me how to improve this in comments.

(This is the kind of thing I write down partly to understand it and partly because I would hate to have to derive it again, and I'm sure I'll need it in the future.)

Sidebar: Why I use two repos in the elaborate case

In the complex case I want to both monitor changes in the Github master repo and have strong control over what I incorporate into my builds. My approach is to routinely do 'git pull' in the pure tracking repo and read 'git log' for new changes. When it's time to actually build, I 'git pull' (with rebasing) from the tracking repo into the build repo and then proceed. Since I'm pulling from the tracking repo, not the upstream, I know exactly what changes I'm going to get in my build repo and I'll never be surprised by a just-added upstream change.

In theory I'm sure I could do this in a single repo with various tricks, but doing it in two repos is much easier for me to keep straight and reliable.

GithubPRTestingWorkflow written at 23:08:58

2015-07-07

The Git 'commit local changes and rebase' experience is a winning one

I mentioned recently that I'd been persuaded to change my ways from leaving local changes uncommitted in my working repos to committing them and rebasing on pulls. When I started this, I didn't expect it to be any real change from the experience of pulling with uncommitted changes and maybe stashing them every so often and so on; I'd just be doing things the proper and 'right' Git way (as everyone told me) instead of the sloppy way.

I was wrong. Oh, certainly the usual experience is the same; I do a 'git pull', I get my normal pull messages and stats output, and Git adds a couple of lines at the end about automatically rebasing things. But with local commits and rebasing, dealing with conflicts after a pull is much better. This isn't because I have fewer or simpler changes to merge; it's simply because the actual user interface and process are significantly nicer. There's very little fuss and muss; I fire up my editor on a file or two, I look for the '<<<<' markers, I sort things out, I can get relatively readable diffs, and then I can move on smoothly.

(And the messages from git during rebasing are actually quite helpful.)

Re-applying git stashes that had conflicts with the newly pulled code was not as easy or as smooth, at least for the cases that I dealt with. My memory is that it was harder to see my changes and harder to integrate them, and also sometimes I had to un-add things from the index that git stash had apparently automatically added for me. I felt far less in control of the whole process than I do now with rebasing.

(And with rebasing, the git reflog means that if I need to I can revert my repo to the pre-pull state and see exactly how things were organized in the old code and what the code did with my changes integrated. Sometimes this is vital if there's been a significant restructuring of upstream code. In the past with git stash, I've been lucky because I had an intact pre-pull copy of the repo (with my changes) on a second machine.)

I went into this expecting to be neutral on the change to 'commit and rebase on pulls'. I've now wound up quite positive on it; I actively like and prefer to be fixing up a rebase to fixing up a git stash. Rebasing really is better, even if I just have a single small and isolated change.

(And thank you to the people who patiently pushed me towards this.)

GitCommitAndRebaseBetter written at 02:07:44

2015-07-03

Some notes on my 'commit local changes and rebase' Git workflow

A month or so ago I wrote about how I don't commit changes in my working repos and in reaction to it several people argued that I ought to change my way. Well, never let it be said that I can't eventually be persuaded to change my ways, so since then I've been cautiously moving to committing my changes and rebasing on pulls in a couple of Git repos. I think I like it, so I'm probably going to make it my standard way of working with Git in the future.

The Git configuration settings I'm using are:

git config pull.rebase true
git config rebase.stat true

The first just makes 'git pull' be 'git pull --rebase'. If I wind up working with multiple branches in repos, I may need to set this on a per-branch basis or something; so far I just track origin/master so it works for me. The second preserves the normal 'git pull' behavior of showing a summary of updates, which I find useful for keeping an eye on things.
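
(As for the per-branch possibility: git has a per-branch version of this setting, which I believe looks like the following, although I haven't needed it yet.)

git config branch.<branch>.rebase true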

One drawback of doing things this way is that 'git pull' will now abort if there are also uncommitted changes in the repo, such as I might have for a very temporary hack or test. I need to remember to either commit such changes or do 'git stash' before I pull.

(The other lesson here is that I need to learn how to manipulate rebase commits so I can alter, amend, or drop some of them.)

Since I've already done this once: if I have committed changes in a repo without this set, and use 'git pull' instead of 'git pull --rebase', one way to abort the resulting unwanted merge is 'git reset --hard HEAD'. Some sources suggest 'git reset --merge' or 'git merge --abort' instead. But really I should set pull rebasing to on the moment I commit my own changes to a repo.

(There are a few repos around here that now need this change.)

I haven't had to do a bisection on a commit-and-rebase repo yet, but I suspect that bisection won't go well if I actually need my changes in all versions of the repo that I build and test. If I wind up in this situation I will probably temporarily switch to uncommitted changes and use of 'git stash', probably in a scratch clone of the upstream master repo.

(In general I like cloning repos to keep various bits of fiddling around in them completely separate. Sure, I probably could mix various activities in one repo without having things get messed up, but a different directory hierarchy that I delete afterwards is the ultimate isolation and it's generally cheap.)

GitCommitAndRebaseNotes written at 01:21:50

