Wandering Thoughts

2019-04-10

A Git tool that I'd like and how I probably use Git differently from most people

For a long time now, I've wished for what has generally seemed to me like a fairly straightforward and obvious Git tool. What I want is a convenient way to page through all of the different versions of a file over time, going 'forward' and 'backward' through them. Basically this would be the whole file version of 'git log -p FILE', although it couldn't have the same interface.

(I know that the history may not be linear. There are various ways to cope with this, depending on how sophisticated an interface you're presenting.)

When I first started wanting this, it felt so obvious that I couldn't believe it didn't already exist. Going through past versions of a file was something that I wanted to do all the time when I was digging through repositories, and I didn't get why no one else had created this. Now, though, I think that my unusual desire for this is one of the signs that I use Git repositories differently from most people, because I'm coming at them as a sysadmin instead of as a developer. Or, to put it another way, I'm reading code as an outsider instead of an insider.

When you're an insider to code, when you work on the code in the repository you're reading, you have enough context to readily understand diffs and so 'git log -p' and similar diff-based formats (such as 'git show' of a commit) are perfectly good for letting you understand what the code did in the past. But I almost never have that familiarity with a Git repo I'm investigating. I barely know the current version of the file, the one I can read in full in the repo; I completely lack the contextual knowledge to mentally apply a diff and read out the previous behavior of the code. To understand the previous behavior of the code, I need to read the full previous code. So I wind up wanting a convenient way to get that previous version of a file and to easily navigate through versions.
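
(To make this concrete, here is a rough sketch of the sort of thing I want, done with plain git commands. It assumes a linear history and uses '$file' as a stand-in for a repo-root relative path; it just dumps every full historical version into numbered files that I can then page through:)

n=0
git rev-list HEAD -- "$file" | while read rev; do
  # each output file is one complete version of the file, newest first
  git show "$rev:$file" > "$n.$rev.txt"
  n=$((n + 1))
done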

(There are a surprising number of circumstances where understanding something about the current version of a piece of code requires me to look at what it used to do.)

I rather suspect that most people using Git are developers instead of people spelunking the depths of unfamiliar codebases. Developers likely don't have much use for viewing full versions of a file over time (or at least it's not a common need), so it's probably not surprising that there doesn't seem to be a tool for this (or at least not an easily found one).

(Github has something that comes close to this, with the 'view blame prior to this change' feature in its blame view of a particular file. But this is not quite the same thing, although it is handy for my sorts of investigations.)

GitViewFileOverTimeWish written at 01:27:06; Add Comment

2019-04-08

An example of a situation where Go interfaces can't substitute for generics

I recently read Why Go Contracts Are A Bad Idea In The Light Of A Changing Go Community (via). I have some views on this, but today I want to divert from them to touch on one thing I saw in the article (and that I believe I've seen elsewhere).

In the article, the author cites the stringer contract example from the draft proposal:

func Stringify(type T stringer)(s []T) (ret []string) {
  for _, v := range s {
    ret = append(ret, v.String())
  }
  return ret
}

contract stringer(x T) {
  var s string = x.String()
}

The author then says:

All that contracts are good for is ensuring properties of types. In this particular case it could (and should) be done simpler with the Stringer interface.

There are two ways to read this (and I don't know which one the author intended, so I am using their article as a springboard). The first way is that the contract is a roundabout way of saying that the type T must satisfy the Stringer interface, and we should be able to express this type restriction directly. I don't entirely disagree with this, but I also don't think Go currently has any particularly compact and clear way of expressing it. Perhaps there should be a special syntax for it in a world with generics, although that depends on how many contracts would basically just specify required methods versus other requirements on types (such as comparability or addability).

The other way of reading this is to say that our Stringify() example as a whole should be rewritten to use interfaces and not generics. Unfortunately this isn't possible; you can't write a function that behaves the same way using interfaces. This is because a non-generic function using interfaces must have the type signature:

func Stringify(s []Stringer) (ret []string)

The problem with this type signature is the famous and long-standing Go issue that you cannot convert a slice of some type into a slice of some interface, even if the type satisfies the interface. The power of the generic version of Stringify is that it works directly on your existing slice of elements of some concrete type; you do not have to manually create a new slice of those elements converted into interface values.
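
(Here is a hedged sketch of what you're forced to write instead; the MyInt type is made up purely for illustration:)

package main

import (
  "fmt"
  "strconv"
)

// the interface-based Stringify must take a slice of interface values
func Stringify(s []fmt.Stringer) (ret []string) {
  for _, v := range s {
    ret = append(ret, v.String())
  }
  return ret
}

// MyInt is a made-up concrete type that satisfies fmt.Stringer.
type MyInt int

func (m MyInt) String() string { return strconv.Itoa(int(m)) }

func main() {
  nums := []MyInt{1, 2, 3}
  // Stringify(nums) does not compile; a []MyInt is not a []fmt.Stringer.
  // Instead we must build a new slice of interface values by hand.
  ss := make([]fmt.Stringer, len(nums))
  for i, v := range nums {
    ss[i] = v // each element becomes a separate interface value
  }
  fmt.Println(Stringify(ss)) // prints [1 2 3]
}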

The larger problem is that creating an interface value for every existing value is not free (even beyond the cost of a new slice to hold them all). At a minimum it churns memory and makes extra work for the garbage collector. If you're starting with concrete values that are not pointers, you'll hit other efficiency issues as well when your Stringify calls the String() receiver method on your type.

The attraction of generics in this situation is not merely being able to implement generic algorithms in a way that is free of the sort of flailing around with interfaces that we see in the current sort package. It is also that the actual implementation should be efficient, ideally as efficient as a version written for the specific type you're using. By their nature, interfaces cannot deliver this level of efficiency; they always involve an extra layer of interface values and indirection.

(Even if you don't care about the efficiency, the need to transform your slice of T elements to a slice of Stringer interface values requires you to write some boring and thus annoying code.)

GoInterfacesVsGenerics written at 21:36:27; Add Comment

2019-03-08

Exploring how and why interior pointers in Go keep entire objects alive

Recently I was reading Things that surprised me in go (via), where one of the surprising things is that an interior pointer to a struct's field keeps the entire struct from being garbage collected. Here is a cut down and slightly changed version of the example from the article:

func getInteriorPointer() *int {
    const aLot = 1 << 20 // a stand-in size so the example compiles
    type bigStruct struct {
        smallThing   int
        someBigThing [aLot]int
    }
    b := bigStruct{smallThing: 3}
    return &b.smallThing
}

After this returns, the entire bigStruct will remain in memory, not just the int of smallThing. As the author notes, in a hypothetical perfect GC this wouldn't happen; only the int would remain.

The direct reason that Go works this way is that it puts all of the data for a compound object like a struct or an array into one block of memory. This makes for efficient compound objects (both in memory use and in access speed), at the cost of keeping the entire object alive as a unit. There are a number of versions of this, not just with structs; for example, a substring sliced out of a string keeps the entire original string alive, not just the portion you've sliced out. The same is true of slices of arrays (such as arrays of bytes), even if you've explicitly limited the capacity of the slice so it can't be quietly grown using the original backing array.
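
(As a hedged illustration of the string case, this little program keeps roughly a megabyte reachable through a four-byte substring:)

package main

import "fmt"

// tail returns the last four bytes of s. The result shares s's backing
// storage, so the entire original string stays alive for as long as the
// returned substring is reachable.
func tail(s string) string {
  return s[len(s)-4:]
}

func main() {
  big := string(make([]byte, 1<<20)) // a 1 MiB string
  small := tail(big)
  fmt.Println(len(small)) // 4, but about 1 MiB is still held alive behind it
}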

But let's go deeper here. Let's suppose we want a perfect GC that still preserves this property of compound objects. Unfortunately there are at least two significant complications, one in determining what memory is still in use and one in actually freeing the memory.

Today, with indivisible compound objects, Go can keep a simple map from memory addresses to the block of memory that each one keeps alive. The garbage collector doesn't have to care about the type of a pointer, just its address, because the address alone determines how much memory is kept alive. In a world where compound objects are divisible, we must somehow arrange for &b.smallThing and &b to be treated differently by the garbage collector, either by giving the garbage collector access to type information or by having them point to different addresses.

(When planning out a scheme for this, remember that structs can be nested inside other structs. Go makes such struct embedding idiomatic, and it's natural to put your embedded struct at the start of your new struct; in fact I think it's the accepted style for this.)
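
(A small demonstration of why that matters: an embedded struct at the start of another struct has the same address as the struct that contains it, so the address alone can't tell the two apart.)

package main

import (
  "fmt"
  "unsafe"
)

type Inner struct{ n int }

type Outer struct {
  Inner            // embedded at the start, in the usual style
  big [1 << 16]int // extra bulk so the difference would matter
}

func main() {
  var o Outer
  // A pointer to the whole Outer and a pointer to just its embedded
  // Inner are numerically identical.
  fmt.Println(unsafe.Pointer(&o) == unsafe.Pointer(&o.Inner)) // true
}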

The simplest way to deal with freeing only part of the memory of an object is for Go to have copying garbage collection. In such a hypothetical situation, you would just copy the int instead of the whole struct and everything would work out nicely. However Go does not have a copying GC today, so at a minimum freeing only part of a compound object leaves you with increased fragmentation of free memory; what was one big block of memory for the compound object gets broken up into fragments of free space with bits of still allocated space in various spots.

Unfortunately, in many cases this partial freeing doesn't work in practice with Go's current memory allocation system. To simplify, Go splits allocations up into 'size classes' and then performs block allocation within a given size class (if you want to read more, start here). This is highly efficient and has various other advantages, but it means that you simply can't turn an allocation from one size class into free memory for another size class. Effectively you free either all of it or none of it.

You could change Go's memory allocation system to make this possible, but it's very likely that such a change would have performance impacts (and possibly memory usage impacts). With Go's current non-copying GC, you might well find that it often simply wasn't worth it to free up part of an object, either in memory savings or in CPU usage. Certainly you would have more complex memory allocation and garbage collection, and those are already complicated enough.

(With a copying GC, your new copied object can be allocated in the right size class, since the GC definitely knows how big it is now.)

GoInteriorPointerGC written at 22:33:13; Add Comment

2019-03-03

Understanding a change often requires understanding how the code behaves

Yesterday I wrote about how really understanding diffs requires knowing their context too. An additional part of the problem is that you can't understand most changes in isolation, just as a change. To actually understand a change, you usually need to be able to compare the before and after behavior of the code, and in order to do that you must understand (or work out) what that behavior is.

This is a big part of why you have to reconstruct those before and after versions of the code in your head as you're reading a diff. You need the ability to think 'okay, the old version does X and the new version does Y', and to do that you usually have to put the change into its context and then understand that context.

When you're familiar with the codebase, you generally already know how either the old version or the new version behaves in addition to having a mental image of the code itself. I suspect that this makes it much easier to both construct a mental image of the other side of the change and then understand what it does, and obviously you're already starting from a state where you know one side so you only need to do half the work.

(I might have thought to put this into yesterday's entry if I had written it at a different time. There's a story there, but that's for another entry.)

Sidebar: One trap in understanding changes is non-local effects

One bane of understanding changes even when you understand the code is that some changes can have non-local effects that aren't obvious. This spooky action at a distance often comes about because of either programming language behavior or subtle API changes (including especially implicit APIs in how one part of the code knows about how another part behaves). One famous example of this is an innocent looking change to the Linux kernel that caused a security issue, where adding a variable initialization silently removed a NULL check.

Of course this is nothing too novel. Removing spooky action at a distance in general is a big thrust of modern programming languages, because we've learned that it's dangerous basically all of the time.

DiffsRequireContextII written at 15:44:03; Add Comment

Really understanding diffs requires knowing their context too

A lot of tools and sites like to present changes in the form of diffs (generally unified diffs these days, fortunately). Some of the time when I'm reading these diffs in various contexts (eg), I've felt that I was struggling to fully understand what I was seeing. Today, I had a realization about this that feels completely obvious in retrospect, namely that understanding diffs relies on implicit contextual knowledge of the surrounding code. Most of the time, to understand the change that a diff is making you need to mentally reconstruct an image of the code being modified, in much the same way that indirect manipulation UIs require you to construct mental context.

When you already know the code well, this is easy; you can readily 'see' both the before and the after version of the code. But if you don't really know the code and are reading it cold, this is a much harder thing to do. You don't have a clear mental image of what's going on and it's much harder to see and reason about the effects of the change, unless they're obviously and strictly local ones.

Unified diffs provide some obvious context, but generally not all of the function or other entity that the change is embedded in. Even when they do show everything, a diff is a compact encoding of a change and properly understanding the change requires imagining and to some extent understanding the before and after versions. I'm not sure there's any clear way to present this inline in text; I think the most comprehensible way of showing diffs will always be side by side (ideally in a graphical environment).

Realizing this makes me understand why I often struggle trying to make sense of a fair number of the diffs I read, because I'm often reading changes in codebases that I don't know (sometimes even in languages that I only partially understand). Probably I should see if I can easily use tools like kdiff3 when I'm picking my way through other people's code.

DiffsRequireContext written at 03:58:09; Add Comment

2019-02-18

When cloning git repos, things go faster if you start from a good base

Normally I get a copy of any git repo that I want by directly cloning from its upstream, whatever that is. Generally this goes fast enough and it guarantees that I have an exact copy of the repo, one that's not contaminated by anything else that I might have been doing to any other copy of the repo that I might already have. But generally I'm doing it on a 1G network link. Recently I needed a copy of Guenter Roeck's tree of hwmon updates for the Linux kernel on my home machine, where I had not previously cloned it. My first attempt was a direct clone from kernel.org, and let me tell you, it didn't go all that fast over my DSL link.

Like probably everyone else's Linux kernel tree, Guenter Roeck's tree is ultimately a descendant of the regular official Linux kernel tree. I already keep a clone of the regular Linux kernel tree at home (and at work), because I wind up referring to it often enough. So, I wondered, what if I started by making a copy of my kernel tree, then added Guenter Roeck's as an additional upstream and fetched it?
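
(Concretely, what I did was roughly the following; the local path is just whatever your existing kernel clone happens to be called:)

git clone ~/src/linux linux-hwmon    # fast local copy of my existing kernel tree
cd linux-hwmon
git remote add hwmon [...]/groeck/linux-staging.git
git fetch hwmon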

Well:

remote: Counting objects: 2269, done.
remote: Compressing objects: 100% (567/567), done.
remote: Total 2269 (delta 1804), reused 2061 (delta 1702)
Receiving objects: 100% (2269/2269), 1.30 MiB | 268.00 KiB/s, done.
Resolving deltas: 100% (1804/1804), done.

Let me assure you that cloning a full Linux kernel tree involves a lot more than 2269 objects and a lot more than 1.3 MiB of data.

Because of git's fundamental nature as a content-addressable data store, in theory this trick works on anything with significant object overlap, not just things that ultimately descend from the same source. In practice this is generally unimportant; almost everything you're going to want to pull this trick on has a common ancestor.

(The possible exception is if two separate groups are maintaining git repos that are converted from something else, such as a Mercurial repo. At least the objects should be identical between the two repos, and if you're lucky maybe the commits as well, despite these repos not having a common git ancestor.)

This feels like an obvious trick now that I've done it once, so I'm probably going to try to do it more. There are some variations of the trick one can probably perform, such as actively changing the 'origin' upstream over to the upstream you really want to be based on and pulling from. My one question about that would be how one cleans up branches (and perhaps tags) that are only found in the repo you started out by cloning from.

GitCloningBaseBenefit written at 21:34:09; Add Comment

2019-02-15

Accumulating a separated list in the Bourne shell

One of the things that comes up over and over again when formatting output is that you want to print a list of things with some separator between them, but you don't want this separator to appear at the start or the end, or at all if there is only one item in the list. For instance, suppose that you are formatting URL parameters in a tiny little shell script and you may have one or more parameters. If you have more than one parameter, you need to separate them with '&'; if you have only one parameter, the web server may well be unhappy if you stick an '&' before or after it.

(Or not. Web servers are often very accepting of crazy things in URLs and URL parameters, but one shouldn't count on it. And it just looks irritating.)

The very brute force approach to this general problem in Bourne shells goes like this:

tot=""
for i in "$@"; do
  ....
  v="var-thing=$i"
  if [ -z "$tot" ]; then
    tot="$v"
  else
    tot="$tot&$v"
  fi
done

But this is five or six lines and involves some amount of repetition. It would be nice to do better, so when I had to deal with this recently I looked into the Dash manpage to see if shell substitutions or something else clever could improve things. With shell substitutions we can condense this a lot, but we can't get rid of all of the repetition:

tot="${tot:+$tot&}var-thing=$i"

It annoys me that tot is repeated in this. However, this is probably the best all-around option in normal Bourne shell.
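
(Spelled out in full, the loop becomes:)

tot=""
for i in "$@"; do
  ....
  tot="${tot:+$tot&}var-thing=$i"
done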

Bash has arrays, but the manpage's documentation of them makes my head hurt, and using them results in Bash-specific scripts (or at least scripts specific to shells with support for arrays). I'm also not sure if there's any simple way of doing a 'join' operation to glue the array elements together with a separator between them, which is the whole point of the exercise.

(But now I've read various web pages on Bash arrays so I feel like I know a little bit more about them. Also, on joining, see this Stackoverflow Q&A; it looks like there's no built-in support for it.)
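
(For what it's worth, if you do go Bash-specific, the closest thing to a join that I've seen is the trick of abusing IFS together with "${arr[*]}", which only works for single-character separators. This is a hedged sketch rather than something I've used in anger:)

arr=()
for i in "$@"; do
  arr+=("var-thing=$i")
done
# "${arr[*]}" joins the elements using the first character of IFS
oldIFS="$IFS"; IFS='&'
tot="${arr[*]}"
IFS="$oldIFS"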

In the process of writing this entry, I realized that there is another option that exploits POSIX parameter expansion (pattern removal) after generating our '$tot' to strip off any unwanted prefix or suffix. Let me show you what I mean:

tot=""
for i in "$@"; do
  ...
  tot="$tot&var-thing=$i"
done
# remove leading '&':
tot="${tot#&}"

This feels a little bit unclean, since we're adding on a separator that we don't want and then removing it later. Among other things, that seems like it could invite accidents where at some point we forget to remove that leading separator. As a result, I think that the version using '${var:+word}' substitution is the best option, and it's what I'm going to stick with.

BourneSeparatedList written at 23:12:33; Add Comment

2019-02-06

Using a single git repo to compare things between two upstreams

The other day I wrote about hand-building an updated upstream kernel module. One of the things that I wanted to do as part of that was to compare the code of the nct6775 module I wanted to build between the 4.20.x branch in the stable tree and the hwmon-next branch in Guenter Roeck's tree. In my entry, I did this by cloning each Git repo separately and then running diff by hand, but this is a little awkward and I said that there was probably a way to do this in a single Git repo. Today I have worked out how to do that, and so I'm going to write it down.

To do this we need a single Git repo with both trees present in it, which means that both upstream repos need to be remotes. We can set up one as a remote simply by cloning from it:

git clone [...]/groeck/linux-staging.git

(I've chosen to start with the repo I'm theoretically going to be building from, instead of the repo I'm only using to diff against.)

Then we need to add the second repo as a remote, and fetch it:

cd linux-staging
git remote add stable [...]/stable/linux.git
git fetch stable

At this point 'git branch -r' will show us that we have all of the branches from both sides. With the data from both upstreams in our local repo and a full set of branches, we can do the full form of the diff:

git diff stable/linux-4.20.y..origin/hwmon-next drivers/hwmon/nct6775.c

We can make this more convenient by shortening one or both names, like so:

git checkout linux-4.20.y
git checkout hwmon-next

git diff linux-4.20.y.. drivers/hwmon/nct6775.c

I'm using 'git checkout' here partly as a convenient way to run 'git branch' with the right magic set of options:

git branch --track linux-4.20.y stable/linux-4.20.y

Actually checking out hwmon-next means we don't have to name it explicitly in the diff; the side we leave out defaults to HEAD, which is now hwmon-next.

We can also diff against tags from the stable repo, and we get to do it without needing to say which upstream the tags are from:

git diff v4.20.6.. drivers/hwmon/nct6775.c
git diff v4.19.15.. drivers/hwmon/nct6775.c

The one drawback I know of to a multi-headed repo like this is that I'm not sure how you get rid of an upstream that you don't want any more. At one level you can just delete the remote, but that leaves various things cluttering up your repo, including both branches and tags. Presumably there is a way in Git to clean those up and then let Git's garbage collection eventually delete the actual Git objects involved and reclaim the storage.
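
(My current belief, which I haven't carefully verified, is that the cleanup would look roughly like this:)

git remote remove stable      # also drops stable's remote-tracking branches
git branch -d linux-4.20.y    # local branches made from it stay until deleted
git tag -d v4.20.6 v4.19.15   # fetched tags also stay until deleted
git gc --prune=now            # then gc can reclaim the unreferenced objects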

(One can do more involved magic by not configuring the second repo as a remote and using 'git fetch' directly with its URL, but I'm not sure how to make the branch handling work nicely and so on. Setting it up as a full remote makes all of that work, although it also pulls in all tags unless you use '--no-tags' and understand what you're doing here, which I don't.)

Looking back, all of this is relatively basic and straightforward and I think I knew most of the individual bits and pieces involved. But I'm not yet familiar and experienced enough with git to confidently put them all together on the fly when my real focus is doing something else.

(Git is one of those things that I feel I should be more familiar with than I actually am, so every so often I take a run at learning how to do another thing in it.)

GitCompareAcrossUpstreams written at 23:10:50; Add Comment

2019-01-31

What getopt package I use for option handling in my Go programs

One of Go's famous issues is that the standard library's flag package doesn't handle command line options in the normal Unix getopt style. When I started with Go this quite irritated me, and then later I gave up and decided to use flag anyway. Since then, I've changed my mind again and now all of my recent Go programs use a third party package that gives me more Unix-style option handling. There are many of these packages; the one that I settled on is github.com/pborman/getopt/v2.

(At one point I played around with github.com/spf13/cobra, but apparently it didn't stick. I think it was too complicated for the sort of small programs I usually write, which don't normally have sub-commands and so on.)

What I like about pborman/getopt is that it's straightforward to use and it gives you Unix-style option handling with no fuss. I use the form of the API that binds options to existing variables, so my code tends to look like:

getopt.Flag(&oneline, 'l', "List one machine per line")
getopt.FlagLong(&help, "help", 'h', "Print this help")
[...]

The exception to this is counters, such as the usual -v:

vp := getopt.Counter('v', "Increase verbosity for some operations")
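
(Put together into a minimal program it looks roughly like this. The option declarations are the ones from above; the getopt.Parse() and getopt.Args() calls around them are how I remember the rest of the API working, so check the GoDoc if you're following along:)

package main

import (
  "fmt"

  "github.com/pborman/getopt/v2"
)

func main() {
  var oneline, help bool

  getopt.Flag(&oneline, 'l', "List one machine per line")
  getopt.FlagLong(&help, "help", 'h', "Print this help")
  vp := getopt.Counter('v', "Increase verbosity for some operations")

  getopt.Parse()

  fmt.Println("one per line:", oneline, "help requested:", help)
  fmt.Println("verbosity:", *vp)
  fmt.Println("remaining arguments:", getopt.Args())
}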

Generally I've found the API straightforward, and I like that it's relatively simple to add some additional help text after the flags using getopt.SetUsage() and so on. My programs not infrequently have some extra usage information, but not enough to call for a full separate option to dump it out; tacking it on the end of the flags is my usual approach.

(This is where I point to the GoDoc for a full API reference.)

A lot of people seem to like the Kingpin package; I know that many Prometheus programs use it, for example, as do Square's certigo and acmetool. If I had complicated needs I would probably look into it.

(Like Cobra, Kingpin handles sub-commands with sub-flags and so on. You can see an example of that in certigo.)

PS: Now that I look at GoDoc, this package doesn't seem to be very popular nowadays. Things like cobra, pflag, and Kingpin are reported as being used by many more packages. If you're afraid of package rot and other problems, this may influence your choice. On the other hand, I tend to think that Unix getopt style option handling probably doesn't need much further development.

GoMyGetoptChoice written at 23:54:29; Add Comment

2019-01-28

Go 2 Generics: some features of contracts that I like

A major part of the Go 2 generics draft design is its use of contracts, which are essentially formalized function bodies, to specify constraints on what types generic functions accept. On the one hand I think that contracts today are too clever, but on the other hand I think that interfaces are not the right model for type constraints. Although the existing first Go 2 generics draft is probably dead at this point, since it's clear that the community is split over it, I still feel like writing down some features of contracts that I like (partly because I'd like to see them preserved in any future proposal).

The first thing is that contracts are effectively an API and are explicitly decoupled from the implementation of generic functions. This is good for all of the reasons that an API is good; it both constrains and frees the implementer, and lowers the chances that either users or implementers (or both) are quietly relying on accidental aspects of the implementation. As an API, in the best case contracts can provide straightforward documentation that is independent from a potentially complex implementation.

(I think it would be hard to provide this with the current version of contracts, due to them being too clever, but that's a separate thing.)

Because contracts have an independent existence, multiple things can all use the same contract. Because they're all using the same contract, it's guaranteed that they all accept the same types and that a type that can be used with one can be used with all. This is directly handy for methods of generic types (which might be a bit hard otherwise), but I think it's a good thing in general; it makes it both clear and easy to create a related group of generic functions that all operate on the same things, for example.

(In theory you can do this even if each implementation has a separate expression of the type constraints; you just make all of the separate type constraints be the same. But in practice this is prone to various issues and mistakes. A single contract both makes it immediately clear that everything accepts the same type and enforces that they do.)
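
(As a hedged sketch in the draft's syntax, from memory and so possibly off in the details: two generic functions sharing the one stringer contract are guaranteed to accept exactly the same set of types.)

contract stringer(x T) {
  var s string = x.String()
}

func Stringify(type T stringer)(s []T) (ret []string) {
  for _, v := range s {
    ret = append(ret, v.String())
  }
  return ret
}

// Join is constrained by the same contract, so any type that works
// with Stringify works with it too. (Assume the standard strings
// package is imported.)
func Join(type T stringer)(s []T, sep string) string {
  return strings.Join(Stringify(s), sep)
}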

The final thing I like about contracts is that they can explicitly use (and thus require) struct fields. This is a somewhat contentious issue, but I don't like getters and setters and I think that allowing for direct field access is more in the spirit of Go's straightforward efficiency. Perhaps a sufficiently clever compiler could inline those little getter and setter functions, but with direct struct fields you don't have to count on that.

(I also feel that direct access to struct fields is in keeping with direct access to type values, which very much should be part of any generics implementation. If you cannot write Min() and Max() generic functions that are as efficient as the non-generic versions, the generics approach is wrong.)

Go2ContractsLike written at 23:39:20; Add Comment
