Wandering Thoughts

2022-05-07

Solving a problem I had with the Unix date command in the right way

I have a personal shell script that for reasons beyond the scope of this entry uses the Unix date command to determine the current day, month, and year. Recently I noticed that the shell script was sometimes not working right on our Ubuntu 20.04 and 22.04 servers, but was working right on 18.04, and tracked it down to the fact that doing 'ssh server date' produced output in a different format on 18.04 than on the other two.

; ssh server1804 date
Sat May  7 20:56:00 EDT 2022
; ssh server2204 date
Sat May  7 08:56:00 PM EDT 2022

If I actually logged in to a 20.04 or 22.04 system and set up my normal environment, the date output reverted to the 18.04 traditional Unix format. Obviously this is some Unix (Linux) locale fun and games.

My first instinct was to spend time figuring out what magic environment variable or system setting was different between the machines, so that the date format would come out right and my script could parse it properly again. Then I metaphorically bopped myself on the forehead, because why was I parsing the default output of date at all? The date command has supported specifying its output format for a long time, so if what I wanted was the day, the month, and the year, I should just change the script to have date output exactly that and nothing more:

dout="$(date +'%e %b %Y')"

Figuring out what's wrong with locales (or anything else) can be fun and interesting, but sometimes the right answer is to go around the problem entirely. I probably should remember and apply this more often than I do.

(Of course I could go one step further to just run date three times, once for each of the day, the month, and the year. But I reflexively twitch at things like that, even if it will never matter for this particular personal script.)

PS: The hard-to-believe reason that this script didn't use a date format from the start is that it's old enough to predate versions of date that supported this. It actually hasn't changed all that much from the oldest-dated version I could easily find in my collection of stuff.

DateFormatRightWay written at 21:51:58

2022-04-24

Some things that make shell scripts have performance issues

Yesterday I mentioned that one version of my shell script that probably shouldn't be a shell script had performance problems. Both versions of this script actually make a useful illustration of some things that make shell scripts slow and, along with it, some of the things you may have to do to make them fast.

One thing that does not make shell scripts slow is the basic Unix commands themselves that you use in shell scripts. Those Unix commands generally perform pretty well, and their processing speed is probably close to the fastest you could get if you wrote what they're doing in your language of choice. Your program is unlikely to improve on the sorting performance of sort, the text transformation performance of sed, and so on. And the shell itself generally performs internal things more than fast enough for most cases. Instead, what causes shell scripts problems is the cost of starting separate programs. Sed may transform text very fast and sort may sort data very fast, but starting sed or sort is comparatively expensive. The more times you start programs and the more programs you have to start for each thing you want to do, the slower your shell script will run.

(Programming languages that do everything internally may do each individual thing slower, but they don't pay the costs of starting however many external programs that your shell script needs.)

This causes two performance issues for shell scripts. The first, obvious issue is when you have to string together a whole sequence of Unix commands to get some result that would be straightforward in another programming language with better text manipulation, more ability to read files, and so on. In turn, this pushes you to write shell scripts that use convoluted means to do things, simply to keep down the number of programs being started. These convoluted means are faster than the straightforward option but make your script less readable. Shells try to deal with this by making more commands built in and by adding things like (integer) arithmetic so that you don't have to run external programs for common operations.
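
As a small illustration of the sort of thing I mean (these lines are made up for this entry, not taken from any real script), compare starting an external program per operation with using shell builtins:

# Starts an external program each time:
i=$(expr "$i" + 1)
base=$(echo "$file" | sed 's/\.log$//')

# The same operations with shell builtins; nothing external is started:
i=$((i + 1))
base="${file%.log}"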

The second issue is that if your shell script deals with multiple things (for example, multiple entries in a Linux cgroup hierarchy), it's increasingly expensive to process them one by one because you repeatedly pay the program startup cost. The less you can do on a per-item basis, the better your script will perform, especially as the number of items grows. This leads to restructuring your script to try to do as much 'stream' processing as possible, even if this results in a peculiar program structure and often peculiar intermediate steps; alternatively, you can rewrite things in a more awkward way that maximizes your use of shell builtins (where you don't pay a per-program cost). In a language without this per-item penalty, a program written in the natural style of processing one item at a time will still perform well.
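
For example, here is the same (simplified) name transformation done per-item and done as a stream; neither of these is the real memdu code:

# $names holds a whitespace-separated list of cgroup names.
# Per-item: one sed per name, so the cost grows with the number of items.
for name in $names; do
    echo "$name" | sed 's;^user\.slice;/u;'
done

# Streamed: a single sed processes every name.
printf '%s\n' $names | sed 's;^user\.slice;/u;'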

Related to this, you can easily write a shell script that appears to perform well enough in your test environment but has clear problems when run for real in environments with significantly more items. It's not necessarily obvious how many per-item programs are too many (or how many items your real environment will have), which makes this a hard issue to prevent in advance. Do you go out of your way to make your program do complex stream processing, possibly with no need in the end, or do you write the straightforward version now only to perhaps throw it away later? There's no good answer.

PS: One traditional way to deal with this in shell scripts is to lean on some program to assist your script that can swallow as much of the work into itself as possible. Awk is one common option chosen for this.
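
As a sketch of that pattern (an illustration, not memdu), here is one awk process doing all of the per-line summing and formatting that would otherwise be a shell loop:

# One ps and one awk no matter how many processes there are.
ps -eo rss=,comm= |
    awk '{ total[$2] += $1 }
         END { for (c in total) printf "%10.1fM  %s\n", total[c]/1024, c }' |
    sort -rn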

PPS: I don't like to admit it because I don't really like Perl, but this is one area where Perl feels like a pretty natural fit, partly because a lot of its basic operations are fairly close to the sort of manipulations you'd do in a shell script. My 'memdu' scripts might well look pretty much like their current state in Perl, just with better structure and performance, and I suspect that the transformation wouldn't be too hard if I hadn't forgotten all of the Perl I once knew.

ShellScriptsAndSpeed written at 22:08:51

2022-04-23

The temptation of writing shell scripts, illustrated

It's an article of faith in many quarters that you shouldn't write anything much as a shell script and should instead use a proper programming language. I generally agree with this in theory, but recently I went through a great illustration of why this doesn't necessarily work out for me in practice, as I wrote (and then rewrote) a shell script that really should be a program in, say, Python.

A systemd-based Linux system can be set to track how much memory is being used by each logged in user and by system services, and we configure many of our machines to do so. Systemd calls this MemoryAccounting and it's actually implemented using Linux cgroup memory accounting (there's also the cgroups v1 memory accounting). Because it's implemented with cgroups, the actual memory usage is visible under /sys/fs/cgroup and you can read it out directly by looking at various files. Recently we had an incident where I wound up wanting a convenient way to get a nice view of per-user and per-service memory usage, and it occurred to me that I could present this in the style of nice, easily readable disk space usage:

  43.7G  /
  39.5G  /u
  35.6G  /u/<someone>
   4.4G  /system
   1.9G  /system/auditd
 601.9M  /u/cks
 277.3M  /system/cron
 259.0M  /system/systemd-journald
  89.3M  /init
[...]

This started out looking very easy. All I had to do was read some files and report the contents. Well, and humanize the raw number of bytes to make things more readable. And transform the names of the directories where the files were, to give things like '/u/cks' instead of 'user.slice/user-NNN.slice' (which requires looking up the login name for Unix UIDs). And skip things with no memory usage. And handle both cgroup v1 (used on most of our machines) and cgroup v2 (now used on Ubuntu 22.04). And maybe descend several levels deep into the hierarchy to get interesting details; for instance, users may have multiple sessions with widely differing memory usage. And if we're going to descend several levels deep, perhaps we should skip lower levels that have the same usage as their parent.
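
To give a concrete idea of the raw data involved, here is a sketch of the basic reads (cgroup v2 paths; cgroup v1 exposes memory.usage_in_bytes instead of memory.current, and the UID here is just an example):

# Total memory use of all user sessions and of all system services.
cat /sys/fs/cgroup/user.slice/memory.current
cat /sys/fs/cgroup/system.slice/memory.current
# Per-user slices are named by UID; 'id -nu 1000' maps a UID back to a login.
cat /sys/fs/cgroup/user.slice/user-1000.slice/memory.current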

However, I didn't start out realizing all of these needs and nice things right away. I started out with something very simple that could just give a few easy to get numbers for user.slice, system.slice, and the root of the hierarchy. Extending that to system slices and user slices looked simple, and it wasn't too hard to transform UIDs into login names with the id command, and so on and so forth as one issue after another surfaced, including deciding how to look deeper into bits of the hierarchy. Taken one by one, almost every issue looked simple to solve on its own in a shell script, but the end result of putting all of these together is a shell script that almost certainly would have been easier to write in a programming language.

My rewrite came when I realized that I could turn the problem of looking through the hierarchy inside out, by using find to walk the entire cgroup hierarchy looking for anything that had memory usage. This gave me a whole new set of fun name transformation problems, and also showed another problem of shell scripts, which is that the result is now too slow because it has to keep invoking sed and other things on tons of names. But once again, each step toward the end result looked simple and approachable as just another bit of shell or sed mangling.
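
The core of the find-based version looks roughly like this (a simplified cgroup v2 sketch, not the actual memdu script, and without the humanizing or the name prettying):

find /sys/fs/cgroup -name memory.current -print |
while read -r f; do
    bytes="$(cat "$f")"
    [ "$bytes" -eq 0 ] && continue
    dir="${f%/memory.current}"
    echo "$bytes ${dir#/sys/fs/cgroup}"
done | sort -rn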

Meanwhile, starting on what felt like a simple thing in Python, Go, or any of the other alternatives looked like a much bigger investment of my effort. Python, Go, and so on need more structure and often don't have quite as simple and convenient methods for doing various shell-like things. The problem wasn't obviously too big for a shell script when I started writing the first bits of what I've named 'memdu', so I didn't want to go all the way to a Python program, and anyway shell scripts are more 'lightweight' than Python, never mind Go. And so I slid into the temptation of shell scripts, where every individual step looks easy enough but at the end, I probably would have been better off starting out in something else.

(Hopefully I will take all of this as a learning experience and motivate myself to rewrite the script in Python. But on the other hand, the resulting shell script is working and I'm lazy.)

ShellScriptTemptation written at 22:32:44

2022-03-27

Some thoughts on Go's unusual approach to identifier visibility

A while back I read Jesse Duffield's Go'ing Insane Part Two: Partial Privacy (eventually via). In this installment, Duffield is frustrated by how Go controls the visibility of identifiers outside the package:

Unlike in other languages where privacy is controlled with private or public keywords, Go marks privacy with capitalisation. [...]

I suspect that this is an issue in programming language design that people have strong opinions on, like Python's use of significant indentation; some people will find it appealing or at least be fine with it, while others will have strong negative reactions. I'm in the former camp. I don't object to this Go design decision and I find that I like some things about it.

The obvious nice thing about Go's approach is that you're never in doubt about whether an identifier is public or not when you read code. If it starts with upper case, it's public; otherwise, it's package private. This doesn't mean that a public identifier is supposed to be used generally, but at least it's clear to everyone that it could be. In other languages, you may have to consult the definition of the identifier, or perhaps a section of code that lists exported identifiers.
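
A trivial illustration of the rule (my own example, not one from Duffield's article):

package things

// MaxSize is exported; any package that imports 'things' can use it.
const MaxSize = 128

// defaultSize starts with lower case, so it is private to this package.
const defaultSize = 16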

(The obvious drawback of Go's approach is that you can't capitalize the start of identifier names just because you feel that they read better that way, or because you're talking about something that is normally written capitalized. So you have to use names like 'ipv4_is_on' instead of the capitalized 'IPv4' version.)

One of Jesse Duffield's specific issues is that Go's decision here makes changing identifiers from private to public (or vice versa) a quite noisy change in your own package, since the name changes and you have to change all uses of it. One of my reactions to this is that this is a good thing. Making something public (or private) is a change in your API, and changes in your API should often be at least annoying in order to discourage you from doing them. A one-line diff for an API change feels too modest and unremarkable for such a significant thing.

(At the level of making the change, modern IDEs support 'rename identifier' operations and so the actual change is not irritating for most people. Gopls helps make this work even for people using editors like Emacs or Vi(m).)

PS: That you have to specifically annotate the JSON key names you want is, in my opinion, also not a drawback. It avoids the temptation to bias the Go names of struct fields over to what is normal and usual in JSON. JSON is a different language with different naming conventions than Go (and your Go code base).
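
A made-up example of what I mean; the Go field names stay in Go's style while the struct tags spell out what the JSON side calls them:

type Request struct {
    UserName string `json:"user_name"`
    RetryMax int    `json:"retry_max"`
}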

GoOnIdentifierVisibility written at 22:54:38

2022-03-16

People might want to think about saving a copy of Go 1.17

The news of the time interval is that Go 1.18 has been released. There are a number of famous big changes in Go 1.18; obviously generics, but I think that official fuzzing support and workspaces are likely to have a bigger impact in the near future (certainly I think that more people should use fuzzing than should touch generics right now). But there's another important change, which is that module mode is now mandatory, although the release notes don't directly say that.

(Dropping support for GOPATH based local builds has been coming for some time, although it was pushed back one release from what was initially planned, from Go 1.17 to Go 1.18. I noted this when Go 1.17 was close to release, but I don't think the Go developers ever said anything about why the change happened.)

This means that Go 1.17 is the last Go release that can be used to build non-module-aware programs from a local source tree. For now, Go 1.18 continues to be able to build non-module-aware programs from a supported, more or less public repository with 'go install ...@latest', but I suspect that support for that is not long for this world, and of course you need everything to be accessible. See what you can and can't build in Go's module mode for more.

In theory you can create a go.mod file for every program you have that's still without it (hopefully they're third party programs; if not, you need to update things). In practice, it's been my experience that creating a working go.mod file can take some work and fiddling for various reasons beyond the scope of this entry (although perhaps that's gotten better recently, and I haven't looked to see if workspaces help here). Also in theory, people should already have been adding go.mod files to their programs and will now have even more reason to do that; in practice, there are any number of perfectly good old Go programs that you may still find valuable that their authors consider 'done'.

(A program that previously gave me problems with a manual modularization did build with 'go install ...@latest', although I'm not sure it used the same versions of other packages.)

Because of all of this, you might want to consider planning to save a copy of Go 1.17. You probably don't want to freeze it just yet, since there may be future patch releases of Go 1.17 before it drops out of support when Go 1.19 is released, but at least you can plan for it and remind yourself to set it aside and not discard it now that 1.18 is out. My own view is that you should plan to save already compiled binaries; there's no guarantee that future versions of Go will be able to build Go 1.17 or that the Go people will keep the pre-built binaries around forever.
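
In concrete terms this can be as simple as the following; the directory and the specific 1.17.x patch release here are just illustrations (you'd want whatever the final 1.17 patch release turns out to be):

mkdir -p $HOME/sw/go1.17
curl -LO https://go.dev/dl/go1.17.8.linux-amd64.tar.gz
tar -C $HOME/sw/go1.17 -xzf go1.17.8.linux-amd64.tar.gz
# This leaves the release as $HOME/sw/go1.17/go/bin/go, which can stay there.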

(Recent versions of Go require the Go source tree to have some go.mod files in strategic places. Old versions of Go don't have them, so you can run into problems. I care about old versions of Go because I sometimes use a machine running FreeBSD 10, which was last officially supported in Go 1.12, although recent versions of Go do seem to still work on it.)

There are limits to how useful this might be on some platforms, such as OpenBSD, but on mainstream platforms it's very likely that a Go 1.17 binary that works today will keep working even with future versions of the operating system. Fortunately modern Go releases are essentially indifferent to where they're put, and if for some reason you do care (for example for completely reproducible binaries, although the situation covered in Reproducing Go binaries byte-by-byte may have changed since then) you can always rebuild Go 1.17 from source with itself.

Go117SaveACopy written at 22:11:14

2022-02-13

Go generics: the question of types made from generic types and type sets

In yesterday's entry on the 'any' confusion in Go generics, I brought up the fact that you can't create types (or variables of types) using generic types that are instantiated with type sets. That sounds abstract, so let's make it concrete:

type Result[T any] []T
type Toable[T any] interface {
  Fred(s string) T
  Bar(i int) T
}
type Addable interface {
  ~uint32 | ~float64
}

// The following are both invalid
var r2 Result[Toable]
var r3 Result[Addable]

Of course, you also can't declare plain variables of type sets:

// Also both invalid
var a Toable
var b Addable

There are a number of people who would like a future version of Go to support this. However, there are two important and very open questions about it: what it would mean, and how it would be implemented.

In the abstract, what b means is probably straightforward; most people would assume that it's a variable that can hold either a uint32 or a float64, since those are the two underlying types allowed by the type set Addable. What a means is less clear, since Toable is a parameterized type set. The most likely intended meaning is that it can hold a value of a type that satisfies Toable for some type T, which in this case is to say that it has 'Fred(s string)' and 'Bar(i int)' methods that return the same type.

The first open question is whether either a or b (or both) hold interface values, concrete values, or some new third type of value that would have to be invented and added to Go in a coherent way. In the case of a, it certainly seems natural for it to hold interface values, but people might not be happy if b held interface values, with the runtime indirection that that implies. But on the other hand, it seems that you'd need some sort of type assertion to extract concrete values even from b.

The next question is what you can do with a and b. For instance, consider:

// in a different package, which we'll call otherpkg
var C Addable

// in our package
b = b + otherpkg.C

Is this valid? If it is, how can it even work? What happens if at runtime the underlying type of b is float64 and the underlying type of C is uint32?

We get into equally exciting problems with a:

var e int
d := a.Fred("test")
e = a.Bar(10)

What is the type of d and how is it determined? What happens if the actual type of a at runtime is Toable[string], and so the second method call is attempting to assign a string to an int?

One answer is that you can't do anything with a and b except type assert them into specific concrete types; you can't call methods on them (even if you know they have the methods because of their type set) or do arithmetic (even if you know the arithmetic is allowed in general because of their type set). I suspect that people would find this unsatisfactory. However, other answers seem likely to substantially increase the potential for runtime panics in ordinary looking Go code, which is also not a good thing.

This brings us to the implementation questions, especially since Go is a deliberately simple language that attempts to be straightforward and obvious in most of its concrete types (including interface values, which have a straightforward two pointer representation). Go prefers fixed size values with simple representations (although they can hide internal complexity behind pointers, as maps and channels do). It's difficult to see how this could be readily achieved without simply making all such type set values be interfaces and then probably forbidding any operations on them except type assertions.

If these type set values are to be represented as their concrete types, there are some big questions about storage allocations. For instance, consider the backing array for 'r3', which is a slice that holds either uint32 values (which take up 4 bytes) or float64 values (which take up 8 bytes). Does the backing array use an element size that can hold the largest possible element? How does anything tell which element is which type? Do we have to invent a new internal Go representation that contains a type tag of some sort? And people can create type sets that contain large structs among their options.

(One way this might happen naturally is if you're creating a generic type set that includes types from other packages, which you're treating as opaque. If they happen to be implemented as fixed size arrays or as structs, you could get a surprise.)

Next, consider a version of Addable that also allows ~string. Now r3 may contain a mixture of 8-byte elements, some of which are pointers (the string values) and some of which are not (the float64 values). How does the Go garbage collector sort that one out? It will need to have (and access) type information on a per-element basis.

(Interface values are simpler for the garbage collector and for the Go runtime in general because they have a uniform representation, regardless of what type of interface they are.)
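
To make the contrast concrete, here is a small piece of valid, non-generic Go that shows both the varying element sizes and the uniform interface value size (the numbers are for a 64-bit platform):

package main

import (
    "fmt"
    "unsafe"
)

func main() {
    var u uint32
    var f float64
    var s string
    var i interface{}
    // Concrete types in a type set can have different sizes ...
    fmt.Println(unsafe.Sizeof(u), unsafe.Sizeof(f), unsafe.Sizeof(s)) // 4 8 16
    // ... while every interface value has the same two-word representation.
    fmt.Println(unsafe.Sizeof(i)) // 16
}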

There are almost certainly no answers here that don't make Go a more complicated language with a more complicated implementation. To the extent that some useful features might be extracted from allowing type sets to be used as types, I think that they should be built with other new mechanisms, mechanisms that specifically address the problems they're intended to solve.

(I believe that a popular problem is a desire for what Rust calls 'enum variants with associated data'. I think that people who want this in Go would be better served by specifically designing something for it, although I suspect it may be too complicated to make the Go developers happy.)

GoTypesOfTypeSetsQuestion written at 23:09:09

2022-02-12

The 'any' confusion in Go generics between type constraints and interfaces

Any system of generic types, such as Go will have in Go 1.18, needs some way to specify constraints on the specific types that generic code can take. Go uses what it calls "type sets", which reuse Go's existing interface types with some extensions. However, this reuse creates a potential for confusion, one that I've already seen come up in some articles about Go generics such as this one (via).

Suppose that you have some generic types and code (and interfaces):

type Result[T any] []T
type Fredable[T any] interface {
   Fred(s string) T
}

type Barable interface {
   Bar() uint64
}

type Addable interface {
    ~uint | ~float64
}

The Barable type is an interface type that can be used today in Go 1.17, but it's also usable as a generic type constraint in a way that's different from just having a function that takes arguments with the interface type:

func DoBar[T Barable](a, b T) uint64 {
   return a.Bar() + b.Bar()
}

(Among other things, this generic function requires a and b to have the same type when it's instantiated to a specific type.)
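
As a sketch of what this means in use (MyCounter and OtherCounter are made-up types for illustration):

type MyCounter uint64

func (m MyCounter) Bar() uint64 { return uint64(m) }

var x, y MyCounter
var _ = DoBar(x, y) // fine; both arguments have the concrete type MyCounter

// DoBar(x, OtherCounter(0)) would not compile even if OtherCounter had its
// own Bar() method, because T has to resolve to a single specific type.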

Now consider the following set of declarations using the Result generic type that creates a slice of the given type:

var r1 Result[Barable]  // okay
var r2 Result[Fredable] // error
var r3 Result[Addable]  // error

var r4 Result[any]      // okay. What?

The type of r1 is a slice of Barable interface values. Barable is a regular interface type and you can declare slices of interface types, which contain interface values of that (interface) type. You cannot declare r2 or r3 because, although they are both declared using the type and interface keywords, neither Fredable nor Addable is a normal interface type. They're only usable as type constraints, and the Result generic type needs a type, not a type constraint.

The potentially confusing case is the last one, 'Result[any]'. Right now, 'any' is new syntax that generally only shows up in articles about Go generics, as a type non-constraint that means 'any type is acceptable'. However, it's an alias for 'interface{}', the universal interface. Used in or as a type constraint, it means that there is no constraint on the types that the generic type can be used with (you can make a slice of anything). Used as a regular type, though, it means what it usually means, so r4 is a slice of 'interface{}' values (and you'll be able to add anything to it, because anything can be converted to an 'interface{}' value).
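
To make the last point concrete, code in a function can then do this:

// Any value at all can be added to r4, because every Go value can be
// converted to an 'any' (interface{}) value.
r4 = append(r4, 1, "hello", 3.5)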

My personal view is that it might be simpler if 'any' was only accepted in type constraints and couldn't be used as a regular interface type. This is already the case with the 'comparable' type constraint, which doesn't map naturally to something in normal, non-generic Go. If 'any' could only be used in type constraints, I think 'interface{}' should still be accepted as equivalent to 'any' there. But I understand why the Go developers did 'any' this way, especially since the type sets approach requires 'interface{}' to be equivalent to it.

As a side note, because Fredable takes a type parameter and can be instantiated to become a specific type, we can do a version of this with additional work. We can write:

var r5 Result[Fredable[string]]

However, there's no way to use Addable this way. The Go compiler's error messages will tell us this, because we get a different one in each case. Currently they are:

cannot use generic type Fredable[T any] without instantiation
interface contains type constraints

(The error messages might change before Go 1.18 is released.)

The type of r5 is also different and more specific than you might want, although there is a whole question of what 'Result[Addable]' or 'Result[Fredable]' would really mean if Go accepted it. A full discussion of that is for another entry.

GoGenericsTypeInterfaceIssue written at 22:14:44

2022-02-06

Checking out a Git branch further back than the head

Famously, if you want to check out a repository at some arbitrary commit back from the head of your current branch, you normally do this with just 'git checkout <commit>'. I do this periodically when making bug reports in order to verify that one specific commit is definitely the problem. Equally famously, this puts you into what Git calls a 'detached HEAD' state, where Git doesn't know what branch you're on even if the commit is part of a branch, or even part of 'main'.

It's possible to move a branch (including 'main') back to an older commit while staying on the branch. This avoids Git complaints about being in a detached HEAD state and makes 'git status' do useful things like report how many commits you are behind the upstream tip. As far as I know so far, the way you do this is:

git checkout -B main <commit>

As 'git status' will tell you, you can return to the tip from this state by doing 'git pull'. Equivalently, you can do 'git merge --ff-only origin/main', which avoids fetching anything new from your upstream. This second option gives away the limitation of this approach.

The limitation is that you can only do all of this if you don't have any local commits that you rebase on top of the upstream. If you do have local commits, I think that you want to live with being in the detached HEAD state unless you like doing a bunch of work (and I'm assuming here that you can live without your local changes; otherwise life gets more complicated). Doing all of this back and forth movement of what 'main' is smoothly relies on your normal main being the same as origin/main, and that's not the case if you're rebasing local commits on top of origin/main every time you pull it.

(Git has a syntax for 'N commits back from HEAD' as part of selecting revisions (also), but for almost everything I do what I care about is a specific commit I'm picking out of 'git log', not a number of commits back from the tip.)

It's a little bit annoying that you have to specify the branch name in 'git checkout' even though it's the current branch name. As far as I know, Git has no special name you can use for 'the current branch, whatever it's called', although it does have a variety of ways of getting the name of the current branch. If you're scripting this 'back up to a specific commit on a branch', you can use one of those commands, but for use on the fly I'll just remember to type 'main' or 'master' (depending on what the repository uses) or whatever.
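
For instance, here is a tiny hypothetical script (call it 'git-backto') that moves the current branch back to a given commit without hard-coding the branch name:

#!/bin/sh
# Usage: git-backto <commit>
branch="$(git branch --show-current)"
git checkout -B "$branch" "$1"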

(This is one of the Git things that I don't want to have to work out twice. Although Git being Git, it may in time acquire a better way to do this.)

GitCheckoutBranchBack written at 22:59:39

2022-02-05

Go 1.18 won't have a 'constraints' package of generics helpers

If you read a number of writeups of Go's new support for generics, which is expected to be in Go 1.18, you'll find them mentioning a constraints package in the standard library (for example, Generics in Go). The constraints package is there to give people a pre-defined set of names for various type constraints, like 'only integer types'. Go itself defines two built-in constraints, 'any' (which allows anything) and 'comparable' (which allows anything that can be compared for equality); everything else you either have to write yourself using type sets syntax or get from constraints.
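
As an illustration of the sort of thing involved, here is a hand-written 'integer types only' constraint in type set syntax, along with a use of it (the x/exp/constraints package builds up its Integer in the same general way):

type Integer interface {
    ~int | ~int8 | ~int16 | ~int32 | ~int64 |
        ~uint | ~uint8 | ~uint16 | ~uint32 | ~uint64 | ~uintptr
}

func Min[T Integer](a, b T) T {
    if a < b {
        return a
    }
    return b
}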

Or at least that was the state of affairs until very recently, when the Go developers decided that they didn't want to commit to the constraints package this early in the use of generics. This was discussed in issue #50792, constraints: move to x/exp for Go 1.18, filed by Russ Cox. To quote from the start of that issue:

There are still questions about the constraints package. To start with, although many people are happy with the name, many are not. On top of that, it is unclear exactly which interfaces are important and should be present and which should be not. More generally, all the considerations that led us to move slices and maps to x/exp apply to constraints as well.

As far as I know, the current version of x/exp/constraints is the same as what was in the Go development source tree until very recently.

On the whole I think this is a good change, because the Go developers take their compatibility promises seriously. Once something is in the Go standard library, its name and API are frozen, so getting both of those right is important. And regardless of what we like to think, people are not always great at either naming or APIs. Sometimes it takes experimentation, actual use, and experience to sort it out, and Go has already had issues with this (most famously context, which was added to the standard library after various APIs that really should take a Context parameter).

The negative side of this change is that it makes life a bit harder and less obvious for people who want to do things with generics that go beyond just a 'comparable' constraint. In turn this seems likely to reduce the number of people doing this, which means that we'll get less such experimentation and use of generics. By making it harder for people to go beyond 'comparable', the Go developers may have made it a self-fulfilling prophecy that people mostly don't. Perhaps this is a good thing, if you want to minimize use of generics in general.

(I'm not sure that the Go developers genuinely like generics, as opposed to just accepting that they're necessary in some form.)

This does show that the Go developers can change their mind about things quite late, since this change comes after the second beta release of Go 1.18. The Go 1.18 version of generics won't really truly be finalized until Go 1.18 is officially released (although major changes seem unlikely at this point because of the amount of code changes that would be required for anything except perhaps disabling generics entirely).

Go118NoConstraintsPackage written at 22:10:25

2022-01-31

Git 2.34 has changed how you configure fast-forward only pulls and rebasing

I'll start with the conclusion and give you the story afterward. Before Git 2.34, if you were tracking upstream repositories and perhaps carrying local changes on top, it was sensible to configure 'pull.ff only' globally (in your ~/.gitconfig) to stop 'git pull' from warning you and then set 'pull.rebase true' in repositories that you had local changes in. As of Git 2.34, this causes a plain 'git pull' to abort in any repository that has local changes. The simplest fix is to also set 'git config pull.ff true' in each repository where you've set pull.rebase. This may become unnecessary in some future Git version, because the Git 2.34 behavior may be a bug.
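
In concrete command form, the combination I describe is:

# Globally: a plain 'git pull' will only ever fast-forward.
git config --global pull.ff only
# In each repository where you carry local changes on top of the upstream:
git config pull.rebase true
git config pull.ff true    # needed as of Git 2.34 so the rebase still happens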

When Git started out, 'git pull' defaulted to trying to fast-forward what had been fetched and then automatically doing a merge if that wasn't possible (basically 'git pull --ff', although I'm not sure they're exactly the same). In theory this was what you wanted as a developer; in practice it was easy to get surprise merges that you didn't want in various situations. Soon people began suggesting that you use 'git pull --ff-only' or, better yet, configure 'pull.ff only' through git config. At a certain point, the Git developers themselves decided this wasn't the greatest default and began having 'git pull' warn about it and ask you to do something about it (cf).

I started out setting 'pull.ff only' on some repositories as part of my multi-repo workflow or tracking some upstreams. When Git started warning about plain 'git pull' with nothing configured, I upgraded this to a global setting of 'pull.ff only' in my ~/.gitconfig, since I didn't have any repositories where I want to automatically pull and merge. Almost always I'm either exactly tracking an upstream (where I'm going to reset to origin/main if the upstream does something weird) or I'm carrying local changes that I want to automatically rebase on top of the new upstream. For a long time this worked fine, and then I updated to the Fedora 34 version of Git 2.34.1 and suddenly my automatic rebasing on pulls broke (you can read an example in my Fedora bug report).

As of Git 2.34, two things have changed. First, the default behavior of 'git pull' is now to abort if it can't fast-forward the upstream into your local branch (ie, 'git pull --ff-only'); basically, the previous warning has become an error. Second, the configuration setting of 'pull.ff only' now takes priority over 'pull.rebase true' (although not over an explicit --rebase on the command line). If you have both in a repository with things to rebase, you effectively wind up running 'git pull --ff-only', which fails because you have additional local changes that Git thinks would have to be merged. The behavior of 'pull.ff only' here may be an accidental bug and is certainly not historical behavior, but we have to deal with the Git release we get, not the one we'd like.

(This isn't explicitly documented in the release notes (also), although they do talk about a general overhaul of this area. The 2.34 manual pages are not the clearest about the behavior and they don't explicitly say that 'pull.ff only' takes priority, although the git pull manpage somewhat implies that --ff-only conflicts with --rebase.)

As far as I can see, I have two options to get the behavior I want. First, I can continue to set 'pull.ff only' globally, for safety, and then set 'pull.ff true' in any repository where I also set 'pull.rebase true'. I'm not sure that this is completely harmless in old versions of Git, though. The other option, if I'm only going to be using Git 2.34 and later and I trust Git not to make any surprising changes here, is to not set pull.ff globally and set only 'pull.rebase true' in my 'I have changes on top of upstream' repositories that I want to automatically rebase on pulls.

(I've tested both approaches and they appear to work. Carefully reading the Git source code confirms that 'git pull --ff-only' is basically the default, although it's not actually implemented quite that way. See builtin/pull.c for the gory details.)

PS: A nice long history of the changes in this area in 'git pull' from 2020 or so onward is here. There have been a number of steps. The history has already been updated a bit for Git 2.35 and may be updated more in the future, so check in on it for the latest news.

GitPullConfigAndRebase written at 20:50:22
