VCS history versus large open source development

October 19, 2021

I recently read Fossil's Rebase Considered Harmful, which is another rerun of the great rebase versus everything else debate. This time around, one of the things that occurred to me is that rebasing and an array of similar things allow maintainers of large, public open source repositories to draw a clean line between how people develop changes in private and what appears in the immutable public history of the project. Any open source project can benefit from clean public history, partly because clean history makes it easy to use bisection to locate bugs, but a large project especially benefits because it has so many contributors of varying skill levels and practices.

(In addition, consumers of public open source repositories often already see a linear view of the project's code history.)
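To make the bisection point concrete, here is a minimal Python sketch of bisecting a linear history. It is only an illustration, not how any particular VCS implements its bisect command; the commit names and the `is_bad` test are made up, and it assumes that once a bug appears, every later commit also has it.

```python
from typing import Callable, Sequence


def first_bad_commit(history: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest bad commit in a linear history.

    Assumes history is ordered oldest to newest, that the newest commit
    is bad, and that every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(history) - 1  # invariant: history[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(history[mid]):
            hi = mid       # the bug is already present at mid
        else:
            lo = mid + 1   # the bug was introduced after mid
    return history[lo]


# Hypothetical 16-commit history where the bug first appears in "c11".
history = [f"c{i}" for i in range(1, 17)]
tested = []


def is_bad(commit: str) -> bool:
    tested.append(commit)  # record which commits had to be built and tested
    return int(commit[1:]) >= 11


culprit = first_bad_commit(history, is_bad)
print(culprit, "found after testing", len(tested), "commits")
```

The search only works if `is_bad` can give a trustworthy answer for every commit it lands on. Work-in-progress commits that don't build or that are knowingly broken undermine that, which is one reason a curated public history helps here.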

Another aspect of using rebasing and other things that erase history (such as emailed patch series) is that they free people to develop changes in whatever style of VCS usage they find most comfortable and useful. You can set your editor to make a commit every time you save a file, and no one else has to care in the way they very much would if you proposed to merge the entire sequence intact into a large, public open source repository. The more contributors you have (and the more disparate they are), the more potentially useful this is.

Of course, there's a continuum, both between projects and in general. It's undeniably sometimes useful to know how a change was developed over time, for various reasons. It can also be useful to know how a change has flowed through various public versions of the code. The Linux kernel famously has a whole collection of trees that changes can wind up in before they get pulled into the mainline, and when this is done the changes often continue to carry their history of trees. Presumably this is useful to Linus Torvalds and other kernel developers.

One way to put this is that as an open source project grows larger and larger, I think that it makes less and less sense to try to represent almost everything that happens to the project in its VCS history. VCS history is only one way to capture and handle the entire history of the project; using it for everything has the same sort of broad problems that using any single thing for everything has. Perhaps the larger your project is, the more you should be explicitly asking what your VCS history is for and how you want it to be used (and to be useful).


Comments on this page:

By Walex at 2021-10-20 05:00:12:

«it makes less and less sense to try to represent almost everything that happens to the project in its VCS history. VCS history is only one way to capture and handle the entire history of the project»

What is this "VCS history" that this post is about? I think that all currently popular VCSes have per-"branch" history, but have no "VCS history" or repository history.

The "great rebase versus everything else debate" is really about whether the main release/public branch should be based on merges or on rewriting. Whatever happens on the other branches, in particular local branches, does not matter a lot.

The debate in my interpretation is: rewriting history is perfectly fine if it is done with great care and discipline and is well documented; it is not so fine if done messily.

For 'git', which is the 'rebase' debate nexus, all of its design is based on enabling a workflow where one very skilled, disciplined, careful editor receives, reviews and curates a lot of random updates from random people into one edited work, and I guess 'rebase' is good for that workflow.

I think that all currently popular VCSes have per-“branch” history, but have no “VCS history” or repository history.

I’m not sure exactly what this remark is aiming at, but Git does have what it calls reflogs, which might cover that, fully or at least partly.

By cks at 2021-10-20 13:18:13:

What I mean by a project's VCS history in this entry is the commit or change history visible in the project's public VCS repository or repositories, whether those have been minimized and cleaned up through means like rebases or fully represent all of the original changes ever made by contributors.

From there, if you take the full purist "never rebase" view there's no such thing as purely local branch history for contributors (well, for changes that did make it to the master version eventually). The actual named local branches may never exist in the project's public VCS, but all the commits from them will (including merge commits to pull in upstream development that happened while contributors developed their changes). This is implied by never doing any form of squashing and always contributing through merges. The only way to trim that history is to go outside the VCS; you throw away one local branch, start another one with a "clean" history, and redo your change there before having upstream merge the now cleaner history.

The less purist view is that people should generally upstream 'clean' versions of their final changes, without the full local history of their creation and evolution. There are a lot of ways of doing this, some of them explicit and others implicit. Rebasing and squashing your local changes before upstream pulls from you is an explicit in-VCS approach; extracting your final diff between upstream and you and sending it by email as a patch is an out of VCS approach.

My view is that as open source projects get larger and larger, it's more and more useful to move toward the non-purist view of their public project VCS history. The resulting visible commits and project history are easier to follow and more useful for bisection, and you don't have to rely so much on contributors always following good VCS practices in their local development before they ask you to pull from them. The project's public VCS trees become more and more a curated experience for an increasing number of consumers, including developers who no longer have the time to follow every step in every change being developed and would anyway only look at the final diffs, not the detailed history.

By Sam Birch at 2021-10-24 18:26:41:

It saddens me that Mercurial's changeset evolution arrived so late and is so poorly understood. IMO it really is the best of both worlds.

By cks at 2021-10-24 20:47:05:

My view is that changeset evolution solves only the technical part of the problem for large open source projects, although it's great to have a technical solution. The social part is that, to put it one way, Linus Torvalds probably doesn't want flailing attempts at changeset development to appear in his tree at all, even if they're cordoned off into the corner. Having everything in VCS history in some form means that a large open source project will inevitably have a lot of crud in there, and they have good reasons to want crud-free trees.

Written on 19 October 2021.

Last modified: Tue Oct 19 23:24:48 2021