When cloning git repos, things go faster if you start from a good base

February 18, 2019

Normally I get a copy of any git repo that I want by directly cloning from its upstream, whatever that is. Generally this goes fast enough and it guarantees that I have an exact copy of the repo, one that's not contaminated by anything else that I might have been doing to any other copy of the repo that I might already have. But generally I'm doing it on a 1G network link. Recently I needed a copy of Guenter Roeck's tree of hwmon updates for the Linux kernel on my home machine, where I had not previously cloned it. My first attempt was a direct clone from kernel.org, and let me tell you, it didn't go all that fast over my DSL link.

Like probably everyone else with a Linux kernel tree, Guenter Roeck's tree is ultimately a descendant of the regular official Linux kernel tree. I already keep a clone of the regular Linux kernel tree at home (and at work), because I wind up referring to it often enough. So, I wondered, what if I started by making a copy of my kernel tree, then added Guenter Roeck's as an additional upstream and fetched it?

Well:

remote: Counting objects: 2269, done.
remote: Compressing objects: 100% (567/567), done.
remote: Total 2269 (delta 1804), reused 2061 (delta 1702)
Receiving objects: 100% (2269/2269), 1.30 MiB | 268.00 KiB/s, done.
Resolving deltas: 100% (1804/1804), done.

Let me assure you that cloning a full Linux kernel tree involves a lot more than 2269 objects and a lot more than 1.3 MiB of data.
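As a rough illustration of the sequence of steps, here is a minimal sketch that runs entirely offline. Everything in it is a stand-in: the throwaway repos 'mainline', 'hwmon-upstream', 'my-linux' and 'hwmon-work', the remote name 'hwmon', and the commit_file helper are all hypothetical, and the real thing would use the actual upstream URL instead of a local file:// path.

```shell
#!/bin/sh
# Sketch only: throwaway local repos stand in for the real upstreams,
# so every path and repo name below is hypothetical.
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Helper: commit one small file into repo $1, using $2 as name and message.
commit_file() {
    echo "$2" > "$1/$2.txt"
    git -C "$1" add "$2.txt"
    git -C "$1" commit -q -m "$2"
}

# Stand-in for the official kernel tree.
git init -q mainline
commit_file mainline base

# Stand-in for a descendant tree with extra commits on top.
git clone -q mainline hwmon-upstream
commit_file hwmon-upstream hwmon-update

# Stand-in for the clone of the official tree that already exists locally.
git clone -q mainline my-linux

# The trick: copy the local clone, add the other tree as a remote, fetch.
git clone -q my-linux hwmon-work
git -C hwmon-work remote add hwmon "file://$work/hwmon-upstream"
git -C hwmon-work fetch -q hwmon

# The fetch only had to transfer what the base copy lacked.
branch=$(git -C hwmon-upstream symbolic-ref --short HEAD)
tip=$(git -C hwmon-work log -1 --format=%s "hwmon/$branch")
echo "tip of hwmon/$branch: $tip"
```

After this, the upstream's branches are available as hwmon/<branch> in the new copy, while 'origin' still points at the local clone it was copied from.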

Because of git's fundamental nature as a content-addressable data store, in theory this trick works on anything with significant object overlap, not just things that ultimately descend from the same source. In practice this is generally unimportant; almost everything you're going to want to pull this trick on has a common ancestor.

(The possible exception is if two separate groups are maintaining git repos that are converted from something else, such as a Mercurial repo. At least the blob and tree objects should be identical between the two repos, and if you're lucky maybe the commits as well, despite these repos not having a common git ancestor.)
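The content-addressable point can be checked directly. This is a small sketch, assuming two freshly created and completely unrelated repos (the names 'repo-a' and 'repo-b' are made up for the illustration): identical file contents get the same object ID in both, which is what lets git skip transferring objects it already has.

```shell
#!/bin/sh
# Sketch only: repo-a and repo-b are hypothetical, unrelated repos.
set -eu
work=$(mktemp -d)
cd "$work"

git init -q repo-a
git init -q repo-b

# The same bytes, stored into each repo's object store independently.
printf 'identical file contents\n' > shared.txt
id_a=$(git -C repo-a hash-object -w --stdin < shared.txt)
id_b=$(git -C repo-b hash-object -w --stdin < shared.txt)

# The ID depends only on the content, not on which repo it is in.
echo "repo-a: $id_a"
echo "repo-b: $id_b"
```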

This feels like an obvious trick now that I've done it once, so I'm probably going to try to do it more. There are some variations of the trick one can probably perform, such as actively changing the 'origin' upstream over to the upstream you really want to be based on and pulling from. My one question about that would be how one cleans up branches (and perhaps tags) that are only found in the repo you started out by cloning from.


Comments on this page:

From 193.219.181.211 at 2019-02-19 00:28:52:

To clean up remote-tracking branches, git fetch --prune; to clean up tags, some config trickery might be required. Normally tags aren't fetched via refspec, they just magically appear, but I suspect adding refs/tags/*:refs/tags/* as a second value of remote.origin.fetch would allow prune to work. (But it would also fetch a lot of single-use tags you don't want...)
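A minimal offline sketch of that suggestion, with hypothetical throwaway repos ('upstream' and 'copy') and made-up branch and tag names; it adds the tag refspec exactly as written above and then checks whether a branch and a tag deleted upstream disappear after a pruning fetch:

```shell
#!/bin/sh
# Sketch only: 'upstream', 'copy', 'topic' and 'scratch-tag' are hypothetical.
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# An upstream with one extra branch and one tag, cloned while both exist.
git init -q upstream
git -C upstream commit -q --allow-empty -m "base"
git -C upstream branch topic
git -C upstream tag scratch-tag
git clone -q upstream copy

# Upstream later drops both the branch and the tag.
git -C upstream branch -q -D topic
git -C upstream tag -d scratch-tag >/dev/null

# Add the tag refspec as a second remote.origin.fetch value, then prune.
git -C copy config --add remote.origin.fetch 'refs/tags/*:refs/tags/*'
git -C copy fetch -q --prune origin

# Check what survived in the copy.
if git -C copy show-ref --verify --quiet refs/remotes/origin/topic; then
    branch_pruned=no
else
    branch_pruned=yes
fi
if git -C copy show-ref --verify --quiet refs/tags/scratch-tag; then
    tag_pruned=no
else
    tag_pruned=yes
fi
echo "origin/topic pruned: $branch_pruned"
echo "scratch-tag pruned:  $tag_pruned"
```

Note that with this refspec in place, pruning removes any local tag that the remote doesn't have, including ones you created yourself.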

But you can just avoid that entirely using git clone --reference ~/src/linux https://foo, which tells Git to do a brand new clone but use an extra object store.

(The object store sharing remains configured persistently, so you end up with a... "thin-provisioned" repo, unless you also use --dissociate at the same time.

You can convert a thin repo to a full one by either manually copying/hardlinking the packfiles or running git repack -da, then removing .git/objects/info/alternates. To do the opposite, create that file and shrink the repo via git repack -dla.)
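Here is a small offline sketch of the --reference approach and the thin-to-full conversion described above. The repos ('mainline', 'hwmon-upstream', 'my-linux', 'thin', 'full') and the commit_file helper are hypothetical stand-ins; a real use would give the actual upstream URL and the path to an existing local clone.

```shell
#!/bin/sh
# Sketch only: all repo names and paths here are hypothetical stand-ins.
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Helper: commit one small file into repo $1, using $2 as name and message.
commit_file() {
    echo "$2" > "$1/$2.txt"
    git -C "$1" add "$2.txt"
    git -C "$1" commit -q -m "$2"
}

git init -q mainline
commit_file mainline base
git clone -q mainline hwmon-upstream
commit_file hwmon-upstream hwmon-update
git clone -q mainline my-linux          # the existing local clone

# A new clone that borrows objects from the existing local clone.
git clone -q --reference "$work/my-linux" "file://$work/hwmon-upstream" thin
alt=thin/.git/objects/info/alternates
[ -f "$alt" ] && echo "thin repo borrows from: $(cat "$alt")"

# Convert the thin repo to a full one: repack everything, drop alternates.
git -C thin repack -q -a -d
rm "$alt"
git -C thin fsck --no-progress >/dev/null
echo "thin repo is now self-contained"

# Or get a self-contained clone in one step with --dissociate.
git clone -q --reference "$work/my-linux" --dissociate \
    "file://$work/hwmon-upstream" full
```

Until the alternates file is gone, the borrowing repo depends on the referenced repo's objects staying around, so deleting or aggressively garbage-collecting the referenced repo can break it.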

Last modified: Mon Feb 18 21:34:09 2019