The git version control system as a creation of the modern age

March 26, 2009

It has recently struck me that git (the distributed version control system that Linus Torvalds created) is very much a creation of the modern age. By that I don't mean that it was created recently or by modern, Internet-enabled methods; I mean that it only makes sense in the modern era of really cheap, really large disk space.

Earlier version control systems were created when disk space was much more expensive and not all that large relative to your source code. In that era, wasting disk space was a sin and a serious issue, so the version control systems spent a lot of effort (and gave up a lot of speed) in order to have efficient storage of closely related versions of the same file, using interleaved fragments (SCCS) or reverse deltas (RCS) or the like. (Then people agonized about how storing binary files made these schemes blow up, and invented binary delta formats, and so on.)

In the modern era, disks are huge relative to your source base and all of those complex compression schemes are premature optimizations and more or less pointless. So git started out by throwing the whole issue out the window and just taking the brute force approach of storing full files. Change one line? You stored another full copy of the file. In the modern age, the extra disk space didn't really matter, not compared to the simplifications that you gained from this approach. In the old era that would have been unthinkable, and the lack of 'efficient storage' would have gotten git laughed out of the room.

(I suspect that chasing efficient storage had other effects on the overall design of old version control systems. For example, is the obsession with explicitly tracking file renames partly driven by the storage inefficiency of not doing so?)

Comments on this page:

From at 2009-03-26 05:40:55:

Huh? I haven't used git much, but I'm a hg developer, and we surely don't store full copies on every change. I strongly doubt that git does, either. In Mercurial, we store deltas from a base revision, right until the delta becomes almost-as-large as (a new) full copy. Since git does content-based versioning, I expect that it actually only stores the changed content, and some hash structure that tells it how to reconstruct the changed file from that.

Dirkjan Ochtman

By cks at 2009-03-26 12:06:10:

Current versions of git have an optional delta-based 'pack' format for storage, but the core git storage format is one Unix file per repository object (see for example here). This is part of the reason that git can be crazy fast at operations like commits, because they require just a few file copies (and of things that are probably hot in caches, too).

From at 2009-03-26 13:30:53:

I disagree, there were many pre-UNIX filesystems that had file versioning (VMS and the ITS file system come to mind), I don't think these used deltas internally, and they were effectively used for version control, but on a file level. (You had to prune them often, of course.)

What Git changed is that it does content-indexing, which probably would have been too expensive back in the days. Then, Plan9's Venti pretty much is the same.

Still, Git is such an easy concept one wonders why nobody came up with it earlier (I think the packed file algorithm is pretty "recent", but its not like algorithms of this kind didn't exist before).


By cks at 2009-03-26 14:06:40:

As you mention, the difference between pre-Unix filesystems with versioning and git is that the filesystem versioning wasn't used to keep a full history of your files; one frequently pruned old versions (either automatically or manually). If you wanted a full history you had to use some sort of source code management system, and I expect that they used delta formats precisely to save precious space.

(I suspect that file versioning on pre-Unix filesystems was mostly designed for and used for quick and easy recoveries from various sorts of accidents.)

From at 2009-04-07 06:02:56:

First, even though git originally, at the beginning, didn't use any form of delta encoding to reduce disk space (disk is cheap), it still used compression, and it stored only unique different versions of contents of a file (thanks to its content hash addressed design borrowed from Monotone). This storage format is called loose format.

Second, quite soon after it acquired packed format, which uses delta encoding. It uses the same format for network (bandwidth is not as large and as cheap as disk space) as for disk. You should repack your repository from time to time (it nowadays, with modren git, should usually happen automatically) mainly not because it reduces disk space, but because it reduces access time (and wasted disk space and inodes). BTW. delta encoding used by git is XDiffLib binary delta encoding...

Still, I agree that "disk is cheap" mentality allowed for very clear git repository model, without need to build it around deltaification scheme (or equivalent, like weave storage model).

--Jakub Narębski

Written on 26 March 2009.
« Using fully mirrored system disks on Linux
An important difference »

Page tools: View Source, View Normal, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Thu Mar 26 01:05:47 2009
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.