Wandering Thoughts archives

2011-04-10

The evolution of the git tree format

This is a second-hand war story. Second hand because it's not my war story; I was just lucky enough to be reading the right mailing lists during the early days of Linus Torvalds developing git.

(One of the occasional privileges of hanging around the mailing lists for various open source projects is getting to see people evolve designs on the fly as they learn more about the problem that they're tackling. To be clear, I mean this non-sarcastically; it's not often that you get to see highly skilled developers refine something before your eyes, and the experience is very useful if you pay attention.)

A git commit captures, among other things, the state of the directory tree at that point. Conceptually, the state of the tree is just a list of all of the filenames in the tree with the SHA1 hashes of their current contents (sorted into some canonical order). Git calls this a 'tree'.

In the early versions of git, the internal representation of a tree was literally this simple; it was a file with all of the filenames and their SHA1 hashes (and their permissions and so on). This worked fine for git itself, but when Linus started trying to apply this to larger things (like say the Linux kernel) it rapidly became obvious that there was a drawback to this simple approach. A list of all of the files in the Linux kernel is a pretty big thing, and most commits change only a very few files, so commits were creating big new tree objects where almost all of the contents were the same as the previous version of the tree. This was a very inefficient representation.

(It's not just an issue of disk space. Unnecessarily large tree objects take longer to generate and process and compare and so on; the code spends a lot of time doing pointless work.)

Linus's solution was to make tree objects hierarchical, containing both filenames and pointers to (sub-)tree objects to represent subdirectories. This means that a commit's tree can reuse the sub-tree objects for all of the subdirectories that haven't changed from the last commit (well, from any commit, really); for typical commits, almost all of the tree is the same as the last time around so almost all of the objects are reused as-is. And the bits of the tree representation that do change are relatively small, since individual directories tend to not be very large.

(When a file changes, every tree object between it and the root of the repository also has to change. The file's subdirectory tree object changes because the file's SHA1 has changed, the tree object of the subdirectory's parent has to change because now the subdirectory tree object has a different SHA1 itself, and so on up to the root.)

A disclaimer: this is how I remember things going. I have to admit that I haven't gone back to the appropriate mailing list archives to double-check that my memory is completely correct here.

Sidebar: some numbers on tree sizes

A null-separated list of all of the files in a current Linux kernel tree is over a megabyte; a git tree version of this would be larger, since it also needs to encode SHA1s and file permissions. Even compressed with gzip this file list is over 180 KB.

tech/GitTreeEvolution written at 01:55:14; Add Comment


Page tools: See As Normal.
Search:
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.