The needs of Version Control Systems conflict with capturing all metadata

November 11, 2018

In a comment on my entry "Metadata that you can't commit into a VCS is a mistake (for file based websites)", Andrew Reilly put forward a position that I find myself in some sympathy with:

Doesn't it strike you that if your VCS isn't faithfully recording and tracking the metadata associated with the contents of your files, then it's broken?

Certainly I've wished for VCSes to capture more metadata than they do. But, unfortunately, I've come to believe that there are practical issues for VCS usage that conflict with capturing and restoring metadata, especially once you get into advanced cases such as file attributes. In short, what most users of a VCS want is actively in conflict with the VCS being a complete and faithful backup and restore system, especially in practice (ie, with limited programming resources to build and maintain the VCS).

The obvious issue is file modification times. Restoring file modification time on checkout can cause many build systems (starting with make) to not rebuild things if you check out an old version after working on a recent version. More advanced build systems that don't trust file modification timestamps won't be misled by this, but not everything uses them (and not everything should have to).
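To make the timestamp problem concrete, here is a minimal sketch in Python of the kind of check that make performs (rebuild if the target is missing or older than its source). The function names and the file names prog.c and prog.o are made up for illustration, and real make has more rules than this.

```python
import os
import tempfile

def needs_rebuild(target, source):
    """make-style check: rebuild if target is missing or older than source."""
    if not os.path.exists(target):
        return True
    return os.stat(target).st_mtime_ns < os.stat(source).st_mtime_ns

def demo():
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "prog.c")
        obj = os.path.join(d, "prog.o")
        # Recent work: source last edited at t=2000, object built at t=2001.
        for path, when in ((src, 2000), (obj, 2001)):
            with open(path, "w") as f:
                f.write("recent")
            os.utime(path, (when, when))
        up_to_date_before = not needs_rebuild(obj, src)

        # Check out an old version with a hypothetical VCS that restores
        # the old modification time (t=1000) along with the old contents.
        with open(src, "w") as f:
            f.write("old contents")
        os.utime(src, (1000, 1000))
        stale_object_kept = not needs_rebuild(obj, src)

        # What VCSes usually do instead: the checked-out file simply gets
        # the time of the checkout (t=3000 here), so the rebuild happens.
        os.utime(src, (3000, 3000))
        rebuilt = needs_rebuild(obj, src)
        return up_to_date_before, stale_object_kept, rebuilt
```

In the middle case the source has changed but the check still says the object is current, which is exactly the failure that not restoring timestamps avoids.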

More generally, metadata has the problem that much of it isn't portable. Non-portable metadata raises multiple issues. First, you need system-specific code to capture and restore it. Then you need to decide how to represent it in your VCS (for instance, do you represent it as essentially opaque blobs, or do you try to translate it to some common format for its type of metadata). Finally, you have to decide what to do if you can't restore a particular piece of metadata on checkout (either because it's not supported on this system or because of various potential errors).
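As a rough illustration of these three decisions, here is a sketch in Python that assumes a Unix-like system. It captures permission bits and modification time, treats extended attributes as opaque hex blobs where the platform has them, and reports (rather than aborts on) anything it can't restore. The function names and the dictionary layout are my own inventions, not anything an actual VCS does.

```python
import os
import stat
import tempfile

def capture(path):
    """Record permission bits, mtime, and (where supported) extended attributes."""
    st = os.stat(path)
    meta = {"mode": stat.S_IMODE(st.st_mode), "mtime_ns": st.st_mtime_ns, "xattrs": None}
    if hasattr(os, "listxattr"):  # system-specific: only some platforms have this
        try:
            # Representation choice: keep each value as an opaque hex blob.
            meta["xattrs"] = {n: os.getxattr(path, n).hex() for n in os.listxattr(path)}
        except OSError:
            pass  # this filesystem has no xattr support; leave it as None
    return meta

def restore(path, meta):
    """Apply what we can; return the names of the items that could not be restored."""
    failed = []
    try:
        os.chmod(path, meta["mode"])
    except OSError:
        failed.append("mode")
    for name, blob in (meta["xattrs"] or {}).items():
        try:
            os.setxattr(path, name, bytes.fromhex(blob))
        except (AttributeError, OSError):
            failed.append("xattr:" + name)
    try:
        os.utime(path, ns=(meta["mtime_ns"], meta["mtime_ns"]))
    except OSError:
        failed.append("mtime_ns")
    return failed

def roundtrip_demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "page.txt")
        with open(path, "w") as f:
            f.write("hello")
        os.chmod(path, 0o640)
        os.utime(path, (1000, 1000))
        saved = capture(path)
        # Simulate a checkout that loses the metadata.
        os.chmod(path, 0o600)
        os.utime(path, (2000, 2000))
        failed = restore(path, saved)
        return saved, capture(path), failed
```

Even this small sketch had to pick one answer for each question (opaque blobs, and "carry on and report" for failures), and other users would reasonably want different ones.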

(Capturing certain sorts of metadata can also be surprisingly expensive and strongly influence certain sorts of things about your storage format. Consider the challenges of dealing with Unix hardlinks, for example.)
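For a sense of what hardlinks involve, here is a sketch in Python of just the detection half, assuming a Unix-like system. The function name is hypothetical. It has to walk the whole tree and remember every multiply-linked inode it has seen, and it only finds links that are inside the tree.

```python
import os
import tempfile
from collections import defaultdict

def find_hardlink_groups(root):
    """Return lists of paths under root that are hardlinks to the same file."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if st.st_nlink > 1:
                # Two names are the same file if device and inode both match.
                groups[(st.st_dev, st.st_ino)].append(path)
    # A file may have links outside root; only report groups we can see.
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]

def demo():
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a.txt")
        with open(a, "w") as f:
            f.write("shared")
        os.link(a, os.path.join(d, "b.txt"))       # a second name for a.txt
        with open(os.path.join(d, "c.txt"), "w") as f:
            f.write("shared")                      # same contents, separate file
        groups = find_hardlink_groups(d)
        return [[os.path.basename(p) for p in g] for g in groups]
```

Detection is the easy part. A storage format that maps each path to its contents has no place to say that a.txt and b.txt are one file while c.txt merely has the same contents, which is where the influence on the storage format comes in.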

You can come up with answers for all of these, but the fundamental problem is that the answers are not universal; different use cases will have different answers (and some of these answers may actually conflict with each other; for instance, whether on Unix systems you should store UIDs and GIDs as numbers or as names). VCSes are not designed or built to be comprehensive backup systems, partly because that's a very hard job (especially if you demand cross system portability of the result, which people do very much want for VCSes). Instead they're designed to capture what's important for version controlling things and as such they deliberately exclude things that they think aren't necessary, aren't important, or are problematic. This is a perfectly sensible decision for what they're aimed at, in line with how current VCSes don't do well at handling various sorts of encoded data (starting with JSON blobs and moving up to, say, word processor documents).
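The UID question can be shown with a toy example in Python. Everything here is hypothetical (the function, the user names, and the numbers); the point is only that the two reasonable policies give different answers for the same recorded file.

```python
def resolve_owner(recorded, local_users, policy):
    """Pick the UID to restore for a file.

    recorded:    {"uid": int, "name": str} as captured on the source system.
    local_users: name -> uid mapping on the system doing the checkout.
    policy:      "number" or "name".
    Returns a UID, or None if the policy cannot be satisfied on this system.
    """
    if policy == "number":
        return recorded["uid"]
    if policy == "name":
        return local_users.get(recorded["name"])
    raise ValueError("unknown policy: %r" % (policy,))

# On the source system the file was owned by alice, who had UID 1000.
recorded = {"uid": 1000, "name": "alice"}
# On the target system UID 1000 is bob, and alice has UID 1005.
target_users = {"bob": 1000, "alice": 1005}

by_number = resolve_owner(recorded, target_users, "number")  # bob's UID here
by_name = resolve_owner(recorded, target_users, "name")      # alice's UID here
missing = resolve_owner({"uid": 1000, "name": "carol"}, target_users, "name")
```

Restoring by number hands the file to bob, restoring by name hands it to alice, and restoring by name fails outright for a user who doesn't exist on the target. Which of these is right depends entirely on what you're using the checkout for.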

Would it be nice to have a perfect VCS, one that captured everything, could restore everything if you asked for it, and knew how to give you useful differences even between things like word processor documents? Sure. But I can't claim with a straight face that not being perfect makes a VCS broken. Current VCSes explicitly make the tradeoff that they are focused on plain text files in situations where only some sorts of metadata are important. If you need to go outside their bounds, you'll need additional tooling on top of them (or instead of them).

(Or, the short version, VCSes are not backup systems and have never claimed to be ones. If you need to capture everything about your filesystem hierarchy, you need a carefully selected, system specific backup program. Pragmatically, you'd better test it to make sure it really does back up and restore unusual metadata, such as file attributes.)


Comments on this page:

By Greg A. Woods at 2018-11-16 15:51:49:

Indeed, "VCSes are not backup systems" (with perhaps the exceptions being one or two mostly unsuccessful experiments done decades ago).

I've been trying to say this for literally decades now. It was probably only a few days or weeks after I first encountered SCCS and RCS that I also encountered people trying in vain to use them as backup tools with versioning capability.

Those exceptions I mentioned were actually full filesystems which recorded the history of every change, or at least all specified changes. We do have some few modern filesystems that can do snapshots, but of course such a feature on its own isn't really sufficient for the filesystem to be called a VCS (even if you're willing and able to make each project directory a separate filesystem).

Most VCSs have been, from the beginning and including all that I know of in use today, designed to store source code (i.e. human and machine readable data, which mostly means text files), not arbitrary files, and especially not filesystem metadata. Indeed, though modern VCSs are able to store binary data, it soon becomes apparent to anyone exploring them in depth that one rarely, if ever, wants to store product files of any kind in the VCS. Even when this is done for good reason it's usually done by proxy, e.g. "git annex".

So, the answer to trying to use a VCS to store filesystem metadata is to translate that metadata into some form of executable description of it, store that description (e.g. a makefile or script or program which creates/changes filesystem metadata), and arrange to run the description after checking out a specific version of a project.
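As one possible shape for such an executable description, here is a sketch in Python that walks a tree and emits a small shell script of chmod commands, which could then be committed and run after checkout. It assumes a Unix-like system and only handles permission bits; the function names and the file name used in the demo are hypothetical.

```python
import os
import shlex
import stat
import tempfile

def metadata_script(root):
    """Return the text of a /bin/sh script that re-applies permission bits."""
    lines = ["#!/bin/sh",
             "# Generated file: run from the top of the checkout."]
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit directories in a stable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # chmod would follow the link, so skip symlinks
            mode = stat.S_IMODE(os.lstat(path).st_mode)
            # Prefix with ./ so a name starting with '-' is not read as an option.
            rel = "./" + os.path.relpath(path, root)
            lines.append("chmod %04o %s" % (mode, shlex.quote(rel)))
    return "\n".join(lines) + "\n"

def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "a b.txt")
        with open(path, "w") as f:
            f.write("x")
        os.chmod(path, 0o640)
        return metadata_script(d)
```

Because the output is ordinary text with stable ordering, the VCS can diff and merge it like any other source file, which is the appeal of this approach.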

Written on 11 November 2018.

Last modified: Sun Nov 11 18:40:40 2018