The needs of Version Control Systems conflict with capturing all metadata

November 11, 2018

In a comment on my entry Metadata that you can't commit into a VCS is a mistake (for file based websites), Andrew Reilly put forward a position that I find myself in some sympathy with:

Doesn't it strike you that if your VCS isn't faithfully recording and tracking the metadata associated with the contents of your files, then it's broken?

Certainly I've wished for VCSes to capture more metadata than they do. But, unfortunately, I've come to believe that there are practical issues for VCS usage that conflict with capturing and restoring metadata, especially once you get into advanced cases such as file attributes. In short, what most users of a VCS want are actively in conflict with the VCS being a complete and faithful backup and restore system, especially in practice (ie, with limited programming resources to build and maintain the VCS).

The obvious issue is file modification times. Restoring file modification time on checkout can cause many build systems (starting with make) to not rebuild things if you check out an old version after working on a recent version. More advanced build systems that don't trust file modification timestamps won't be misled by this, but not everything uses them (and not everything should have to).

More generally, metadata has the problem that much of it isn't portable. Non-portable metadata raises multiple issues. First, you need system-specific code to capture and restore it. Then you need to decide how to represent it in your VCS (for instance, do you represent it as essentially opaque blobs, or do you try to translate it to some common format for its type of metadata). Finally, you have to decide what to do if you can't restore a particular piece of metadata on checkout (either because it's not supported on this system or because of various potential errors).

(Capturing certain sorts of metadata can also be surprisingly expensive and strongly influence certain sorts of things about your storage format. Consider the challenges of dealing with Unix hardlinks, for example.)

You can come up with answers for all of these, but the fundamental problem is that the answers are not universal; different use cases will have different answers (and some of these answers may actually conflict with each other; for instance, whether on Unix systems you should store UIDs and GIDs as numbers or as names). VCSes are not designed or built to be comprehensive backup systems, partly because that's a very hard job (especially if you demand cross system portability of the result, which people do very much want for VCSes). Instead they're designed to capture what's important for version controlling things and as such they deliberately exclude things that they think aren't necessary, aren't important, or are problematic. This is a perfectly sensible decision for what they're aimed at, in line with how current VCSes don't do well at handling various sorts of encoded data (starting with JSON blobs and moving up to, say, word processor documents).

Would it be nice to have a perfect VCS, one that captured everything, could restore everything if you asked for it, and knew how to give you useful differences even between things like word processor documents? Sure. But I can't claim with a straight face that not being perfect makes a VCS broken. Current VCSes explicitly make the tradeoff that they are focused on plain text files in situations where only some sorts of metadata are important. If you need to go outside their bounds, you'll need additional tooling on top of them (or instead of them).

(Or, the short version, VCSes are not backup systems and have never claimed to be ones. If you need to capture everything about your filesystem hierarchy, you need a carefully selected, system specific backup program. Pragmatically, you'd better test it to make sure it really does back up and restore unusual metadata, such as file attributes.)

Written on 11 November 2018.
« OpenSSH 7.9's new key revocation support is welcome but can't be a full fix
Easy configuration for lots of Prometheus Blackbox checks »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Sun Nov 11 18:40:40 2018
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.