Metadata that you can't commit into a VCS is a mistake (for file based websites)

November 2, 2018

I'll start with my tweet and @rt2800pci1's (first) reply:

@thatcks: I like having a file-based blog engine, but mine does make changing the 'category' of a post somewhat painful and a bit disruptive (it re-appears in syndication feeds). Still, I'm too annoyed by my own mistakes to not do it.

@rt2800pci1: Have you thought of using extended attributes on those files to tag them categorically?

[...]

My immediate mental reaction to this suggestion made me realize a view of mine that I've slowly and painfully formed over the years of operating DWiki, the software behind Wandering Thoughts. To wit:

In a file based website engine, any form of metadata that you can't usefully commit into common version control systems is a mistake.

(Some people would go further and say 'for any website', but I'll stick to file based websites for now.)

Using a file's modification time as the creation date of an entry? A mistake (that I've made). Using extended attributes to store tags or categories or other information? Again, a mistake. Having a SQLite database be the master source of information for anything? A mistake. Putting important entry information into essentially opaque JSON blobs that you can't read or edit by hand? A mistake (you can commit them, but you can't do many useful things in the VCS with them, such as diffing two copies).

Basically, the master version of everything should be in human readable plain text. I will somewhat reluctantly accept YAML as sufficiently close, and probably also nicely formatted JSON, but that's about it. You can compile all of this master information into efficient binary forms (as an SQLite database or whatever), but the compiled binary form should not be the canonical master form; it should be an optimization that you can recreate on demand. Similarly, if you use any filesystem metadata (either because it's convenient or because it's necessary), it should be created or set from text-based versions of the same information.
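As a minimal sketch of that idea (this is not DWiki's code; the `key: value` header format, the `category` and `written` field names, the `.txt` suffix, and the `rebuild_index` helper are all assumptions made up for illustration), the text files stay the master copy, and both the SQLite index and the file modification times are derived from them and can be thrown away and rebuilt:

```python
import os
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def parse_entry(path):
    """Split an entry into its 'key: value' header lines and its body.

    The header format is hypothetical: header lines, a blank line, then text.
    """
    text = Path(path).read_text(encoding="utf-8")
    header, _, body = text.partition("\n\n")
    meta = {}
    for line in header.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip().lower()] = value.strip()
    return meta, body


def rebuild_index(root, db_path):
    """Recreate the derived SQLite index purely from the text files."""
    if os.path.exists(db_path):
        os.remove(db_path)  # the index is disposable; the text is the master
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE entries (path TEXT PRIMARY KEY, category TEXT, written TEXT)"
    )
    for path in sorted(Path(root).rglob("*.txt")):
        meta, _body = parse_entry(path)
        db.execute(
            "INSERT INTO entries VALUES (?, ?, ?)",
            (str(path.relative_to(root)), meta.get("category"), meta.get("written")),
        )
        if "written" in meta:
            # Filesystem metadata is set *from* the text, never the reverse.
            # Assumption: 'written' is an ISO 8601 timestamp, treated as UTC.
            when = datetime.fromisoformat(meta["written"]).replace(tzinfo=timezone.utc)
            os.utime(path, (when.timestamp(), when.timestamp()))
    db.commit()
    return db
```

With something like this, a fresh checkout from a VCS only needs one `rebuild_index()` run to get back both the index and the modification times, since neither one holds information that isn't already in the committed text.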

I came to this view the hard way. DWiki uses a bunch of file metadata for various things, and this has caused me any number of problems. The specific one that led to my tweet is that both the 'category' of an entry and its Atom syndication ID are based on its path in the filesystem; if I realize I made a mistake in the category of an entry and fix it, the entry is going to appear as a new entry in my syndication feed (under its new name). However, the long standing one is that DWiki uses the file modification time for an entry as when it was written, which means in practice that I can't keep DWiki pages in a VCS and leads to various other hacks (since I sometimes need to update entries).
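To illustrate the alternative (again only a sketch, not how DWiki works; the `atom-id` and `category` header fields, the `entry_identity` helper, and the `example.org` tag URI are hypothetical), an engine could take an entry's identity from text stored in the entry itself and fall back to the path only when that text is missing:

```python
from pathlib import PurePosixPath


def entry_identity(meta, path):
    """Return (category, atom_id), preferring the entry's own text metadata.

    'meta' is a dict of header fields already parsed from the entry file.
    The path is only a fallback, so moving the file changes nothing once
    the fields are written down in the entry.
    """
    p = PurePosixPath(path)
    category = meta.get("category", p.parent.name)
    atom_id = meta.get("atom-id", "tag:example.org,2018:/" + str(p))
    return category, atom_id


meta = {"category": "linux", "atom-id": "tag:example.org,2018:/blog/tech/post"}
# The same entry before and after being moved to a different directory:
before = entry_identity(meta, "blog/tech/post")
after = entry_identity(meta, "blog/linux/post")
```

Here `before` and `after` are equal, so feed readers would not see a new entry after the move, and the change of category would show up as an ordinary one-line diff in a VCS.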

(This issue of Atom syndication IDs has come up before.)

Using file and filesystem metadata in your file based blog or website engine has an obvious and immediate attraction; it feels neat, clever, and appropriately Unixy. It's just that, in practice, it's a mistake (for several reasons) and over the long term it will bite you on the rear.

(Not using file metadata is one of the things I would now do differently in a file based blog engine.)

PS: Using the filesystem as a database is also a mistake in my opinion. It doesn't entirely violate the 'everything can be committed' principle, because VCSes will capture directory hierarchy state, but it's not really in a format that they like and deal well with.


Comments on this page:

I started writing a comment in reply but it turned into an entry on my own weblog. (And thanks for compelling me to publish something there again after ages! I should do that more often…)

By Andrew Reilly at 2018-11-11 17:05:01:

Doesn't it strike you that if your VCS isn't faithfully recording and tracking the metadata associated with the contents of your files, then it's broken?

Now it may be that it's a common form of breakage, because most VCS grew up around program source code, and that generally only cares about the file contents. Still. Metadata exists. It should be saved for posterity and versioned.

By cks at 2018-11-11 19:14:31:

This is a sufficiently good question that I wound up writing an entry about VCSes versus metadata, but the short version is that I don't think VCSes are broken here because capturing all metadata isn't part of their job (especially since current VCSes are primarily focused on code development and its needs).

Written on 02 November 2018.


Last modified: Fri Nov 2 22:30:11 2018