Metadata that you can't commit into a VCS is a mistake (for file based websites)

November 2, 2018

I'll start with my tweet and @rt2800pci1's (first) reply:

@thatcks: I like having a file-based blog engine, but mine does make changing the 'category' of a post somewhat painful and a bit disruptive (it re-appears in syndication feeds). Still, I'm too annoyed by my own mistakes to not do it.

@rt2800pci1: Have you thought of using extended attributes on those files to tag them categorically?

[...]

My immediate mental reaction to this suggestion made me realize a view of mine that I've slowly and painfully formed over the years of operating DWiki, the software behind Wandering Thoughts. To wit:

In a file based website engine, any form of metadata that you can't usefully commit into common version control systems is a mistake.

(Some people would go further and say 'for any website', but I'll stick to file based websites for now.)

Using a file's modification time as the creation date of an entry? A mistake (that I've made). Using extended attributes to store tags or categories or other information? Again, a mistake. Having a SQLite database be the master source of information for anything? A mistake. Putting important entry information into essentially opaque JSON blobs that you can't read or edit by hand? A mistake (you can commit them, but you can't do many useful things in the VCS with them, such as diffing two copies).

Basically, the master version of everything should be in human readable plain text. I will somewhat reluctantly accept YAML as sufficiently close, and probably also nicely formatted JSON, but that's about it. You can compile all of this master information into efficient binary forms (as an SQLite database or whatever), but the compiled binary form should not be the canonical master form; it should be an optimization that you can recreate on demand. Similarly, if you use any filesystem metadata (either because it's convenient or because it's necessary), it should be created or set from text-based versions of the same information.
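As a rough illustration of 'text is the master, the database is a cache', here is a minimal Python sketch. Everything specific in it is an assumption for the example and not how DWiki works: the `key: value` header format, the `.txt` extension, the `category` and `created` fields, and the names `parse_entry` and `rebuild_index` are all hypothetical.

```python
import sqlite3
from pathlib import Path


def parse_entry(text):
    """Split an entry into (metadata dict, body).

    Hypothetical format: 'key: value' header lines, a blank line, then the body.
    """
    head, _, body = text.partition("\n\n")
    meta = {}
    for line in head.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip().lower()] = value.strip()
    return meta, body


def rebuild_index(root, db_path=":memory:"):
    """Recreate the SQLite index from scratch, using only the text files."""
    db = sqlite3.connect(db_path)
    # The index is disposable: drop it and rebuild it rather than trust it.
    db.execute("DROP TABLE IF EXISTS entries")
    db.execute(
        "CREATE TABLE entries (path TEXT PRIMARY KEY, category TEXT, created TEXT)"
    )
    for path in sorted(Path(root).rglob("*.txt")):
        meta, _ = parse_entry(path.read_text(encoding="utf-8"))
        db.execute(
            "INSERT INTO entries VALUES (?, ?, ?)",
            (path.relative_to(root).as_posix(), meta.get("category"), meta.get("created")),
        )
    db.commit()
    return db
```

The point of the design is that deleting the database loses nothing; running `rebuild_index()` over a fresh VCS checkout gets you back to the same state.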

I came to this view the hard way. DWiki uses a bunch of file metadata for various things, and this has caused me any number of problems. The specific one that led to my tweet is that both the 'category' of an entry and its Atom syndication ID are based on its path in the filesystem; if I realize I made a mistake in the category of an entry and fix it, the entry is going to appear as a new entry in my syndication feed (under its new name). However, the long-standing one is that DWiki uses the file modification time of an entry as the time it was written. Since common VCSes don't record and restore modification times, this means in practice that I can't keep DWiki pages in a VCS, and it leads to various other hacks (since I sometimes need to update entries).

(This issue of Atom syndication IDs has come up before.)
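To make the alternative concrete, here is a hedged Python sketch of keeping both pieces of information in the entry's text and deriving the filesystem state from it. The header format, the `created` and `id` field names, and the functions `read_header`, `restore_mtime`, and `entry_id` are all hypothetical; none of this is what DWiki actually does.

```python
import os
from datetime import datetime, timezone
from pathlib import Path


def read_header(path):
    """Return the 'key: value' header lines of an entry as a dict."""
    meta = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                break  # a blank line ends the header
            key, sep, value = line.partition(":")
            if sep:
                meta[key.strip().lower()] = value.strip()
    return meta


def restore_mtime(path):
    """Set the file's mtime from its 'created' header, if it has one."""
    created = read_header(path).get("created")
    if created is None:
        return None
    when = datetime.fromisoformat(created)
    if when.tzinfo is None:
        # Assumption for this sketch: timestamps without an offset are UTC.
        when = when.replace(tzinfo=timezone.utc)
    stamp = when.timestamp()
    os.utime(path, (stamp, stamp))
    return stamp


def entry_id(path):
    """Prefer an explicit 'id' header; fall back to the path if there is none."""
    return read_header(path).get("id") or Path(path).as_posix()
```

With something like this, running `restore_mtime()` over a fresh checkout would put the modification times back, and an entry with an explicit `id` line would keep the same ID when it is moved to a different category directory.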

Using file and filesystem metadata in your file based blog or website engine has an obvious and immediate attraction; it feels neat, clever, and appropriately Unixy. It's just that, in practice, it's a mistake (for several reasons) and over the long term it will bite you on the rear.

(Not using file metadata is one of the things I would now do differently in a file based blog engine.)

PS: Using the filesystem as a database is also a mistake in my opinion. It doesn't entirely violate the 'everything can be committed' principle, because VCSes will capture directory hierarchy state, but it's not really in a format that they like and deal well with.

Last modified: Fri Nov 2 22:30:11 2018