What I would now do differently in a file based blog engine

July 27, 2010

I recently mentioned in passing that if I were building a file based blog engine today, I would do a number of things differently from how they are done now. Given that file based blog engines are eternally popular, here's my list of those things.

First and foremost, I would not use file metadata for anything, most especially the file date. Using the date of an entry's file as the entry's date is superficially attractive but in practice it complicates your life quite a bit and it clashes badly with using pretty much any sort of modern version control, none of which will preserve that sort of file metadata.

This means that we need somewhere to put the necessary entry metadata. I would require as little metadata as possible to be in the actual files for entries, although I would allow it to be optionally placed there. Most metadata should instead be in plain text database files that are primarily maintained by the blog engine (but which are human editable). To the extent that metadata goes in the actual files, it should go at the end of the file, not at the start, in order to keep the clutter down; after all, the most important thing about an entry is the text you wrote, so it should be at the start of the file.

(Note that I don't consider the entry's title to be metadata in this sense. Metadata is things like the publication date, the unique ID of this entry, and so on.)
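As a rough sketch of what trailing metadata could look like, here is one way to split an entry file into its text and an optional metadata block at the end. The format is purely an assumption for illustration: a line containing only `--` followed by `key: value` lines. The names `split_entry` and `META_LINE` are made up for this example.

```python
import re

# Hypothetical convention: an optional trailing block of "key: value" lines,
# separated from the entry text by a line containing only "--".
META_LINE = re.compile(r"^([A-Za-z][A-Za-z0-9_-]*):\s*(.*)$")

def split_entry(raw):
    """Return (body, metadata) for the raw text of an entry file."""
    lines = raw.rstrip("\n").split("\n")
    meta = {}
    i = len(lines)
    # Walk backwards over trailing "key: value" lines.
    while i > 0:
        m = META_LINE.match(lines[i - 1])
        if not m:
            break
        meta[m.group(1).lower()] = m.group(2).strip()
        i -= 1
    # Only treat those lines as metadata if the separator line precedes them;
    # otherwise an entry that merely ends with "Note: ..." stays untouched.
    if meta and i > 0 and lines[i - 1].strip() == "--":
        body = "\n".join(lines[:i - 1]).rstrip("\n") + "\n"
        return body, meta
    return raw, {}
```

Requiring the separator means that a file with no metadata at all is still a valid entry, which fits the idea that metadata in the file is optional.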

Making an entry public would require an explicit publication step, done by running a command. Files that haven't been published would be visible only to authenticated blog users (the blog's author or authors) and would have no metadata (or would have fake metadata based on bad ideas, like the file modification time). There are two reasons for this. First, you need some way to view drafts through the blog engine without showing them to everyone. Second, requiring an explicit publication step gives the blog engine a chance to collect and make up the metadata about your new entry.
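The visibility rule here is simple enough to sketch directly. This assumes that the engine knows the set of published entry paths (from its metadata records) and the set of author logins; `can_view` and its parameters are hypothetical names, not part of any existing engine.

```python
def can_view(path, published_paths, user=None, authors=frozenset()):
    """Decide whether a request may see the entry stored at `path`.

    Published entries are visible to everyone. Anything else is a draft
    and is visible only to an authenticated author.
    """
    if path in published_paths:
        return True
    return user is not None and user in authors
```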

Blog entry metadata is both very important and hard. Building a blog engine that performs well means that you need to collect this information and put it in a single place, so that you do not have to walk all over the filesystem to work it out. In addition, some of the metadata is annoying for people to come up with when the computer can perfectly well either make it up itself (eg, unique entry identifiers suitable for use in Atom syndication feeds) or use a usually correct default value (eg, the publication date, which is usually going to be 'right when I ran the publication command'). All of this means that the sensible thing to do is to collect it once and record it somewhere (and then provide a way of updating that record if and when necessary).
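To make the 'computer makes it up' part concrete, here is a minimal sketch of what the publication step might generate for a new entry. It assumes a random UUID URN as the unique identifier (one common choice for Atom entry IDs, not the only one) and the current UTC time as the default publication date. The function name and record fields are invented for this example.

```python
import uuid
from datetime import datetime, timezone

def make_publication_record(path, now=None, entry_id=None):
    """Build the metadata a publish step could record for one entry.

    Both generated values can be overridden, so a record can be
    corrected or recreated later without changing the entry's identity.
    """
    if now is None:
        # Default publication date: right when the command was run.
        now = datetime.now(timezone.utc)
    if entry_id is None:
        # A random UUID URN such as "urn:uuid:..." never needs to change.
        entry_id = uuid.uuid4().urn
    return {
        "id": entry_id,
        "path": path,
        "published": now.isoformat(timespec="seconds"),
    }
```

The important property is that these values are generated once and then stored, so republishing an entry must reuse the recorded ID instead of calling this again with the defaults.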

(Some metadata is still going to have to be built by hand, such as the relationships between entries that you need in order to do good 'related posts' sections. Such metadata should not be embedded into entries but stored in separate files, so that it's easy to modify by itself.)

While the blog engine's internal database can be in whatever format is most convenient for it, the canonical versions of the information should be in human editable plain text files; the publication command updates these files as necessary. Among other things, this means that you can sensibly keep all of your blog's crucial files under version control. You'll want to commit every time you publish or republish, but you'll want to do that anyways; one way or another you need to capture the changes you've made.
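One possible shape for such a plain text file is one line per published entry, which diffs and merges well under version control. The sketch below assumes a made-up format of `ID PUBLISHED PATH` separated by whitespace, with `#` comment lines, and assumes that IDs and timestamps contain no spaces and that all timestamps use the same ISO 8601 form (so that sorting them as strings is chronological).

```python
def parse_index(text):
    """Parse index lines of the (hypothetical) form 'ID PUBLISHED PATH'."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The path comes last so that it may contain spaces.
        # A malformed line raises ValueError instead of being skipped.
        entry_id, published, path = line.split(None, 2)
        records.append({"id": entry_id, "published": published, "path": path})
    return records

def format_index(records):
    """Serialize records back to text, oldest first, for stable diffs."""
    lines = ["# id published path"]
    for r in sorted(records, key=lambda r: (r["published"], r["id"])):
        lines.append("%s %s %s" % (r["id"], r["published"], r["path"]))
    return "\n".join(lines) + "\n"
```

Writing the file in a fixed sort order means that publishing one entry changes one line, so the resulting commit shows exactly what happened.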

(An entry's unique ID is sufficiently important that I would add it to the entry's file. Among other things, this means that the blog engine can recover from you renaming or moving just the entry file without moving or updating metadata tracking files to go along with it.)
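Here is a sketch of how that recovery could work, assuming each published entry file carries an `id: ...` line somewhere near its end and that the engine's records are dictionaries with `id` and `path` keys. Both the line format and the function names are assumptions made for illustration.

```python
import os
import re

# Hypothetical in-file marker: a line of the form "id: <unique-id>".
ID_LINE = re.compile(r"^id:[ \t]*(\S+)[ \t]*$", re.MULTILINE)

def scan_entry_ids(root):
    """Map entry ID -> current path for every file under root that has an ID."""
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as f:
                matches = ID_LINE.findall(f.read())
            if matches:
                # Metadata lives at the end of the file, so take the last match.
                found[matches[-1]] = path
    return found

def relocate(records, root):
    """Return copies of the records with paths corrected from the files."""
    current = scan_entry_ids(root)
    return [dict(r, path=current.get(r["id"], r["path"])) for r in records]
```

Note that the paths returned here include `root` itself. Records whose ID is not found in any file keep their old path, so a deleted file shows up as a dangling record instead of silently vanishing.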

In terms of URL structure, I am relatively convinced that I don't want the entry's file name and directory hierarchy to have much, if anything, to do with the entry's URL; good human usable file names make for at best so-so URLs, as you can see all the time here on WanderingThoughts. This implies that the publication command should normally make up the URL of the entry based on something useful such as the original title of the entry (or perhaps random numbers).
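As an illustration of making up a URL from the title, here is a minimal slug generator. The specific rules (ASCII only, lowercase, at most eight words, a numeric suffix on collision) are arbitrary choices for this sketch, and the function names are invented.

```python
import re
import unicodedata

def slug_from_title(title, max_words=8):
    """Turn an entry title into a short, URL-safe slug."""
    # Fold accented characters to ASCII where possible and drop the rest.
    text = unicodedata.normalize("NFKD", title)
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    words = re.findall(r"[a-z0-9]+", text)
    return "-".join(words[:max_words])

def unique_slug(title, taken):
    """Pick a slug not already in `taken`, adding -2, -3, ... on collision."""
    base = slug_from_title(title) or "entry"
    slug, n = base, 2
    while slug in taken:
        slug = "%s-%d" % (base, n)
        n += 1
    return slug
```

Since the URL should come from the original title, the generated slug would be recorded with the rest of the entry's metadata at publication time and not recomputed if the title is edited later.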

Written on 27 July 2010.

Last modified: Tue Jul 27 00:57:25 2010