Why 'file as blog entry' blog engines have problems

June 7, 2008

One enduringly popular model for blogging engines is that entries are files in the filesystem and the blog engine just wraps them up in various simple ways. This approach has an attractive simplicity and thus an obvious appeal, but as I have found out from following this route myself, that simplicity hides a number of subtle problems.

All of the problems can be summarized in one word: metadata. Blog entries have (or need) quite a lot of metadata associated with them, and making files your entries does not give you very many good places to put this metadata:

  • you can infer the metadata from things surrounding the file itself; for example, you can infer the entry's publication date from the file's last modification time.

    There are two problems with this: first, this doesn't cover all of the metadata you need, and second, this creates awkward problems when the blog engine's use of file metadata clashes with things you want to do with the file, such as updating an entry without changing its publication date.

  • you can embed the metadata in the file itself, but this clutters up the file contents and makes authoring more annoying.

    (If you go this route, I suggest putting the metadata at the end of the file, not the start, so that it is at least less obtrusive.)

  • you can have your blog engine invent or learn the necessary metadata the first time it sees a new entry and record it somewhere else so that it's stable.

    The problem here is that you've effectively created a database (or you're using a real one), with all of the associated management issues, but things are half in the database and half outside for extra fun.
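
As a rough illustration of the third option, here is a minimal Python sketch of 'record the metadata the first time you see the entry'. The function name, the JSON index file, and the choice of the file's modification time as the initial publication date are all assumptions made for this example, not a description of how any particular blog engine works.

```python
import json
import os


def publication_times(entry_dir, index_path):
    """Return {filename: publication timestamp} for the entries in entry_dir.

    An entry's timestamp is taken from its modification time the first
    time it is seen, and then stored in a JSON index (a hypothetical
    sidecar file) so that later edits do not change it.
    """
    try:
        with open(index_path) as f:
            index = json.load(f)
    except FileNotFoundError:
        index = {}

    for name in sorted(os.listdir(entry_dir)):
        path = os.path.join(entry_dir, name)
        # Only record entries we have never seen before.
        if os.path.isfile(path) and name not in index:
            index[name] = os.stat(path).st_mtime

    with open(index_path, "w") as f:
        json.dump(index, f, indent=2)
    return index
```

Because the date is recorded on first sight, editing an entry later no longer changes its publication date. The sketch also shows the catch: rename a file and it is recorded again as a new entry while the old name lingers in the index, because the index and the filesystem are now two separate sources of truth.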

One popular additional non-answer is that you can decide to ignore the need for certain sorts of metadata; however, this will limit what your blog engine can do and periodically cause explosions.

(Ironically, a stable master identifier for entries is one of the easier things to arrange; if the filename is otherwise meaningless, and it probably is since you can't really use it as the entry title, you can just use it for the master identifier. Of course this will give you a blog directory full of peculiarly named things that you can never rename, but that's life without metadata.)
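
To make this concrete, here is a small sketch that derives a stable identifier from nothing but the filename. The tag: URI form (RFC 4151) is one format that is often used for entry ids in syndication feeds; the `example.org` authority, the year, and the function name are placeholders assumed for illustration.

```python
from urllib.parse import quote


def entry_id(filename, authority="example.org", year="2008"):
    """Build a tag: URI from the entry's filename.

    The id depends only on the filename, so it stays the same when the
    entry's contents or title change. It also changes if the file is
    ever renamed, which is exactly the limitation described above.
    """
    return "tag:%s,%s:%s" % (authority, year, quote(filename, safe=""))
```

For example, `entry_id("FileBlogProblems")` gives `tag:example.org,2008:FileBlogProblems` no matter what the entry's title is.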


Comments on this page:

From 83.10.163.90 at 2008-06-07 05:26:59:

Or you can ignore broken filesystems and use extended attributes.

Afaik vfat, reiser4 and ISO 9660 are the only 'broken' filesystems that are still used. All of them are easily replaceable so we can stop worrying about that.

PS why can't I use post titles as filenames?

By cks at 2008-06-07 11:21:48:

I believe that NFS doesn't support extended attributes. In general, I'm not really enthused about them, because my impression is that too many things are still immature, such as easy access from various languages, full support in tools like cp and rsync and tar, and their compatibility across various systems (important if someone's blog may someday move from Linux to FreeBSD, for example).
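
For reference, this is roughly what keeping a piece of entry metadata in an extended attribute looks like from Python. It is a sketch under assumptions: `os.setxattr` and `os.getxattr` are only available on Linux, the attribute name `user.blog.published` is made up for this example, and the functions simply give up if the platform or filesystem does not cooperate.

```python
import os

# Hypothetical attribute name; unprivileged programs on Linux generally
# have to use the "user." namespace.
ATTR = "user.blog.published"


def set_published(path, when):
    """Store a publication date string as an extended attribute.

    Returns False if extended attributes are unavailable on this
    platform or filesystem.
    """
    if not hasattr(os, "setxattr"):
        return False
    try:
        os.setxattr(path, ATTR, when.encode("utf-8"))
    except OSError:
        return False
    return True


def get_published(path):
    """Return the stored publication date, or None if there is none."""
    if not hasattr(os, "getxattr"):
        return None
    try:
        return os.getxattr(path, ATTR).decode("utf-8")
    except OSError:
        return None
```

The fallback paths are the point: the engine still needs some other place to keep the date when the attribute cannot be written or did not survive a copy.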

By cks at 2008-06-07 11:29:52:

As for post titles as filenames, there are two problems. First, you get long filenames with all sorts of characters in them that are awkward to handle in a Unix shell. Second, it means that you can't put a / in the title, which at a minimum rules out HTML markup and entry titles like this or this.

From 83.10.164.242 at 2008-06-07 14:05:39:

One can use sanitized names; you have to do this for URLs anyway. And a directory full of 'on-the-various-meanings-of-tag' is way better than any database or other binary blob.
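
A minimal sketch of what such sanitizing could look like, assuming a simple rule (lowercase, with every run of non-alphanumeric ASCII characters collapsed to a single hyphen); the function name and the exact rule are illustrative choices, not any engine's actual scheme.

```python
import re


def sanitize_title(title):
    """Turn an entry title into a shell- and URL-friendly filename."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")
```

This handles a / in the title, but the mapping is lossy: 'A/B' and 'A B' both become 'a-b', so the original title still has to be stored somewhere and collisions have to be dealt with.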

Regarding extended attributes, I only knew Solaris and FreeBSD; I didn't know how awful they are on Linux (no support from coreutils, only a broken special-purpose tool, on ext3 everything must fit in one block, etc.). They solve most problems, but using them is a PITA.

From 83.10.164.242 at 2008-06-07 14:11:16:

Oh, and NFS handles extended attributes nicely. After some googling I think Linux NFS understands them as well.

By cks at 2008-06-07 17:44:51:

I care a lot about friendly, usable filenames because my feeling is that the less you make the filenames something that people can write and use directly, the less advantage you are getting out of file-based storage. At the farthest extreme, the filenames are so unfriendly that you write your entries using scratch names and then run a script to publish them, so you might as well use a real database; the user is never manipulating your blog entry storage format anyways.

(It might make sense for that real database to use file based storage for the blog entries, but at that point it's an internal implementation detail.)

From 87.103.13.7 at 2008-06-10 04:08:03:

Well, honestly, I haven't had any trouble at all with my approach:

http://code.google.com/p/yaki/source/browse/trunk/yaki/webapps/ROOT/space/docs/Storage%20Format/index.txt

The metadata is readable, everything is indexable, and I can move stuff around without caring for breakage.

Mind you, I have an intermediate indexing step that avoids all the messy stuff.

Rui Carmo

From 138.38.56.53 at 2008-06-10 06:13:55:

Presumably there is nothing stopping you from making each file an Atom entry.

Phil Wilson

By cks at 2008-06-11 23:21:08:

The appeal of the file-based model is that you can hand author files and don't need to use a complex environment. Atom entries cannot be sanely written by hand, at least not by normal mortals, so going with 'each file is an Atom entry' basically kills the main advantage of using files at all.

On Yaki: right now it is missing at least one crucial piece of metadata for a blog engine specifically, namely a stable identifier in the proper format to be used in Atom syndication feeds and so on. You don't need this if you're just casually doing syndication feeds, but if syndication feeds are a big part of your software (and I maintain that they should be for blog engines), having unstable identifiers causes various sorts of heartache down the road when you want to do things like reorganize your physical file layout and names.

(It's also not clear how it creates permanent stable links to an entry.)

Written on 07 June 2008.

Last modified: Sat Jun 7 00:34:01 2008