The 'entry as file' blog engine problem with tags

December 24, 2013

I've recently started feeling a desire for DWiki to support some form of tags for entries (for reasons involving my personal site). This is a problem, because DWiki is very strongly a file based blog engine, one where every entry is a file in a directory hierarchy and everything you see is just a wrapper around that. In fact this is yet another example of the metadata problem that file based engines have in general.

In the abstract the problem with tags is that you want two-way querying on them: you want to know what tags an entry has when displaying the entry and you want to know what entries have a tag when displaying the tag. This is a terrible fit for a pure file based engine, because in such an engine the entry's file is the only source of information about it. There's no problem embedding tag metadata in entries (you can invent a number of schemes for it), but this only lets you handle one of the two queries efficiently. To find all entries that have a tag you have to both find and read all entries; this does not scale, to put it one way.

(In the old days you could sort of shrug and live with a very slow tag to entry generation process on the grounds that very few people were probably ever going to explore through your tags. There are several things wrong with this, but the killer is that on the modern web, everything gets visited. Web spiders grinding your blog into the ground is no fun.)

If you try hard you can use the filesystem as a database to solve this problem; for example, you might make a directory for each tag and then symlink (or note) each entry into its appropriate tag directories. This gives you quick lookup of the tag to entry query (you list the directory) but it requires a manual step that can get out of sync with the actual entry. This is the point where doom starts to descend and you invent schemes like the blog engine automatically maintaining these tag directories as it reads entries.

You can abandon the idea of the entry as the canonical source of all information and maintain tag mapping information in a separate file that the engine reads and turns into the obvious set of internal maps. This has the advantage of being easy and probably efficient (it depends on how big the tag file gets) and if you want you can automatically generate the file from metadata you add to the entries themselves. But it gets away from the purity of the whole entry as file concept; now there's a separate pile of information alongside the entry file.

I don't have any answers here and I've probably missed some options. I'm just thinking out loud so far.

(Right now I'm most inclined towards tag mapping information in a separate file rather than trying to abuse the filesystem as a database. It answers both queries at once and it's less of an ugly hack.)

Written on 24 December 2013.
« A good reason to use write-intent bitmaps
Procedures are not documentation »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Tue Dec 24 02:08:59 2013
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.