The 'entry as file' blog engine problem with tags

December 24, 2013

I've recently started feeling a desire for DWiki to support some form of tags for entries (for reasons involving my personal site). This is a problem, because DWiki is very strongly a file based blog engine, one where every entry is a file in a directory hierarchy and everything you see is just a wrapper around that. In fact this is yet another example of the metadata problem that file based engines have in general.

In the abstract the problem with tags is that you want two-way querying on them: you want to know what tags an entry has when displaying the entry and you want to know what entries have a tag when displaying the tag. This is a terrible fit for a pure file based engine, because in such an engine the entry's file is the only source of information about it. There's no problem embedding tag metadata in entries (you can invent a number of schemes for it), but this only lets you handle one of the two queries efficiently. To find all entries that have a tag you have to both find and read all entries; this does not scale, to put it one way.

(In the old days you could sort of shrug and live with a very slow tag to entry generation process on the grounds that very few people were probably ever going to explore through your tags. There are several things wrong with this, but the killer is that on the modern web, everything gets visited. Web spiders grinding your blog into the ground is no fun.)

If you try hard you can use the filesystem as a database to solve this problem; for example, you might make a directory for each tag and then symlink (or note) each entry into its appropriate tag directories. This gives you quick lookup of the tag to entry query (you list the directory) but it requires a manual step that can get out of sync with the actual entry. This is the point where doom starts to descend and you invent schemes like the blog engine automatically maintaining these tag directories as it reads entries.

You can abandon the idea of the entry as the canonical source of all information and maintain tag mapping information in a separate file that the engine reads and turns into the obvious set of internal maps. This has the advantage of being easy and probably efficient (it depends on how big the tag file gets) and if you want you can automatically generate the file from metadata you add to the entries themselves. But it gets away from the purity of the whole entry as file concept; now there's a separate pile of information alongside the entry file.

I don't have any answers here and I've probably missed some options. I'm just thinking out loud so far.

(Right now I'm most inclined towards tag mapping information in a separate file rather than trying to abuse the filesystem as a database. It answers both queries at once and it's less of an ugly hack.)

Comments on this page:

I like the idea of a database with metadata alongside a “purer” data store. Specifically I like it designed as a cache, populated/updated incrementally by the system during regular operation, containing only redundant data. It may be lost or deleted without any loss of data; the contents can be regenerated at any time from the actual source-of-truth data store by a batch process. It is there purely to make some operations efficient that would otherwise be slow or outright infeasible.

But I am fine with the system depending on a properly populated metadata cache in order to yield correct results. I do not even think it is necessary for the system to be able to detect that the cache is incompletely populated, so long as it can be regenerated at will any time.

(I like non-DBMS source-of-truth data stores that revolve around an exposed, toolable storage format, at the expense of querying efficiency (in my case a Git object store rather than the file system), which IMO makes them better suited to be sources of truth. DBMSs optimise in the other direction, so it makes sense to employ them to make queries fast.

(Ad-hoc queries that is. If your queries are set in stone then custom-designing an in-memory data structure for them and a matching traversal algorithm in a low-level language will beat a DBMSby orders of magnitude. But it may also be orders of magnitude more effort, so may well not be justified. (Other times it is.)))

Written on 24 December 2013.
« A good reason to use write-intent bitmaps
Procedures are not documentation »

Page tools: View Source, View Normal, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Tue Dec 24 02:08:59 2013
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.