Atom feeds and constrained (web) environments

March 25, 2012

Recently, Aristotle Pagaltzis noticed that I had misspelled the filename of this entry. When I renamed it to fix this, the WanderingThoughts Atom feed repeated the entry (and the comments feed repeated the comments on it). This led to a discussion about Atom entry IDs between Aristotle and me, which I am now going to surface as an actual entry so I can discuss DWiki's problem with Atom entries at length.

Atom is in general a nice feed format, but it has one awkward requirement: it absolutely demands that you give each of your feed entries a globally unique ID. Anything that parses an Atom feed is entitled to assume that a new ID means a new entry, regardless of anything else (eg, the text exactly duplicating another entry). This requires per-entry metadata, and metadata is one of the deep problems of file-based content engines. Aristotle suggested:

But isn't it feasible to append a header (well, footer) line to an article file containing an ID, the first time the file is found not to have one?

Part of the problem is the practical difficulty of doing this. For instance, you need some sort of locking so that two simultaneous requests for the Atom feed do not both attempt to invent the Atom ID for an entry, and then you get to worry about the user also editing the file at the same time. All of these difficulties are why I would require an explicit 'publication' step if I were writing a new file-based blog engine (I discussed this here).
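To make the locking issue concrete, here is a minimal sketch of the "append an ID footer on first sight" idea in Python. The `atom-id:` footer format and the `ensure_id` name are made up for illustration, and it assumes a Unix system where `fcntl.flock` is available. Note that the lock is only advisory: it serializes two feed requests against each other, but it does nothing to stop a text editor that is rewriting the same file.

```python
import fcntl
import re
import uuid

# Hypothetical footer format; nothing in DWiki defines this.
ID_LINE = re.compile(r"^atom-id: (\S+)$", re.MULTILINE)

def ensure_id(path):
    """Return the entry's ID, appending a fresh one if the file has none."""
    with open(path, "r+", encoding="utf-8") as f:
        # Exclusive advisory lock; a second caller blocks here until we finish.
        fcntl.flock(f, fcntl.LOCK_EX)
        text = f.read()
        found = ID_LINE.search(text)
        if found:
            return found.group(1)
        new_id = uuid.uuid4().urn
        f.seek(0, 2)  # move to the end of the file before appending
        if text and not text.endswith("\n"):
            f.write("\n")
        f.write(f"atom-id: {new_id}\n")
        f.flush()
        return new_id
    # The lock is released when the file is closed by the with block.
```

Even this small version needs write access to the entry file, which is exactly what the rest of this entry argues against.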

Beyond those problems, DWiki (the code behind WanderingThoughts) operates in a uniquely constrained environment; it was written to only require read access to the files that it was serving. Partly this is because it might run as a different user (for example, the web server user), and partly this is because I don't like to give web applications that much power and freedom; it's much easier to feel confident about code that writes things only in very constrained and limited ways. Beyond DWiki's specific circumstances I think that this is a good constraint to assume in general for a file-based system, because modifying files on the fly plays badly with things like keeping the files for your website or blog in a VCS repository (which is one of the big attractions of a file-based engine).

In this sort of environment you simply don't have a unique ID for entries. There is nothing that exists in the filesystem that you can safely use, and you have no way to make up an ID yourself and firmly associate it with the entry. Almost the best you can do is use the filename as the unique ID and hope that it changes only rarely. This is pretty much what DWiki does, and that means that on the rare occasions when I have to rename an entry, I violate the Atom spec.
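As an illustration of why a rename shows up as a new entry, here is a small sketch of deriving an ID purely from the filename. It uses the `tag:` URI scheme as one plausible format; the authority, the date, and the example paths are all invented, and this is not DWiki's actual ID format.

```python
from urllib.parse import quote

# Hypothetical values for illustration only.
AUTHORITY = "example.org"
MINT_DATE = "2012"

def entry_id_from_path(rel_path):
    """Build a tag: URI from an entry's path relative to the blog root."""
    normalized = rel_path.strip("/")
    # quote() leaves '/' alone by default, so the path structure survives.
    return f"tag:{AUTHORITY},{MINT_DATE}:/{quote(normalized)}"

# Made-up filenames standing in for a misspelled entry and its corrected name.
before = entry_id_from_path("blog/web/MispelledEntry")
after = entry_id_from_path("blog/web/MisspelledEntry")
renamed_looks_new = before != after  # True: feed readers see a brand new entry
```

The ID is stable for exactly as long as the filename is, which is the whole weakness.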

Sidebar: doing better with more work

It's possible to do somewhat better than just using the filename as part of the ID. Ignoring the locking issues for the moment, what you need to do is make up an ID the first time you see an entry and then record the file-to-ID association in a separate file. Using a separate file avoids all of the issues with updating the entry itself, and still allows the user to correct the mapping by hand if they ever have to rename an entry's file.
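A rough sketch of the separate-file approach, still ignoring locking as the paragraph above does. The JSON map format and the function names (`atom_id_for`, `rename_entry`) are assumptions made for this example, not anything DWiki implements.

```python
import json
import os
import tempfile
import uuid

def load_map(map_path):
    """Read the path-to-ID map; a missing file just means no IDs yet."""
    try:
        with open(map_path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def save_map(map_path, mapping):
    # Write a temporary file and rename it so readers never see a partial map.
    dir_name = os.path.dirname(os.path.abspath(map_path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(mapping, f, indent=1, sort_keys=True)
    os.replace(tmp_path, map_path)

def atom_id_for(map_path, rel_path):
    """Return the recorded ID for an entry, inventing one on first sight."""
    mapping = load_map(map_path)
    if rel_path not in mapping:
        mapping[rel_path] = uuid.uuid4().urn
        save_map(map_path, mapping)
    return mapping[rel_path]

def rename_entry(map_path, old_rel_path, new_rel_path):
    """Carry an ID over to a renamed entry; raises KeyError if it had none."""
    mapping = load_map(map_path)
    mapping[new_rel_path] = mapping.pop(old_rel_path)
    save_map(map_path, mapping)
```

The only file this writes is the map itself, so the entry files can stay read-only and live untouched in a VCS repository. Because the map is plain text, the rename step could equally be done by hand in an editor.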


Comments on this page:

From 66.31.36.20 at 2012-03-25 09:04:55:

What if you made your text editor generate the guid when you started work on the entry?

From 70.31.29.25 at 2012-03-25 11:02:42:

Another possible option would be to use the inode of the file. Even if you edit or rename the file the inode won't change, so your IDs shouldn't either.

So if you take UUIDs, which have the form:

xxxxxxxx-xxxx-4xxx-yxxx-iiiiiiiiiiii

On installation / initialization, the blogging software generates a random "prefix" and fills in all the x's; this can be put in the config file so one can move between systems, but should allow for global uniqueness to a reasonable degree.

Then, when the Atom feed is generated, the software stat(2)s the file and appends the inode number where the i's are. Any future edits shouldn't change the inode (or vnode in the case of NFS), but a mv/cp scenario could still arise.

For example, I have a habit of making edits in the following way for files:

$ mv file.txt file.txt.2012-03-25
$ cp file.txt.2012-03-25 file.txt

This way the original timestamp of the file isn't changed.

Another possible area of concern is version control software that doesn't do "strict" in-place editing, but deletes and recreates files; ditto for "sed -i".
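A rough Python sketch of the prefix-plus-inode scheme described above, for illustration only. The function names are invented, the prefix is taken from a random version 4 UUID, and the inode is masked to 48 bits so it fits the twelve hex digits of the last field (an assumption; larger inode numbers would be truncated).

```python
import os
import uuid

def make_prefix():
    """Random part generated once at install time: 'xxxxxxxx-xxxx-4xxx-yxxx'."""
    # A UUID string is 8-4-4-4-12 hex digits; the first 23 characters
    # cover everything before the final dash.
    return str(uuid.uuid4())[:23]

def inode_id(prefix, path):
    """Combine the stored prefix with the file's current inode number."""
    inode = os.stat(path).st_ino
    # Keep the low 48 bits so the value fits in 12 hex digits.
    return f"{prefix}-{inode & 0xFFFFFFFFFFFF:012x}"
```

A plain rename keeps the ID, but the mv-then-cp habit shown above gives file.txt a new inode and therefore a new ID.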

By cks at 2012-03-25 14:18:59:

The inode number is a terrible choice for a unique identifier because there are any number of things that will change it. It's relatively common for editors to not overwrite in place for safety reasons, I don't think version control systems will overwrite in place if you are checking out files, and if you ever have to do things like restore from backups or move from system to system (or filesystem to filesystem) the inode number is sure to change.

If you can force authors to explicitly include a unique ID in the file you're fine, but I'm assuming that you can't (and you can't necessarily guarantee that they will make the ID unique). Refusing to serve a page at all because its file lacks an Atom ID is rather hostile, and serving a page as HTML but not including it in Atom feeds is a great way to cause all sorts of problems.

Written on 25 March 2012.

Last modified: Sun Mar 25 01:11:33 2012