Atom feeds and constrained (web) environments
Recently, Aristotle Pagaltzis noticed that I had misspelled the filename of this entry. When I renamed it to fix this, the WanderingThoughts Atom feed repeated the entry (and the comments feed repeated the comments on it); this led to a discussion on Atom entry IDs between Aristotle and me that I am now going to surface as an actual entry so I can discuss DWiki's problem with Atom entries at length.
Atom is in general a nice feed format, but it has one awkward requirement: it absolutely demands that you give each of your feed entries a permanent, globally unique ID, one that stays the same for the life of the entry. Anything that parses an Atom feed is entitled to assume that a new ID means a new entry, regardless of anything else (eg, the text exactly duplicating another entry). This requires per-entry metadata, and metadata is one of the deep problems of file-based content engines. Aristotle suggested:
But isn't it feasible to append a header (well, footer) line to an article file containing an ID, the first time the file is found not to have one?
Part of the problem is the practical difficulty of doing this. For instance, you need some sort of locking so that two simultaneous requests for the Atom feed do not both attempt to invent the Atom ID for an entry, and then you get to worry about the user also editing the file at the same time. All of these difficulties are why I would require an explicit 'publication' step if I were writing a new file-based blog engine (I discussed this here).
Beyond those problems, DWiki (the code behind WanderingThoughts) operates in a uniquely constrained environment; it was written to only require read access to the files that it was serving. Partly this is because it might run as a different user (for example, the web server user), and partly this is because I don't like to give web applications that much power and freedom; it's much easier to feel confident about code that writes things only in very constrained and limited ways. Beyond DWiki's specific circumstances I think that this is a good constraint to assume in general for a file-based system, because modifying files on the fly plays badly with things like keeping the files for your website or blog in a VCS repository (which is one of the big attractions of a file-based engine).
In this sort of environment you simply don't have a unique ID for entries. There is nothing that exists in the filesystem that you can safely use, and you have no way to make up an ID yourself and firmly associate it with the entry. Almost the best you can do is use the filename as the unique ID and hope that it changes only rarely. This is pretty much what DWiki does, and that means that on the rare occasions when I have to rename an entry, I violate the Atom spec.
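To make the failure concrete, here is a minimal sketch of a filename-derived ID. It is not DWiki's actual code or ID format; the function name entry_id, the tag: URI layout, and the example.org authority and paths are all assumptions made up for illustration. The point is only that the ID is a pure function of the path, so a rename necessarily produces a different ID.

```python
from urllib.parse import quote


def entry_id(authority: str, start_date: str, path: str) -> str:
    """Build a tag: URI style ID from an entry's path (hypothetical layout)."""
    # Normalise slashes so the same file always yields the same ID,
    # and percent-encode anything that is not safe in a URI.
    clean = "/" + quote(path.strip("/"), safe="/")
    return f"tag:{authority},{start_date}:{clean}"


# Hypothetical paths: the entry under its misspelled and corrected names.
old_id = entry_id("example.org", "2005", "blog/tech/AtomFeedz")
new_id = entry_id("example.org", "2005", "blog/tech/AtomFeeds")

# Same content, different filename, therefore a different Atom ID.
assert old_id != new_id
```

A feed reader that remembers old_id has no way to know that new_id is the same entry, which is exactly why the renamed entry showed up again.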
Sidebar: doing better with more work
It's possible to do somewhat better than just using the filename as part of the ID. Ignoring the locking issues for the moment, what you need to do is make up an ID the first time you see an entry and then record the file-to-ID association in a separate file. Using a separate file avoids all of the issues with updating the entry itself, and still allows the user to correct the mapping by hand if they ever have to rename an entry's file.
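As a rough sketch of this (and not anything DWiki actually does), the mapping could live in a small JSON file. The file format, the function names id_for and rename_entry, and the use of random UUIDs as IDs are all my assumptions here. Locking is deliberately left out, as above, although the file is at least replaced atomically so a reader never sees a half-written map.

```python
import json
import os
import tempfile
import uuid


def load_map(map_path):
    """Read the path-to-ID map; a missing file is just an empty map."""
    try:
        with open(map_path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def save_map(map_path, mapping):
    """Write the map to a temporary file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(map_path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(mapping, f, indent=2, sort_keys=True)
    os.replace(tmp, map_path)


def id_for(map_path, entry_path):
    """Return the entry's ID, inventing and recording one on first sight."""
    mapping = load_map(map_path)
    if entry_path not in mapping:
        mapping[entry_path] = uuid.uuid4().urn  # e.g. "urn:uuid:..."
        save_map(map_path, mapping)
    return mapping[entry_path]


def rename_entry(map_path, old_path, new_path):
    """The by-hand fix after renaming a file: carry the old ID over."""
    mapping = load_map(map_path)
    mapping[new_path] = mapping.pop(old_path)
    save_map(map_path, mapping)
```

The feed generator would call id_for() for each entry it emits, and after renaming an entry's file you would run rename_entry() (or just edit the JSON) so the new filename keeps the old ID. Note that this still needs write access to somewhere, just not to the entries themselves.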
Link: Getting Real About Distributed System Reliability
Jay Kreps' Getting Real About Distributed System Reliability is a very interesting discussion of the reliability of distributed systems in the real world. He patiently explains that a number of assumptions normally made to reason about this are in fact wrong in practice, especially the assumption that failures are independent. I'm not going to try to summarize his entry beyond that; go read it instead.
(I suspect that his logic extends to all real systems, not just distributed ones, and in any case he has given me a lot to think about.)
By the way, several of the links in his entry are themselves worth following and reading carefully.
(I believe I got this from my Twitter stream but I cannot find the original source now.)