URLs are terrible permanent identifiers for things

May 25, 2017

I was recently reading the JSON Feed version 1 specification (via Trivium, among other places). I have a number of opinions on it as a syndication feed format, but that's not the subject of today's entry, because in the middle of the specification I ran into the following bit (which is specifically talking about the elements of feed entries, ie posts):

  • id (required, string) is unique for that item for that feed over time. [...] Ideally, the id is the full URL of the resource described by the item, since URLs make great unique identifiers.

(Emphasis mine.)

When I read this bit, I had an immediate pained reaction. As someone who has been running a blog for more than ten years and has made this exact mistake, let me assure you that URLs make terrible permanent unique identifiers for things. Yes, yes, cool URLs don't change, as the famous writeup says. Unfortunately in the real world, URLs change all of the time. One reason for this that is especially relevant right now is that URLs include the protocol, and right now the web is in the process of a major shift from HTTP to HTTPS. That shift just changed all your URLs.

(I think that over the next ten years the web will wind up being almost entirely HTTPS, even though much of it is not HTTPS today, so quite a lot of people will be going through this URL transition in the future.)

This is not the only case that may force your hand. And beyond more or less forced changes, you may someday move your blog from one domain to another or change the real URLs of all of your entries because you changed the blog system that you use (both of which has happened). In theory you can create a system to generate syndication feeds that deals with all of that, by having a 'URL for id' field of some sort (perhaps automatically derived from your configuration of URL redirections), but if you're going to wind up detaching what you put in the id field from the actual canonical URL of the entry, why not make it arbitrary in the first place? It will save you a bunch of pain to do this from the start.

(Please trust me on this one, seeing as this general issue has caused me pain. As my example illustrates, using any part of the URL as part of your 'permanent identifier' is going to cause you heartburn sooner or later.)

There are excellent reasons why the Atom syndication format both explicitly allows for and more importantly encourages various forms of permanent identifiers for feed entries that are not URLs. For example, you can use UUIDs (as 'urn:uuid:<uuid>') or your own arbitrary but unique identifier in your own namespace (as tag: URNs). The Atom format does this because the people who created it had already run into various problems with the widespread use of URLs as theoretically permanent entry identifiers in RSS feeds.

Written on 25 May 2017.
« Exploiting Python's Global Interpreter Lock for atomic operations is fun
Using Linux's Magic Sysrq on modern keyboards without a dedicated Syrq key »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Thu May 25 01:26:24 2017
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.