Wandering Thoughts archives

2020-09-10

My take on permanent versus temporary HTTP redirects in general

When I started digging into the HTTP world (which was around the time I started writing DWiki), the major practical difference between permanent and temporary HTTP redirects was that browsers aggressively cached permanent redirects. This meant that permanent redirects were somewhat of a footgun; if you got something wrong about the redirect or changed your mind later, you had a problem (and other people could create problems for you). While there are ways to clear permanent redirects in browsers, they're generally so intricate that you can't count on visitors to do them (here's one way to do it in Firefox).

(Since permanent redirects fix in place both the fact that the source URL is being redirected and what the target URL is, they provide not one but two ways for what you thought was permanent and fixed to need to change. In a world where cool URLs change, permanence is a dangerous assumption.)

Also, back then, in theory, syndication feed readers, web search engines, and other things that care about the canonical URL of a resource would use a permanent redirect as a sign to update it. This worked some of the time in some syndication feed readers for updating feed URLs, but definitely not always; software authors had to go out of their way to support this, and there were things that could go wrong (cf). Even back then I don't know if web search engines paid much attention to it as a signal.

All of this got me to use temporary redirections almost all of the time, even in situations where I thought that the redirection was probably permanent. That Apache and other things made temporary redirections the default also meant that it was somewhat easier to set up my redirects as temporary instead of permanent. Using temporary redirects potentially meant somewhat more requests and a somewhat longer delay before some people with some URLs got the content, but I didn't really care, not when set against the downsides of getting a permanent redirect wrong or needing to change it after all.
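
(As an illustration of the code side of this, here is a minimal WSGI-style sketch of issuing a redirect that defaults to a temporary one, much as Apache's Redirect directive defaults to a 302. The helper name and its 'permanent' flag are hypothetical, not anything from DWiki or Apache.)

    # A hedged sketch: issue an HTTP redirect, defaulting to temporary (302).
    # The helper name and the 'permanent' flag are hypothetical.
    def redirect(start_response, target_url, permanent=False):
        status = '301 Moved Permanently' if permanent else '302 Found'
        start_response(status, [('Location', target_url),
                                ('Content-Type', 'text/plain; charset=utf-8')])
        return [('Redirecting to %s\n' % target_url).encode('utf-8')]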

In the modern world, I'm not sure how many people will have permanent HTTP redirects cached in their browsers any more. Many people browse in more constrained environments where browsers are throwing things out on a regular basis (i.e. phones and tablets), browser developers have probably gotten at least a bit tired of people complaining about 'this redirect is stuck', and I'm sure that some people have abused that long-term cache of permanent redirects to fingerprint their site visitors. On the one hand, this makes the drawback of permanent redirects less important, but on the other hand it makes their advantages smaller.

Today I still use temporary redirects most of the time, even for theoretically permanent things, but I'm not really systematic about it. Now that I've written this out, maybe I will start to be, and just say that it's temporary redirects for me from now on unless there's a compelling reason to use a permanent redirect.

(One reason to use a permanent redirect would be if the old URL has to go away entirely at some point. Then I'd want as strong a signal as possible that the content really has migrated, even if only some things will notice. Some is better than none, after all.)

PermanentVsTemporaryRedirects written at 23:49:17

Permanent versus temporary redirects when handling extra query parameters on your URLs

In yesterday's entry on what you should do about extra query parameters on your URLs, I said that you should answer with a HTTP redirect to the canonical URL of the page and that I thought this should be a permanent redirect instead of a temporary one for reasons that didn't fit into the entry. Because Aristotle Pagaltzis asked, here is why I think permanent redirects are the right option.

As far as I know, there are two differences in client behavior (including web spider behavior) between permanent HTTP redirects and temporary ones: clients don't cache temporary redirects, and they don't consider them to change the canonical URL of the resource. If you use permanent redirects, you thus probably make it more likely that web search engines will conclude that your canonical URL really is the canonical URL and they don't need to keep re-checking the other one, with the potential downside of having browsers cache the redirect and never re-check it.

So the question is if you'll ever want to change the redirect or otherwise do something else when you get a request with those extra query parameters. My belief is that this is unlikely. To start with, you're probably not going to reuse other people's commonly used extra query parameters for real query parameters of your own, because other people use them and will likely overwrite your values with theirs.

(In related news, if you were previously using a 's=..' query parameter for your own purposes on URLs that people will share around social media, someone out there has just dumped some pain on top of you. Apparently it may be Twitter instead of my initial suspect of Slack, based on a comment on this entry.)

If you change the canonical URL of the page, you're going to need a redirect for the old canonical URL anyway, so people with the 'extra query parameters' redirect cached in their browser will just get another redirect. They can live with that.

The only remaining situation I can think of where a cached permanent redirection would be a problem would be if you want to change your web setup so that you deliberately react to specific extra query parameters (and possibly their values) by changing your redirects or rendering a different version of your page (without a redirect). This strikes me as an unlikely change for most of my readers to want to make (and I'm not sure how common customizing pages to the apparent traffic source is in general).

(Also, browsers don't cache permanent redirects forever, so you could always turn the permanent redirects into temporary ones for a few months, then start doing the special stuff.)

PS: I don't think most clients do anything much about changing the 'canonical URL' of a resource if the initial request gets a permanent redirect. Even things like syndication feed readers don't necessarily update their idea of your feed's URL if you provide permanent redirects, and web browsers are even less likely to change things like a user's bookmarks. These days, even search engines may more or less ignore it, because people do make mistakes with their permanent redirects.

HandlingExtraQueryParametersII written at 00:10:04

2020-09-08

What you should do about extra query parameters on your URLs

My entry on how web server laxness created a de facto requirement to accept arbitrary query parameters on your URLs got a number of good comments, so I want to agree with and amplify the suggestion about what to do with these parameters. First off, you shouldn't reject web page requests with extra query parameters. I also believe that you shouldn't just ignore them and serve the regular version of your web page. Instead, as several commentators said, you should answer with a HTTP redirect to the canonical URL of the web page, which will have at least the extra query parameters stripped.

(I think that this should be a permanent HTTP redirect instead of a temporary one for reasons that don't fit within the margins of this entry. Also, this assumes that you're dealing with a GET or a HEAD request.)
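
(To make that concrete, here is a minimal WSGI-style sketch of the redirect, meant to be called by a request dispatcher before normal page handling: keep only the query parameters you recognize and send a permanent redirect to the canonical URL if anything had to be stripped. The allow-list and function name are hypothetical; DWiki's actual handling is more involved.)

    # A hedged sketch: strip unrecognized query parameters and permanently
    # redirect to the canonical URL. KNOWN_PARAMS is a hypothetical allow-list.
    from urllib.parse import parse_qsl, urlencode

    KNOWN_PARAMS = {'page', 'atom'}

    def canonical_redirect(environ, start_response):
        if environ.get('REQUEST_METHOD') not in ('GET', 'HEAD'):
            return None  # only redirect GET and HEAD, per the caveat above
        query = parse_qsl(environ.get('QUERY_STRING', ''), keep_blank_values=True)
        kept = [(k, v) for k, v in query if k in KNOWN_PARAMS]
        if len(kept) == len(query):
            return None  # nothing to strip; let normal page handling proceed
        target = environ.get('PATH_INFO', '/') or '/'
        if kept:
            target += '?' + urlencode(kept)
        start_response('301 Moved Permanently',
                       [('Location', target), ('Content-Type', 'text/plain')])
        return [b'Redirecting to the canonical URL\n']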

Answering with a HTTP redirect instead of the page has two useful or important effects, as pointed out by commentators on that entry. First, any web search engines that are following those altered links won't index duplicate versions of your pages and get confused about which is the canonical one (or downrate you in results for having duplicate content). Second, people who copy and reshare the URL from their browser will be sharing the canonical URL, not the messed up version with tracking identifiers and other gunk. This assumes that you don't care about those tracking identifiers, but I think this is true for most of my readers.

(In addition, you can't count on other people's tracking identifiers to be preserved by third parties when your URLs get re-shared. If you want to track that sort of stuff, you probably need to add your own tracking identifier. You might care about this if, for example, you wanted to see how widely a link posted on Facebook spread.)

However, this only applies to web pages, not to API endpoints. Your API endpoints (even GET ones) should probably error out on extra query parameters unless there is some plausible reason they would ever be usefully shared through social media. If your API endpoints never respond with useful HTML to bare GETs, this probably doesn't apply. If you see a lot of this happening with your endpoints, you might make them answer with HTTP redirects to your API documentation or something like that instead of some 4xx error status.

(But you probably should also try to figure out why people are sharing the URLs of your API endpoints on social media, and other people are copying them. You may have a documentation issue.)
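
(The strict version for API endpoints might look like this sketch, again with a hypothetical allow-list; anything unexpected gets a 400 instead of a redirect.)

    # A hedged sketch: an API endpoint that rejects unexpected query
    # parameters with a 400. The parameter names are hypothetical.
    from urllib.parse import parse_qsl

    API_PARAMS = {'format', 'since'}

    def api_endpoint(environ, start_response):
        query = parse_qsl(environ.get('QUERY_STRING', ''), keep_blank_values=True)
        unexpected = sorted({k for k, _ in query if k not in API_PARAMS})
        if unexpected:
            start_response('400 Bad Request', [('Content-Type', 'text/plain')])
            return [('Unexpected query parameters: %s\n'
                     % ', '.join(unexpected)).encode('utf-8')]
        start_response('200 OK', [('Content-Type', 'application/json')])
        return [b'{"ok": true}\n']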

PS: As you might suspect, this is what DWiki does, at least for the extra query parameters that it specifically recognizes.

HandlingExtraQueryParameters written at 23:41:59

2020-09-07

URL query parameters and how laxness creates de facto requirements on the web

One of the ways that DWiki (the code behind Wandering Thoughts) is unusual is that it strictly validates the query parameters it receives on URLs, including on HTTP GET requests for ordinary pages. If a HTTP request has unexpected and unsupported query parameters, such a GET request will normally fail. When I made this decision it seemed the cautious and conservative approach, but this caution has turned out to be a mistake on the modern web. In practice, all sorts of sites will generate versions of your URLs with all sorts of extra query parameters tacked on, give them to people, and expect them to work. If your website refuses to play along, (some) people won't get to see your content. On today's web, you need to accept (and then ignore) arbitrary query parameters on your URLs.

(Today's new query parameter is 's=NN', for various values of NN like '04' and '09'. I'm not sure what's generating these URLs, but it may be Slack.)
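
(In code terms, 'accept and then ignore' is the easy path: pull out the query parameters you recognize and never look at the rest. Here is a minimal WSGI-style sketch with a made-up 'page' parameter; it's an illustration, not DWiki's code.)

    # A hedged sketch of 'accept and then ignore': read only the parameters
    # you recognize and silently drop everything else (utm_*, fbclid, s=NN, ...).
    from urllib.parse import parse_qs

    def page_view(environ, start_response):
        params = parse_qs(environ.get('QUERY_STRING', ''))
        page = params.get('page', ['1'])[0]  # 'page' is a hypothetical parameter
        start_response('200 OK', [('Content-Type', 'text/html; charset=utf-8')])
        return [('<p>You are on page %s.</p>\n' % page).encode('utf-8')]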

You might wonder how we got here, and that is a story of lax behavior (or, if you prefer, being liberal in what you accept). In the beginning, both Apache (for static web pages) and early web applications often ignored extra query parameters on URLs, at least on GET requests. I suspect that other early web servers also imitated Apache here, but I have less exposure to their behavior than Apache's. My guess is that this behavior wasn't deliberate; it was just the simplest way to implement both Apache and early web applications, since you paid attention to what you cared about and didn't bother to explicitly check that nothing else was supplied.

When people noticed that this behavior was commonplace and widespread, they began using it. I believe that one of the early uses was for embedding 'where this link was shared' information for your own web analytics (cf), either based on your logs or using JavaScript embedded in the page. In the way of things, once this was common enough other people began helpfully tagging the links that were shared through them for you, which is why I began to see various 'utm_*' query parameters on inbound requests to Wandering Thoughts even though I never published such URLs. Web developers don't leave attractive nuisances alone for long, so soon enough people were sticking extra query parameters onto your URLs that were mostly for them and not so much for you. Facebook may have been one of the early pioneers here with their 'fbclid' parameter, but other websites have hopped on this particular train since then (as I saw recently with these 's=NN' parameters).

At this point, the practice of other websites and services adding random query parameters to your URLs as they pass through them is so widespread and common that accepting random query parameters is pretty much a practical requirement for any web content serving software that wants to see wide use and not be irritating to the people operating it. If, like DWiki, you stick to your guns and refuse to accept some or all of them, you will drop some amount of your incoming requests from real people, disappointing would-be readers.

This practical requirement for URL handling is not documented in any specification, and it's probably not in most 'best practices' documentation. People writing new web serving systems that are tempted to be strict and safe and cautious get to learn about it the hard way.

In general, any laxness in actual implementations of a system can create a similar spiral of de facto requirements. Something that is permitted and is useful to people will be used, and then supporting that becomes a requirement. This is especially the case in a distributed system like the web, where any attempt to tighten the rules would only be initially supported by a minority of websites. These websites would be 'outvoted' by the vast majority of websites that allow the lax behavior and support it, because that's what happens when the vast majority work and the minority don't.

DeFactoQueryParameters written at 00:17:18

2020-09-04

In practice, cool URLs change (eventually)

The idea that "cool URLs don't change" has been an article of faith for a very long time. However, at this point we have more than 20 years of experience with the web, and anyone who's been around for a significant length of time can tell you that in practice, cool URLs change all of the time (and I don't mean just minor changes like preferring HTTPS over HTTP). Over a sufficient length of time, internal site page layouts change (sometimes because URL design is hard), people move domains or hosts within a domain, and sometimes cool URLs even go away and must be resurrected, sometimes by hand (through people re-publishing and re-hosting things) and sometimes through the Wayback Machine. This decay in cool URLs is so pervasive and well recognized that we have a term for it, link rot.

(Of course, you're a good person, and your cool URLs don't change. But this is the web and we all link to each other, so it's inevitable that some other people's cool URLs that you link to will suffer from link rot.)

Despite link rot being widely recognized as very real, I think that in many ways we're in denial about it. We keep pretending (both culturally and technically) that if we wish hard enough and try hard enough (and yell at people hard enough), all important URLs will be cool URLs that are unchanging forever. But this is not the case and is never going to be the case, and it's long past time that we admitted it and started dealing with it. Whether we like it or not, it is better to deal with the world of the web as it is.

Culturally, we recite "cool URLs don't change" a lot, which makes it hard to talk about how best to evolve URLs over time, how to preserve content that you no longer want to host, and other issues like that. I don't think anyone's written a best practices document for 'so you want to stop having a web site (but people have linked to it)', never mind what a company can do to be friendly for archiving when it goes out of business or shuts down a service. And that's just scratching the surface; there's a huge conversation to be had about the web over the long term once we admit out loud that nothing is forever around here.

(The Archive Team has opinions. But there are some hard issues here; there are people who have published words on the Internet, not under CC licenses, and then decided for their own reasons that they no longer want those words on the Internet despite the fact that other people like them, linked to them a lot, and so on.)

Technically, how we design our web systems and web environments often mostly ignores the possibility of future changes in either our own cool URLs or other people's. What this means in more tangible terms is really a matter for other entries, but if you look around you can probably come up with some ideas of your own. Just look for the pain points in your own web publishing environment if either your URLs or other people's URLs changed.

(One pain point and sign of problems is that spidering your own site to find all of the external URLs, so you can check whether they're still alive, is a routine activity. Another pain point is that it can be so hard to automatically tell if a link is still there, since not all dead links either fail entirely or result in HTTP error codes. Just ask people who have links pointing to what are now parked domains.)
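
(For the 'is it still alive' half of that checking, a minimal sketch might look like the following. It only catches outright connection failures and HTTP error statuses, which is exactly the limitation just mentioned, so a parked domain that answers 200 sails right through. The timeout and user agent string are arbitrary choices of mine.)

    # A hedged sketch: check whether an external link still answers with a
    # non-error HTTP status. It won't notice parked domains or pages that
    # return 200 with the wrong content.
    import urllib.request, urllib.error

    def link_status(url, timeout=15):
        req = urllib.request.Request(url, method='HEAD',
                                     headers={'User-Agent': 'linkcheck-sketch'})
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.status      # 2xx/3xx: probably still alive
        except urllib.error.HTTPError as e:
            return e.code               # 4xx/5xx: probably link rot
        except (urllib.error.URLError, OSError):
            return None                 # DNS failure, refused connection, etc.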

CoolUrlsChange written at 00:41:56

