URL query parameters and how laxness creates de facto requirements on the web

September 7, 2020

One of the ways that DWiki (the code behind Wandering Thoughts) is unusual is that it strictly validates the query parameters it receives on URLs, including on HTTP GET requests for ordinary pages. If an HTTP request has unexpected and unsupported query parameters, such a GET request will normally fail. When I made this decision it seemed the cautious and conservative approach, but this caution has turned out to be a mistake on the modern web. In practice, all sorts of sites will generate versions of your URLs with all sorts of extra query parameters tacked on, give them to people, and expect them to work. If your website refuses to play along, (some) people won't get to see your content. On today's web, you need to accept (and then ignore) arbitrary query parameters on your URLs.

(Today's new query parameter is 's=NN', for various values of NN like '04' and '09'. I'm not sure what's generating these URLs, but it may be Slack.)
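As a concrete illustration of what 'accept and then ignore' can look like, here is a minimal sketch in Python. It is not DWiki's actual code; the KNOWN_PARAMS set and its contents are made up for the example. The idea is simply to keep the query parameters the application knows about and silently drop everything else instead of failing the request.

    # A sketch of the permissive approach: parse whatever query string
    # arrives, keep only the parameters the application understands, and
    # quietly discard everything else rather than rejecting the request.
    from urllib.parse import parse_qs

    # Assumption: these are the only query parameters the application uses.
    KNOWN_PARAMS = {'page', 'atom'}

    def relevant_params(query_string):
        parsed = parse_qs(query_string, keep_blank_values=True)
        return {k: v for k, v in parsed.items() if k in KNOWN_PARAMS}

    # 'utm_source', 'fbclid', 's=09' and friends simply vanish here
    # instead of turning the request into an error:
    print(relevant_params('page=2&utm_source=x&fbclid=abc&s=09'))
    # -> {'page': ['2']}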

You might wonder how we got here, and that is a story of lax behavior (or, if you prefer, being liberal in what you accept). In the beginning, both Apache (for static web pages) and early web applications often ignored extra query parameters on URLs, at least on GET requests. I suspect that other early web servers also imitated Apache here, but I have less exposure to their behavior than Apache's. My guess is that this behavior wasn't deliberate; it was just the simplest way to implement both Apache and early web applications: you paid attention to what you cared about and didn't bother to explicitly check that nothing else was supplied.

When people noticed that this behavior was commonplace and widespread, they began using it. I believe that one of the early uses was for embedding 'where this link was shared' information for your own web analytics (cf), either based on your logs or using JavaScript embedded in the page. In the way of things, once this was common enough, other people began helpfully tagging the links that were shared through them for you, which is why I began to see various 'utm_*' query parameters on inbound requests to Wandering Thoughts even though I never published such URLs. Web developers don't leave attractive nuisances alone for long, so soon enough people were sticking extra query parameters on your URLs that were mostly for them and not so much for you. Facebook may have been one of the early pioneers here with their 'fbclid' parameter, but other websites have hopped on this particular train since then (as I saw recently with these 's=NN' parameters).

At this point, the practice of other websites and services adding random query parameters to your URLs that pass through them is so widespread and common that accepting random query parameters is pretty much a practical requirement for any web content serving software that wants to see wide use and not be irritating to the people operating it. If, like DWiki, you stick to your guns and refuse to accept some or all of them, you will drop some amount of your incoming requests from real people, disappointing would-be readers.

This practical requirement for URL handling is not documented in any specification, and it's probably not in most 'best practices' documentation. People writing new web serving systems that are tempted to be strict and safe and cautious get to learn about it the hard way.

In general, any laxness in actual implementations of a system can create a similar spiral of de facto requirements. Something that is permitted and is useful to people will be used, and then supporting that becomes a requirement. This is especially the case in a distributed system like the web, where any attempt to tighten the rules would only be initially supported by a minority of websites. These websites would be 'outvoted' by the vast majority of websites that allow the lax behavior and support it, because that's what happens when the vast majority work and the minority don't.


Comments on this page:

From 193.219.181.242 at 2020-09-07 00:40:05:

I think sites should instead accept the requests with parameters... and 301 redirect them to the canonical URL without parameters. (Or use JS history.pushState() to the same effect.) This would at least ensure that the weird Analytics garbage does not propagate through further re-shares, or into people's bookmarks, and the browser's address bar looks nice as well.
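(A rough sketch of that approach in Python, assuming a WSGI application and a hypothetical KNOWN_PARAMS set of the parameters the application actually supports:)

    # Hypothetical 'accept, then 301 to the canonical URL' helper: strip
    # query parameters the application doesn't know about and redirect to
    # the cleaned-up URL if anything was removed.
    from urllib.parse import parse_qsl, urlencode

    KNOWN_PARAMS = {'page', 'atom'}   # assumption: what the app supports

    def canonical_redirect(environ, start_response):
        pairs = parse_qsl(environ.get('QUERY_STRING', ''),
                          keep_blank_values=True)
        kept = [(k, v) for k, v in pairs if k in KNOWN_PARAMS]
        if len(kept) == len(pairs):
            return None        # nothing unknown was present; serve normally
        # Something extra (utm_*, fbclid, s=NN, ...) was tacked on, so send
        # the browser to the canonical URL without it.
        location = environ.get('PATH_INFO', '/') or '/'
        if kept:
            location += '?' + urlencode(kept)
        start_response('301 Moved Permanently', [('Location', location)])
        return [b'']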

Could it also be for cache busting? I.e., add a random GET parameter and value to make sure you get the most recent version of the content when generating a preview of the link, or something like that?

You shouldn’t deny requests with junk in the URL for reasons similar to why you shouldn’t serve XHTML. If a visitor shows up with junk attached to the URL, it’s almost always someone else who attached it for the visitor to carry along upon clicking the link. Denying such a request affects the visitor, but not whoever sent them, so it punishes an innocent party who has no relevant power to do something about it (they can edit their address bar but generally not the link they clicked on), while failing to convey any signal to the offending party.

However – visitors will propagate what’s in their address bar if they find the page interesting, and that includes any junk. So I came to suggest what 193.219.181.242 already did – redirecting them to clean up their URLs. Being helpful to visitors does not mean being forced to open up room for piggy-back pass-through tracking beacons in your URL space to the people sending you those visitors. (That’s aside from all the other undesirable effects of having multiple URLs for the same page, like making your logs harder to analyse, making your social link shares harder to track, etc.)


By using_linux at 2020-09-08 03:14:07:

I agree with the redirect route. It may help to think of these URIs as spelling errors (as in Apache's mod_speling) or other clear but fixable errors (like the lack of a trailing slash, when the server clearly only gave out links to this resource with a slash at the end).

For technical audiences, a 404 with a `Refresh: 10; URL=the-fixed-one` header could be helpful, with a message like "Whoever gave you this link put garbage on it, you will be redirected to what we hope is the intended page (but please nag whoever gave you that link)".
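(A sketch of that suggestion in Python, again assuming WSGI and a fixed_url that has already been computed; the Refresh header is not part of any HTTP standard, but browsers generally support it.)

    # Hypothetical '404 plus Refresh' response: explain what happened,
    # then let the browser move on to the cleaned-up URL after 10 seconds.
    def not_found_with_refresh(start_response, fixed_url):
        body = (b'Whoever gave you this link put garbage on it; you will '
                b'be redirected to what we hope is the intended page.\n')
        start_response('404 Not Found', [
            ('Content-Type', 'text/plain'),
            ('Refresh', '10; URL=' + fixed_url),
        ])
        return [body]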

By Just a guy at 2020-09-08 04:18:56:

I believe the ?s=NN parameter that you mention is a Twitter thing. I can't find a link, but on Twitter, when you share a link, they add the s parameter.

From what I can remember, the value of the parameter is determined by the type of Twitter client that you shared from: for example, Twitter web, Twitter for Android, Twitter for iOS, TweetDeck, etc.

By Andy at 2020-09-08 16:42:27:

The internet was designed with Postel's law in mind (be conservative in what you send and liberal in what you accept), and the web was designed with this as well. Early on, the internet's designers learned it's better to be liberal in what you accept (disregarding extras like URL query params the code doesn't use); strictly validating makes code more brittle. It's not a fault, it's a feature that has enabled interoperability for decades.

By Anonymous at 2020-09-09 16:29:45:

@Andy: Actually, these days (some) people argue that, in hindsight, Postel was wrong.

https://tools.ietf.org/id/draft-thomson-postel-was-wrong-03.html
