Doing multi-tag matching through URLs on the modern web

March 13, 2025

So what happened is that Mike Hoye had a question about a perfectly reasonable ideas:

Question: is there wiki software out there that handles tags (date, word) with a reasonably graceful URL approach?

As in, site/wiki/2020/01 would give me all the pages tagged as 2020 and 01, site/wiki/foo/bar would give me a list of articles tagged foo and bar.

I got nerd-sniped by a side question but then, because I'd been nerd-sniped, I started thinking about the whole thing and it got more and more hair-raising as a thing done in practice.

This isn't because the idea of stacking selections like this is bad; 'site/wiki/foo/bar' is a perfectly reasonable and good way to express 'a list of articles tagged foo and bar'. Instead, it's because of how everything on the modern web eventually gets visited combined with how, in the natural state of this feature, 'site/wiki/bar/foo' is just a valid a URL for 'articles tagged both foo and bar'.

The combination, plus the increasing tendency of things on the modern web to rattle every available doorknob just to see what happens, means that even if you don't advertise 'bar/foo', sooner or later things are going to try it. And if you do make the combinations discoverable through HTML links, crawlers will find them very fast. At a minimum this means crawlers will see a lot of essentially duplicated content, and you'll have to go through all of the work to do the searches and generate the page listings and so on.

If I was going to implement something like this, I would define a canonical tag order and then, as early in request processing as possible, generate a HTTP redirect from any non-canonical ordering to the canonical one. I wouldn't bother checking if the tags were existed or anything, just determine that they are tags, put them in canonical order, and if the request order wasn't canonical, redirect. That way at least all of your work (and all of the crawler attention) is directed at one canonical version. Smart crawlers will notice that this is a redirect to something they already have (and hopefully not re-request it), and you can more easily use caching.

(And if search engines still matter, the search engines will see only your canonical version.)

This probably holds just as true for doing this sort of tag search through query parameters on GET queries; if you expose the result in a URL, you want to canonicalize it. However, GET query parameters are probably somewhat safer if you force people to form them manually and don't expose links to them. So far, web crawlers seem less likely to monkey around with query parameters than with URLs, based on my limited experience with the blog.

Written on 13 March 2025.
« The commodification of desktop GUI behavior
Some notes on configuring Dovecot to authenticate via OIDC/OAuth2 »

Page tools: View Source.
Search:
Login: Password:

Last modified: Thu Mar 13 22:46:44 2025
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.