2025-01-22
More features for web page generation systems doing URL remapping
A few years ago I wrote about how web page generation systems should support remapping external URLs (this includes systems that convert some form of wikitext to HTML). At the time I was mostly thinking about remapping single URLs and mentioned things like remapping prefixes (so you could remap an entire domain into web.archive.org) as something for a fancier version. Well, the world turns and things happen and I now think that such prefix remapping is essential; even if you don't start out with it, you're going to wind up with it in the longer term.
(To put it one way, the reality of modern life is that sometimes you no longer want to be associated with some places. And some day, my Fediverse presence may also move.)
In light of a couple of years of churn in my website landscape (after what was in hindsight a long period of stability), I now have revised views on the features I want in a (still theoretical) URL remapping system for Wandering Thoughts. The system I want should be able to remap individual URLs, entire prefixes, and perhaps regular expressions with full-scale rewrites (or maybe some scheme with wildcard matching), although I don't currently have a use for full-scale regular expression rewrites. As part of this, there needs to be some kind of priority or hierarchy between different remappings that can all potentially match the same URL, because there's definitely at least one case today where I want to remap 'asite/a/*' somewhere and all other 'asite/*' URLs to something else. While it's tempting to do something like 'most specific match wins', working out what is most specific from a collection of different sorts of remapping rules seems a bit hard, so I'd probably just implement it as 'first match wins' and manage things by ordering the rules in the configuration file.
('Most specific match wins' is a common feature in web application frameworks for various reasons, but I think it's harder to implement here, especially if I allow arbitrary regular expression matches.)
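As a rough illustration of 'first match wins' across exact, prefix, and regular expression rules, here is a minimal Python sketch. Everything in it is an assumption made for the example rather than an existing system: the Rule class, the remap() function, the three rule kinds, and the hostnames (including the web.archive.org timestamp path) are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """One remapping rule (hypothetical structure, for illustration only)."""
    kind: str     # "exact", "prefix", or "regex"
    pattern: str  # what to match against the original URL
    target: str   # replacement URL, replacement prefix, or regex template

    def apply(self, url):
        """Return the remapped URL, or None if this rule does not match."""
        if self.kind == "exact":
            return self.target if url == self.pattern else None
        if self.kind == "prefix":
            if url.startswith(self.pattern):
                # Keep whatever followed the old prefix.
                return self.target + url[len(self.pattern):]
            return None
        if self.kind == "regex":
            m = re.fullmatch(self.pattern, url)
            # The target may use backreferences such as \1.
            return m.expand(self.target) if m else None
        raise ValueError(f"unknown rule kind: {self.kind}")


def remap(url, rules):
    """First match wins: rules are tried strictly in the order given."""
    for rule in rules:
        new_url = rule.apply(url)
        if new_url is not None:
            return new_url
    return url  # no rule matched, leave the URL alone


# Order matters: the narrower 'asite/a/' rule has to come before the
# catch-all 'asite/' rule, or it would never get a chance to match.
RULES = [
    Rule("prefix", "https://asite.example/a/", "https://newhome.example/a/"),
    Rule("prefix", "https://asite.example/",
         "https://web.archive.org/web/2020/https://asite.example/"),
]

if __name__ == "__main__":
    print(remap("https://asite.example/a/page.html", RULES))
    print(remap("https://asite.example/other.html", RULES))
```

Swapping the two entries in RULES would send every 'asite' URL to the archive, which is the price of ordering by hand instead of working out specificity automatically.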
Obviously the remapping configuration file should support comments (every configuration system needs to). Less obviously, I'd support file inclusion or the now common pattern of a '<whatever>.d' directory for drop-in files, so that remapping rules can be split up by things like the original domain rather than having to all be dumped into an ever-growing single configuration file.
(Since more and more links rot as time passes, we can pretty much guarantee that the number of our remappings is going to keep growing.)
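A loader for such a configuration might look something like the following Python sketch. The file format is invented for the example: one rule per line as 'kind pattern target', full-line '#' comments only (since URLs can legitimately contain '#'), and drop-in files named '*.conf' in a '<main file>.d' directory, read in sorted order after the main file. The function names parse_rules() and load_rules() are likewise hypothetical.

```python
from pathlib import Path

KINDS = ("exact", "prefix", "regex")


def parse_rules(text):
    """Parse 'kind pattern target' lines, skipping blanks and comments.

    Assumes patterns and targets contain no whitespace, which is true
    of properly encoded URLs but would not hold for arbitrary regexes.
    """
    rules = []
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        # Only whole-line comments are recognized, because '#' can
        # legitimately appear inside a URL as a fragment marker.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 3 or fields[0] not in KINDS:
            raise ValueError(f"line {lineno}: expected 'kind pattern target'")
        rules.append(tuple(fields))
    return rules


def load_rules(main_file):
    """Load the main file, then any '<main_file>.d/*.conf' drop-ins."""
    main = Path(main_file)
    rules = parse_rules(main.read_text(encoding="utf-8"))
    dropin_dir = Path(str(main) + ".d")
    if dropin_dir.is_dir():
        # Sorted order keeps rule priority predictable, so files can be
        # named things like '10-asite.conf' and '20-othersite.conf'.
        for path in sorted(dropin_dir.glob("*.conf")):
            rules.extend(parse_rules(path.read_text(encoding="utf-8")))
    return rules
```

Because the rules come back as one ordered list, a 'first match wins' matcher can use the result directly; the sorted file names are what decide priority between drop-in files.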
Along with the remapping, I may want something (i.e., a tiny web application) that dynamically generates some form of 'we don't know where you can find this now but here is what the URL used to be' page for any URL I feed it. The obvious general reason for this is that sometimes old domain names get taken over by malicious parties and the old content is nowhere to be found, not even on web.archive.org. In that case you don't want to keep a link to what's now a malicious site, but you also don't have any other valid target for your old link. You could rewrite the link to some invalid domain name and leave it to the person visiting you and following the link to work out what happened, but it's better to be friendly.
(This is where you want to be careful about XSS and other hazards of operating what is basically an open 'put text in and we generate an HTML page with it shown in some way' service.)
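To make the XSS concern concrete, the page-generating core of such a service could be sketched in Python roughly as follows. This is only an assumption about one reasonable approach: the gone_page() name, the page wording, and the decision to accept only http and https URLs are all made up for the example. The key points are that the old URL is HTML-escaped before it goes into the page and that it is shown as text, never as a live link.

```python
import html
from urllib.parse import urlsplit


def gone_page(old_url):
    """Build a small HTML page that shows a dead URL as inert text.

    Hypothetical helper: raises ValueError for anything that is not a
    plain http(s) URL, so things like 'javascript:' never get echoed.
    """
    parts = urlsplit(old_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("not an http(s) URL")
    # Escape <, >, & and quotes so the URL cannot inject markup.
    shown = html.escape(old_url, quote=True)
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        "<title>Link no longer available</title></head><body>"
        "<p>We don't know where you can find this now, "
        "but the URL used to be:</p>"
        # Deliberately plain text inside <code>, not an <a href=...>,
        # so nobody is one click away from a possibly malicious site.
        f"<p><code>{shown}</code></p>"
        "</body></html>\n"
    )


if __name__ == "__main__":
    print(gone_page("https://old.example/some/page?q=<b>hi</b>"))
```

A real deployment would still need the usual care around it (for instance limiting URL length and not letting the page be used to lend credibility to arbitrary text), which this sketch does not attempt.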