2024-08-29
The web fun fact that domains can end in dots and canonicalization failures
Recently, my section of the Fediverse learned that the paywall of a large US-based news company could be bypassed simply by putting a '.' at the end of the website name. That is to say, you asked for 'https://newssite.com./article' instead of 'https://newssite.com/article'. People had a bit of a laugh (myself included) and also sympathized, because this is relatively obscure DNS trivia. Later, I found myself with a bit of a different view, which is that this is a failure of canonicalization in the web programming and web server environment.
(One theory for how this issue could happen is that the news company runs multiple sites from the same infrastructure and wants the paywall to only apply to some of them. Modern paywalls are relatively sophisticated programming, so I can easily imagine listing off the domains that should be affected by the paywall and missing the 'domain.' forms, perhaps because the people doing the programming simply don't know that bit of trivia.)
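To make that theory concrete, here is a minimal sketch of how such a check could go wrong. Everything in it is hypothetical: the host names, the `PAYWALLED_HOSTS` set, and the `is_paywalled_naive` function are made up for illustration, and I have no knowledge of how the actual site implements its paywall.

```python
# Hypothetical list of paywalled hosts; these names are invented for illustration.
PAYWALLED_HOSTS = {"newssite.example", "www.newssite.example"}


def is_paywalled_naive(host_header: str) -> bool:
    """Exact string match of the request's host against the configured names.

    This is the suspected mistake: 'newssite.example.' names the same host
    in DNS, but as a string it is not in the set.
    """
    return host_header in PAYWALLED_HOSTS


print(is_paywalled_naive("newssite.example"))   # dotless form is matched
print(is_paywalled_naive("newssite.example."))  # trailing-dot form slips through
```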
At the textual level, there are a lot of ways to vary host names and URLs. Host names are case-insensitive, characters in URLs can be %-encoded, and so on (and I'm going to leave out structural modifications like '/./' and '/../' URL path elements or adding random query parameters). Web programming and web server environments already shield people from some of those by default; for example, if you configure a name-based virtual host, I think basically every web server will treat the name you provided as a case-insensitive one. Broadly, we can consider this as canonicalizing the URL and other HTTP request information for you, so that you don't have to do it and thus you don't have to know all of the different variations that are possible.
It's my view that this canonicalization should also happen for host and domain names with dots at the end. Your web programming code should not have to even care about the possibility by default, any more than you probably have to care about it when configuring virtual hosts. If you really wanted to know low-level details about the request you should be able to, but the normal, easily accessible information you use for comparing and matching and so on should be canonicalized for you. This way it can be handled once by experts who know all of the crazy things that can appear in URLs, instead of repeatedly by web programmers who don't.
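As a rough sketch of what that canonicalization might look like if the environment did it for you, here is a small Python function. The name `canonical_host` is my own invention, not any framework's API, and it assumes a plain 'host' or 'host:port' value; it deliberately doesn't handle bracketed IPv6 literals or internationalized names, which a real implementation would have to.

```python
def canonical_host(host_header: str) -> str:
    """Return a canonical form of a Host header value for comparisons.

    Assumes 'host' or 'host:port'; bracketed IPv6 literals such as
    '[::1]:8080' are NOT handled in this sketch.
    """
    # Drop any ':port' suffix; only the name matters for matching here.
    host, _, _port = host_header.strip().partition(":")
    # A single trailing dot marks a fully qualified DNS name; remove it.
    if host.endswith("."):
        host = host[:-1]
    # Host names are case-insensitive, so compare in lower case.
    return host.lower()


print(canonical_host("NewsSite.Example."))      # newssite.example
print(canonical_host("newssite.example.:443"))  # newssite.example
```

Code that compares or matches against `canonical_host(...)` instead of the raw header value never sees the 'domain.' form at all, which is the point.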
(Because if we make everyone handle this themselves we already know what's going to happen; some of them won't, and then we'll get various sorts of malfunctions, bugs, and security issues.)
PS: I've probably written some web-related code that gets this wrong, treating 'domain.' and 'domain' as two separate things (and so probably denying access to the 'domain.' form as an unknown host). In fact, if you try this here on Wandering Thoughts, you'll get a redirection to the dotless version of the domain, but this is because I put in a general 'redirect all weird domain variations to the canonical domain' feature a long time ago.
(My personal view is that redirecting to the canonical form of the domain is a perfectly valid thing to do in this situation.)
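For illustration, the redirect approach can be sketched in a few lines of Python. This is not the code that runs here; `CANONICAL_HOST` and `redirect_target` are hypothetical names, and the sketch assumes a single canonical host served on the default port.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Hypothetical canonical host name, invented for this example.
CANONICAL_HOST = "blog.example"


def redirect_target(url: str) -> Optional[str]:
    """Return the URL to redirect to, or None if no redirect applies.

    None covers two cases: the URL already uses the canonical host, or
    the host is not a variation of it (which would be handled elsewhere).
    """
    parts = urlsplit(url)
    # .hostname is already lower-cased and has any port removed.
    host = parts.hostname or ""
    if host.endswith("."):
        host = host[:-1]
    if host != CANONICAL_HOST:
        return None  # not a variation of our host
    if parts.netloc == CANONICAL_HOST:
        return None  # already in canonical form
    # Keep scheme, path, query, and fragment; replace only the host part.
    return urlunsplit(
        (parts.scheme, CANONICAL_HOST, parts.path, parts.query, parts.fragment)
    )


print(redirect_target("https://blog.example./a?x=1"))  # https://blog.example/a?x=1
print(redirect_target("https://blog.example/a"))       # None
```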