The web fun fact that domains can end in dots and canonicalization failures

August 29, 2024

Recently, my section of the Fediverse learned that the paywall of a large US-based news company could be bypassed simply by putting a '.' at the end of the website name. That is to say, you asked for 'https://newssite.com./article' instead of 'https://newssite.com/article'. People had a bit of a laugh (myself included) and also sympathized, because this is relatively obscure DNS trivia. Later, I found myself with a bit of a different view, which is that this is a failure of canonicalization in the web programming and web server environment.

(One theory for how this issue could happen is that the news company runs multiple sites from the same infrastructure and wants the paywall to only apply to some of them. Modern paywalls are relatively sophisticated programming, so I can easily imagine listing off the domains that should be affected by the paywall and missing the 'domain.' forms, perhaps because the people doing the programming simply don't know that bit of trivia.)

At the textual level, there are a lot of ways to vary host names and URLs. Hostnames are case-insensitive, characters in URLs can be %-encoded, and so on (and I'm going to leave out structural modifications like '/./' and '/../' URL path elements or adding random query parameters). Web programming and web server environments already shield people from some of those by default; for example, if you configure a name-based virtual host, I think basically every web server will match the name you provided case-insensitively. Broadly we can consider this as canonicalizing the URL and other HTTP request information for you, so that you don't have to do it and thus you don't have to know all of the different variations that are possible.

It's my view that this canonicalization should also happen for host and domain names with dots at the end. Your web programming code should not have to even care about the possibility by default, any more than you probably have to care about it when configuring virtual hosts. If you really wanted to know low-level details about the request you should be able to, but the normal, easily accessible information you use for comparing and matching and so on should be canonicalized for you. This way it can be handled once by experts who know all of the crazy things that can appear in URLs, instead of repeatedly by web programmers who don't.
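To make this concrete, here is a minimal sketch of the sort of host canonicalization I'm arguing the environment should do for you before your code ever matches on the host (the function name and the exact set of rules are my own illustration, not any particular framework's API):

```python
def canonicalize_host(host: str) -> str:
    """Return a canonical form of an HTTP Host value for matching purposes."""
    host = host.strip()
    # Split off an explicit port, if any (IPv6 address literals would
    # need more care than this toy version takes).
    if host.count(":") == 1:
        host, _, _port = host.partition(":")
    # Hostnames are case-insensitive.
    host = host.lower()
    # Strip the trailing dot that marks a fully qualified DNS name,
    # so 'newssite.com.' and 'newssite.com' compare equal.
    host = host.rstrip(".")
    return host

# 'NewsSite.com.' and 'newssite.com:443' both canonicalize to 'newssite.com'.
```

With something like this applied centrally, a paywall's list of affected domains only has to contain the canonical spellings.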

(Because if we make everyone handle this themselves we already know what's going to happen; some of them won't, and then we'll get various sorts of malfunctions, bugs, and security issues.)

PS: I've probably written some web related code that gets this wrong, treating 'domain.' and 'domain' as two separate things (and so probably denying access to the 'domain.' form as an unknown host). In fact if you try this here on Wandering Thoughts, you'll get a redirection to the dotless version of the domain, but this is because I put in a general 'redirect all weird domain variations to the canonical domain' feature a long time ago.

(My personal view is that redirecting to the canonical form of the domain is a perfectly valid thing to do in this situation.)


Comments on this page:

By Alex at 2024-08-30 09:39:46:

In fact if you try this here on Wandering Thoughts, you'll get a redirection to the dotless version of the domain,

Hmm?

https://utcc.utoronto.ca./~cks/space/blog/ gives me

400 Bad Request
Your browser sent a request that this server could not understand.
By cks at 2024-08-30 10:46:58:

Well, that's interesting. I tested with a completely manual request (using a tool I have) where I put in a 'Host: utcc.utoronto.ca.' header on the GET. This gave me the HTTP redirection. But using a real browser gets the 400 Bad Request, which almost certainly comes from my DWiki code instead of from the Apache frontend. I wrote the code but I don't know why it's doing things differently (or what could be the trigger).

By cks at 2024-08-30 11:03:09:

I take back what I said about this being DWiki. I think this is Apache, and it doesn't even happen for all browsers. If I use lynx, it works (I get redirected), but if I use Firefox, I get the 400 error. I can't see anything in the server logs and the HTTP headers look normal in the web tools view.

By kfraz at 2024-08-30 14:11:06:

It's my view that this canonicalization should also happen for host and domain names with dots at the end.

I agree, but note that it doesn't necessarily solve the "problem". For example, maybe "news.example." will no longer bypass the paywall, but what if I send a Host header of "bogus.news.example"? (It'd need an extension, a custom "hosts" file, or similar.) Since the paywall-implementing code seems to fail open for unexpected domains, that would probably work.

Then again, I think a lot of these paywalls are meant to be porous. Various web-archivers and search engines manage to get the articles. Some sites just never show the paywall to people with cookies disabled or cleared, or with Javascript disabled. The New York Times still runs a .onion mirror, to which Tor Browser offers to redirect me even when I'm staring at the paywall—and upon accepting, I usually see the article.

You mention "." and "..", and RFC 3986 calls those out specifically, but it's not clear to me whether it actually requires software to handle them specially; the answer's obviously "yes" when dealing with hierarchical data, but §3.3 is somewhat vague in its statement that a path is "usually" hierarchical.

By Ian Z aka nobrowser at 2024-08-30 15:27:16:

Nitpick: Isn't the spelling with a trailing dot in fact the "canonical" one? The domain name space has a single root with an empty label, and the usual spelling is just a result of the root being in the search path, whether that's explicitly configured or built into the resolver.

I'm not suggesting we should be typing the dotful form into browser URL bars :-)

By cks at 2024-08-30 15:42:19:

The dot at end form of host and domain names is canonical only in the context of DNS, where '.' by itself has the special meaning of the DNS root zone. Everywhere else, I would maintain that it's not the canonical form, and it only works (to the extent that it works) because things hack it in.

One place it's explicitly not the canonical form is in TLS certificates. TLS certificates are always issued to fully qualified domain names, and a trailing '.' does not appear in the names in the certificates (as far as I know and can tell from inspecting certificates). This means that browsers and other TLS tools treat the terminal dot specially, since they must strip it off (if you gave it) when they compare DNS names to verify that the certificate is for the right site.

(Unsurprisingly, some tools do not treat names this way. My version of wget fails here.)
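The trailing-dot handling that TLS name verification has to do can be sketched like this (a toy comparison function of my own, not any real TLS library's code; real verification also handles wildcards, IDNA, and much more):

```python
def tls_names_match(requested_host: str, certificate_name: str) -> bool:
    """Toy check of a requested host against a certificate DNS name.

    Only illustrates the trailing-dot stripping discussed above;
    real TLS hostname verification is considerably more involved.
    """
    # Certificates carry fully qualified names without the trailing dot,
    # so strip one from the requested host before comparing.
    if requested_host.endswith("."):
        requested_host = requested_host[:-1]
    # DNS names compare case-insensitively.
    return requested_host.lower() == certificate_name.lower()
```

A tool that skips this step will reject 'utcc.utoronto.ca.' even when the certificate is perfectly valid for 'utcc.utoronto.ca'.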

Broadly we can consider this as canonicalizing the URL and other HTTP request information for you, so that you don't have to do it and thus you don't have to know all of the different variations that are possible.

I agree that this should happen, but I disagree very strongly with why. You should know about all of the variations. The reason for canonicalization is so that when f.ex. you want to list domains that should match, you don’t need to list out every possible spelling – not because you shouldn’t even need to know of them, but because you’d be effectively reimplementing all the rules ad hoc on the spot, and every implementation is another opportunity for bugs. Abstractions reduce work and increase robustness (basically the same benefit you’ve pointed out for automation, even when little efficiency is gained from it); they almost never remove the need to understand the complexity they’re encapsulating, because they are just about invariably leaky.

(In practical terms our positions do boil down to almost the same thing. (And in this particular example it makes no discernible difference at all.) In fact I don’t know that on reflection you’ll even disagree with me at all. Then again maybe you will.)

By Etienne Dechamps at 2024-08-31 08:55:19:

Whether the trailing dot should be treated as "canonical" is an interesting question. Clearly, in common usage it isn't (who adds a dot to the addresses of websites they visit?), but technically it probably should have been: names without trailing dots are technically ambiguous because they interact with local DNS suffixes. For example, if you go to google.com and your device is configured with a DNS suffix of utoronto.ca, then technically you could end up at google.com.utoronto.ca. If you want to make sure you're actually going to google.com, at least in theory you're supposed to type google.com., which will never append local DNS suffixes.
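The search-suffix behavior being described can be modeled roughly as follows (a toy illustration of common resolver behavior with an 'ndots'-style rule; the threshold and the ordering vary between resolver libraries, and this is not real resolver code):

```python
def candidate_names(name: str, search_suffixes: list[str], ndots: int = 1) -> list[str]:
    """Toy model of the lookup order a typical resolver might try."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: suffixes never apply.
        return [name]
    candidates = []
    if name.count(".") >= ndots:
        # 'Dotty enough' names are tried as given first.
        candidates.append(name)
    for suffix in search_suffixes:
        candidates.append(f"{name}.{suffix}")
    if name.count(".") < ndots:
        candidates.append(name)
    return candidates
```

In this model, 'google.com.' resolves only as itself, while plain 'google.com' can fall through to 'google.com.utoronto.ca' if the first lookup fails.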

By the way, for more ways to exploit URL confusion for fun and profit, this paper is an interesting read.

By Walex at 2024-09-01 05:28:48:

«names without trailing dots are technically ambiguous because they interact with local DNS suffixes»

The DNS, properly defined, does not have dots at the end, nor does it have "technically ambiguous" names or "local DNS suffixes"; it only has absolute domain names that must be specified in full.

Both “local DNS suffixes” and a dot at the end are solely features of particular (if common) resolver libraries, and knowing that is part of knowing that particular resolver libraries offer such abstractions over the DNS, which is an example of the issue our blogger was mentioning.

By DanielMartin at 2024-10-07 14:52:37:

The reason you get a 400 from real browsers and a redirect from some other tools is likely due to whether the tool sends the TLS SNI (Server Name Indication) extension in the TLS ClientHello message.

If a client connects with no SNI in the ClientHello message, you get a redirect. If a client connects with an SNI that says utcc.utoronto.ca but with a Host: header that says something else, you get a redirect. If a client connects with an SNI that points to an unknown host (such as utcc.utoronto.ca., but also if they connect with an SNI that points to some total garbage name), the client gets a 400 response.

I believe that the 400 response is coming from Apache itself and is a signal that it can't find an acceptable VirtualHost directive.
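If this diagnosis is right, one Apache-side mitigation would be to make the dotted spelling an explicit alias on the relevant virtual host (a sketch only; the hostnames stand in for whatever the real configuration uses, and whether a given Apache version accepts the trailing-dot form in ServerAlias is an assumption worth testing):

```apache
<VirtualHost *:443>
    ServerName utcc.utoronto.ca
    # Cover the fully qualified (trailing-dot) spelling explicitly, so
    # requests whose SNI or Host value is 'utcc.utoronto.ca.' still
    # match this vhost instead of falling through to a 400.
    ServerAlias utcc.utoronto.ca.
    # ... rest of the vhost configuration ...
</VirtualHost>
```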

-- DanielMartin (lost my password here ages and ages ago)
