Wandering Thoughts archives

2021-07-08

Redirecting paths that start with two slashes in Apache

Suppose that you have a dynamic web application of some sort that sits behind Apache, and a URL for one of your application's pages starts going around with a path that begins with two slashes instead of one. In other words, people are sharing 'https://example.org//app/page' instead of the version with one slash at the start of the path. You can support this in your web app (with some potential cautions), but you would prefer to have Apache redirect the non-ideal URL to the canonical URL.

Under normal circumstances, this sort of selective redirection should be straightforward using mod_rewrite. If you wanted to rewrite only a single specific bad path instead of all potential ones, I'd expect something like the following to work:

RewriteCond %{REQUEST_URI} "=//app/page"
RewriteRule ^.* https://example.org/app/page

(This is one of the cases where exact string matching in RewriteCond is a useful thing.)

However, this appears not to work, at least in a .htaccess file, and I suspect it won't work in the main Apache configuration file either. Although Apache's %{REQUEST_URI} gives you the full URL path even in a .htaccess, it appears that Apache canonicalizes it for the purposes of things like RewriteCond and so turns the two leading slashes into one. This canonicalization isn't passed through to CGIs, though; they will see the original "//app/page" version.

(This canonicalization appears to apply to any / in the URL path, not just the first ones. If you write a condition for "/app/dir/page", it will match for URLs with any amount of additional slashes, eg "//app////dir///page" will match.)
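To make the effect concrete, here is a rough Python model of that slash collapsing. It only illustrates the behaviour observed above and is not Apache's actual code; the function name collapse_slashes is made up for this sketch.

```python
import re

def collapse_slashes(path):
    """Collapse every run of '/' into a single '/'.

    This is an assumed model of what the RewriteCond matching appears
    to see, not a description of Apache's implementation.
    """
    return re.sub(r"/{2,}", "/", path)

# A condition written for the canonical path also matches the messy variant.
print(collapse_slashes("//app////dir///page") == "/app/dir/page")
```

Under this model, an exact-match condition can never tell "//app/page" apart from "/app/page", which is consistent with the first rule never firing.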

Instead, the only way I found to do this was with Apache's special %{THE_REQUEST} variable, which holds the full HTTP request line. As the mod_rewrite documentation covers, this value has not been unescaped (decoded), unlike most other variables, which may cause you heartburn if people get clever or you want to do general matching. So the rule I wound up with looks like:

RewriteCond %{THE_REQUEST} "^GET //app/page "
RewriteRule ^.* https://example.org/app/page

This match is very specific, since we're doing a very specific HTTP redirection. It would need to be more complicated if you had to handle query strings or request methods other than GET, for example. But it works, unlike the other option.
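If you do want something more general, it can help to try candidate patterns against sample request lines before putting one in a RewriteCond. The following is a minimal Python sketch of that, assuming a hypothetical, slightly broader pattern that accepts GET or HEAD and an optional query string. The names DOUBLE_SLASH and redirect_target are invented for illustration, and Python's re module stands in for Apache's regular expression engine, which may differ in details.

```python
import re

# Hypothetical broader pattern: GET or HEAD, then the doubled-slash path,
# then an optional query string, then the space before the HTTP version.
DOUBLE_SLASH = re.compile(r"^(?:GET|HEAD) //app/page(\?[^ ]*)? ")

def redirect_target(request_line):
    """Return the canonical URL to redirect to, or None if there is no match."""
    m = DOUBLE_SLASH.match(request_line)
    if m is None:
        return None
    # Carry over the query string, if the request had one.
    return "https://example.org/app/page" + (m.group(1) or "")

print(redirect_target("GET //app/page HTTP/1.1"))
print(redirect_target("GET //app/page?x=1 HTTP/1.1"))
print(redirect_target("GET /app/page HTTP/1.1"))
```

Because the request line is not decoded, a pattern like this will not match percent-encoded variants of the same path; that is part of the heartburn mentioned above.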

Possibly I'm missing a clever trick that enables a better version of this. I don't really like matching things so specifically, but it seems to be what you have to reach for in this unusual situation.

web/ApacheRedirectDoubleSlash written at 23:57:15

A semi-surprise with Python's urllib.parse and partial URLs

One of the nice things about urllib.parse (and its Python 2 equivalent) is that it will deal with partial URLs as well as full URLs. This is convenient because there are various situations in a web server context where you may get either partial URLs or full URLs, and you'd like to decode both of them in order to extract various pieces of information (primarily the path, since that's all you can reliably count on being present in a partial URL). However, URLs are tricky things once you peek under the hood; see, for example, URLs: It's complicated.... A proper URL parser needs to deal with that full complexity, and that means that it hides a surprise about how relative URLs will be interpreted.

Suppose, for example, that you're parsing an Apache REQUEST_URI to extract the request's path. You have to actually parse the request's URI to get this, because funny people can send you full URLs in HTTP GET requests, which Apache will pass through to you. Now suppose someone accidentally creates a URL for a web page of yours that looks like 'https://example.org//your/page/url' (with two slashes after the host instead of one) and visits it, and you attempt to decode the result of what Apache will hand you:

>>> import urllib.parse
>>> urllib.parse.urlparse("//your/page/url")
ParseResult(scheme='', netloc='your', path='/page/url', params='', query='', fragment='')

The problem here is that '//ahost.org/some/path' is a perfectly legal protocol-relative URL, so that's what urllib.parse will produce when you give it something that looks like one, which is to say something that starts with '//'. Because we know where it came from, you and I know that this is a relative URL with an extra / at the front, but urlparse() can't make that assumption and there's no way to limit its standard-compliant generality.

If this is an issue for you (as it was for me recently), probably the best thing you can do is check for a leading '//' before you call urlparse() and turn it into just '/' (the simple way is to just strip off the first character in the string). Doing anything more complicated feels like it's too close to trying to actually understand URLs, which is the very job we want to delegate to urlparse() because it's complicated.
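As a minimal sketch of that pre-check, assuming the input is something like Apache's REQUEST_URI (the helper name parse_request_uri is made up here):

```python
from urllib.parse import urlparse

def parse_request_uri(uri):
    """Parse a request URI, treating a leading '//' as a stray extra slash."""
    if uri.startswith("//"):
        # Keep exactly one leading slash so urlparse() cannot mistake
        # the first path component for a network location.
        uri = "/" + uri.lstrip("/")
    return urlparse(uri)

print(parse_request_uri("//your/page/url").path)
print(parse_request_uri("https://example.org/your/page").netloc)
```

This variant drops all of the extra leading slashes rather than only the first character, which is a design choice on my part; stripping a single character is enough for the two-slash case. Full URLs are left alone because they do not start with '//'.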

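PS: Because I tested it just now, the result of giving urlparse() a relative URL that starts with three slashes is that it's interpreted as a relative URL with an empty network location, not as a protocol-relative URL with a host. The path of the result has the two extra leading slashes stripped off. As far as I can tell, only the first two slashes are consumed this way, so with four or more slashes the path keeps the leftover ones.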
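A quick way to see the difference between the two-slash and three-slash cases for yourself (a small sketch; the variable names are arbitrary):

```python
from urllib.parse import urlparse

two = urlparse("//your/page/url")
three = urlparse("///your/page/url")

# With two slashes, the first path component is taken as the host.
print(repr(two.netloc), repr(two.path))
# With three slashes, the host is empty and the whole path survives.
print(repr(three.netloc), repr(three.path))
```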

python/UrllibParsePartialURLs written at 00:10:39


