A semi-surprise with Python's urllib.parse and partial URLs

July 8, 2021

One of the nice things about urllib.parse (and its Python 2 equivalent) is that it will deal with partial URLs as well as full URLs. This is convenient because there are various situations in a web server context where you may get either partial URLs or full URLs, and you'd like to decode both of them in order to extract various pieces of information (primarily the path, since that's all you can reliably count on being present in a partial URL). However, URLs are tricky things once you peek under the hood; see, for example, "URLs: It's complicated...". A proper URL parser needs to deal with that full complexity, and as a result it hides a surprise about how relative URLs will be interpreted.

Suppose, for example, that you're parsing an Apache REQUEST_URI to extract the request's path. You have to actually parse the request's URI to get this, because funny people can send you full URLs in HTTP GET requests, which Apache will pass through to you. Now suppose someone accidentally creates a URL for a web page of yours that looks like 'https://example.org//your/page/url' (with two slashes after the host instead of one) and visits it, and you attempt to decode the result of what Apache will hand you:

>>> import urllib.parse
>>> urllib.parse.urlparse("//your/page/url")
ParseResult(scheme='', netloc='your', path='/page/url', params='', query='', fragment='')

The problem here is that '//ahost.org/some/path' is a perfectly legal protocol-relative URL, so that's how urllib.parse will interpret anything that looks like one, which is to say anything that starts with '//'. Because we know where it came from, you and I know that this is a relative URL with an extra / at the front, but urlparse() can't make that assumption, and there's no way to limit its standards-compliant generality.

If this is an issue for you (as it was for me recently), probably the best thing you can do is check for a leading '//' before you call urlparse() and turn it into just '/' (the simple way is to just strip off the first character in the string). Doing anything more complicated feels like it's too close to trying to actually understand URLs, which is the very job we want to delegate to urlparse() because it's complicated.
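As a minimal sketch of that pre-check, assuming the input is a REQUEST_URI-style string where a leading '//' can only be a mistake: the helper name request_path() is hypothetical, and it collapses the whole run of leading slashes to one rather than removing just the first character, which is a small variation on the approach described above.

```python
from urllib.parse import urlparse


def request_path(request_uri: str) -> str:
    """Return the path of a REQUEST_URI-style string (hypothetical helper)."""
    # Assumption: a value that starts with '//' came from a doubled slash in
    # a local path, not from a genuine protocol-relative URL.
    if request_uri.startswith("//"):
        # Collapse the whole run of leading slashes down to a single '/'.
        request_uri = "/" + request_uri.lstrip("/")
    return urlparse(request_uri).path


print(request_path("//your/page/url"))
print(request_path("https://example.org/your/page/url"))
```

Note that this only touches inputs that begin with '//'. A full URL such as 'https://example.org//your/page/url' is passed to urlparse() unchanged, so its path still comes back with the doubled slash.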

PS: Because I tested it just now, the result of giving urlparse() a relative URL that starts with three slashes is that it's interpreted as a relative URL, not a protocol-relative URL. The first two slashes are consumed as an empty network location and whatever follows becomes the path, so with three slashes the path of the result has a single leading slash.
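For anyone who wants to check this on their own Python version, here is a small sketch that prints the netloc and path that urlparse() reports for one, two, and three leading slashes. The example path is just the one used earlier in this entry.

```python
from urllib.parse import urlparse

# Compare how urlparse() splits the same path with 1, 2, and 3 leading slashes.
for candidate in ("/your/page/url", "//your/page/url", "///your/page/url"):
    result = urlparse(candidate)
    print(f"{candidate!r}: netloc={result.netloc!r} path={result.path!r}")
```

Only the two-slash form should come back with a non-empty netloc; the other two should report an empty netloc and a path of '/your/page/url'.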
