A semi-surprise with Python's urllib.parse and partial URLs
One of the nice things about urllib.parse (and its Python 2 equivalent) is that it will deal with partial URLs as well as full URLs. This is convenient because there are various situations in a web server context where you may get either partial URLs or full URLs, and you'd like to decode both of them in order to extract various pieces of information (primarily the path, since that's all you can reliably count on being present in a partial URL). However, URLs are tricky things once you peek under the hood; see, for example, URLs: It's complicated.... A proper URL parser needs to deal with that full complexity, and that means that it hides a surprise about how relative URLs will be interpreted.
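As a small illustration of that convenience, here is roughly what urlparse() gives you for a full URL and for a path-only one. The example.org URLs and the query string are made up for the illustration.

```python
from urllib.parse import urlparse

# A full URL: scheme, host, path and query are all filled in.
full = urlparse("https://example.org/your/page/url?q=1")

# A partial (path-only) URL: scheme and netloc come back as empty
# strings, but the path and query are extracted the same way.
partial = urlparse("/your/page/url?q=1")

print(full.scheme, full.netloc, full.path, full.query)
print(repr(partial.scheme), repr(partial.netloc), partial.path, partial.query)
```

Either way the path attribute is where you'd look, which is what makes it tempting to feed urlparse() whatever the web server hands you.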
Suppose, for example, that you're parsing an Apache REQUEST_URI to extract the request's path. You have to actually parse the request's URI to get this, because funny people can send you full URLs in HTTP GET requests, which Apache will pass through to you.
Now suppose someone accidentally creates a URL for a web page of yours that looks like 'https://example.org//your/page/url' (with two slashes after the host instead of one) and visits it, and you attempt to decode the result of what Apache will hand you:
>>> import urllib.parse
>>> urllib.parse.urlparse("//your/page/url")
ParseResult(scheme='', netloc='your', path='/page/url', params='', query='', fragment='')
The problem here is that '//ahost.org/some/path' is a perfectly legal protocol-relative URL, so that's what urllib.parse will produce when you give it something that looks like one, which is to say something that starts with '//'. Because we know where it came from, you and I know that this is a relative URL with an extra / at the front, but urlparse() can't make that assumption and there's no way to limit its standards-compliant generality.
If this is an issue for you (as it was for me recently), probably the best thing you can do is check for a leading '//' before you call urlparse() and turn it into just '/' (the simple way is to just strip off the first character in the string). Doing anything more complicated feels like it's too close to trying to actually understand URLs, which is the very job we want to delegate to urlparse() because it's complicated.
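A minimal sketch of that pre-check might look like the following. The request_path() name is made up for this example, and it deliberately goes a bit further than stripping a single character: it collapses the whole run of leading slashes, because a string with three leading slashes that loses only one would then start with '//' and be parsed as protocol-relative after all. It also assumes you never want to treat the input as a protocol-relative URL, which is reasonable for something like REQUEST_URI but not in general.

```python
from urllib.parse import urlparse

def request_path(request_uri):
    """Return the path of a REQUEST_URI-style string (hypothetical helper)."""
    if request_uri.startswith("//"):
        # Collapse the run of leading slashes down to a single '/', so
        # urlparse() can't mistake the first path component for a host.
        request_uri = "/" + request_uri.lstrip("/")
    return urlparse(request_uri).path

print(request_path("//your/page/url"))
print(request_path("https://example.org/your/page/url"))
```

Both calls should print '/your/page/url'. Full URLs are left for urlparse() to handle, since they don't start with '//'.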
PS: Because I tested it just now, the result of giving urlparse() a relative URL that starts with three or more slashes is that it's interpreted as a relative URL, not a protocol-relative URL. The path of the result will have the extra leading slashes stripped off.
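For the three-slash case specifically, the check is short; this only demonstrates exactly three leading slashes, using the same made-up path as above.

```python
from urllib.parse import urlparse

# Three leading slashes: the host part comes back empty, so the
# result is a path-only URL rather than a protocol-relative one.
result = urlparse("///your/page/url")
print(repr(result.netloc), result.path)
```

Here netloc is the empty string and path is '/your/page/url'.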