In practice, there are multiple namespaces for URLs

January 3, 2006

In theory, the HTTP and URI/URL standards say that URLs all live in a single namespace; there is no notion of GET, POST, and so on each using different URL namespaces, with some URLs existing only for POST and some only for GET.

In practice, I believe that web traversal software should behave as if there were two URL namespaces on websites: one for GET and HEAD requests, and a completely independent one for POST requests. Crawling software should not issue 'cross-namespace' URL requests, because you simply can't assume that a URL that is valid in one can even be used in the other.
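As a concrete sketch of what I mean (in Python, with made-up class and method names, not taken from any real crawler), a crawler could track which namespace it saw each URL in and refuse to cross over:

    # Keep GET/HEAD URLs and POST-only URLs in separate namespaces.
    # All names here are illustrative.
    class UrlNamespaces:
        def __init__(self):
            self.get_urls = set()    # URLs seen as links, redirects, and so on
            self.post_urls = set()   # URLs seen only as POST form actions

        def saw_get_url(self, url):
            self.get_urls.add(url)

        def saw_post_action(self, url):
            self.post_urls.add(url)

        def may_fetch_with_get(self, url):
            # Only GET URLs we actually saw in a GET context; being the
            # target of a POST form is not good enough.
            return url in self.get_urls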

This isn't very hard for POST requests; not much software makes them, and there are lots of things that make it difficult to send useful POST requests off to URLs you've only seen in GET contexts. (In theory you could try converting GET requests with parameters into POST form requests with the same parameters, but I suspect this would strike people as at least dangerous and questionable.)
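To illustrate how mechanical (and how questionable) that conversion would be, here is roughly what it looks like with nothing but the modern Python standard library; the function name is made up, and I am not suggesting anyone actually do this automatically:

    from urllib.parse import urlsplit, urlunsplit
    from urllib.request import Request, urlopen

    def get_as_post(url):
        # Take a GET-style URL with a query string and resubmit it as if
        # it had been a POST form with the same parameters.
        parts = urlsplit(url)
        body = parts.query.encode("ascii")
        post_url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        req = Request(post_url, data=body,
                      headers={"Content-Type": "application/x-www-form-urlencoded"})
        return urlopen(req)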

Unfortunately I've seen at least one piece of software that went the other way, issuing GET requests for URLs that only appeared as the targets of POST form actions. Since it tried this inside CSpace, the requests went down in flames, because I'm cautious about anything involving POST (and I get grumpy when things 'rattle the doorknobs').

(The crawler in question was called SBIder, from sitesell.com, and this behavior is one reason it is now listed in our robots.txt.)


Comments on this page:

By DanielMartin at 2006-01-04 00:14:06:

In practice, I believe that web traversal software should behave as if there were two URL namespaces on websites: one for GET and HEAD requests, and a completely independent one for POST requests.

Uh... I agree with this only if you follow it with the following, in big bold letters or whatever else is necessary to get spider authors to pay attention:

POST requests should be expected to have arbitrary server-side consequences. This means that NO program should ever issue a POST request to a URL without explicit human authorization for that URL. (So a pre-arranged web service is fine, but a spider doing a POST is not.)

Although, come to think of it, I don't actually agree with the two namespaces issue then either. I do however agree with this:

The targets of <form action="..."> should only be requested by a user agent as the result of an input element on that form being activated. Specifically, the targets of form elements with method="POST" should not be expected to respond to GET requests, nor should those with method="GET" be expected to respond sensibly to a non-query url.
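A minimal sketch of how a spider could honor that rule, using just Python's standard library (the class name is made up):

    from html.parser import HTMLParser

    class FormFinder(HTMLParser):
        # Record each form's method along with its action, so that POST
        # form targets are never probed with a bare GET.
        def __init__(self):
            super().__init__()
            self.forms = []

        def handle_starttag(self, tag, attrs):
            if tag == "form":
                d = dict(attrs)
                self.forms.append((d.get("method", "get").lower(),
                                   d.get("action", "")))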

I'm tempted to consider probing the targets of forms with ordinary GET requests, as this spider is doing, hostile behavior equivalent to someone testing for well-known insecure scripts.

By cks at 2006-01-12 13:51:18:

I agree with POST forms being entirely off limits for spiders. But POST gets used for more than forms these days, and sooner or later someone is going to try to auto-discover them, and I want those people to understand that in practice POST and GET are separate namespaces and they shouldn't cross over.

The example I can think of right now is XML/RPC, which is done over POST and has an (optional) service discovery protocol. I wouldn't be surprised if a standardized method of marking public XML/RPC endpoints gets created someday (or it may already exist).
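For concreteness, here is a minimal sketch of an XML/RPC call from Python; the endpoint URL is made up, and the introspection call only works if the server supports it. The point is that everything here travels as a POST to a URL that may never show up in any GET context:

    import xmlrpc.client

    # Every XML-RPC call is an HTTP POST with an XML payload; GETting the
    # same endpoint URL may well just produce an error.
    proxy = xmlrpc.client.ServerProxy("https://www.example.com/xmlrpc")
    print(proxy.system.listMethods())   # optional introspection/discovery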
