In practice, there are multiple namespaces for URLs

January 3, 2006

In theory, the HTTP and URI/URL standards say that URLs all live in a single namespace, as opposed to GET, POST, and so on each using different URL namespaces, where some URLs exist only for POST and some only for GET.

In practice, I believe that web traversal software should behave as if there were two URL namespaces on websites: one for GET and HEAD requests, and a completely independent one for POST requests. Crawling software should not issue 'cross-namespace' URL requests, because you simply can't assume that a URL that is valid in one can even be used in the other.
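To make the idea concrete, here is a minimal Python sketch of how a crawler might keep the two namespaces apart; the names and the set-based bookkeeping are purely illustrative, not any real crawler's API.

    # A minimal sketch of the two-namespace idea for a crawler; the names
    # here (get_namespace, may_request, and so on) are hypothetical.

    # URLs seen as ordinary links, redirects, and so on: the GET/HEAD namespace.
    get_namespace = set()
    # URLs seen only as the action of a POST form: the POST namespace.
    post_namespace = set()

    def record_link(url):
        """Record a URL found in an <a href>, a redirect, etc."""
        get_namespace.add(url)

    def record_form_action(url, method):
        """Record a URL found as a form's action attribute."""
        if method.upper() == "POST":
            post_namespace.add(url)
        else:
            get_namespace.add(url)

    def may_request(url, method):
        """Refuse cross-namespace requests: only use a URL with the
        kind of request it was seen with."""
        if method.upper() in ("GET", "HEAD"):
            return url in get_namespace
        if method.upper() == "POST":
            return url in post_namespace
        return False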

Avoiding cross-namespace requests isn't very hard in the POST direction; not much software makes POST requests, and there are plenty of things that make it difficult to send useful POST requests off to URLs you've only seen in GET contexts. (In theory you could try converting GET requests with parameters into POST form requests with the same parameters, but I suspect this would strike people as at least dangerous and questionable.)
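Purely to illustrate the mechanics of that conversion (not to recommend it), here is a rough Python 3 sketch; get_as_post is a hypothetical helper, not something any library provides.

    # Illustration only: turning a GET URL with parameters into an
    # equivalent POST form submission. As noted above, actually doing
    # this to other people's URLs is dangerous and questionable.
    from urllib.parse import urlsplit, parse_qsl, urlencode
    from urllib.request import Request, urlopen

    def get_as_post(url):
        """Strip the query string from a URL and resubmit it as a POST body."""
        parts = urlsplit(url)
        form_data = urlencode(parse_qsl(parts.query)).encode("ascii")
        bare_url = parts._replace(query="").geturl()
        # Supplying a request body makes urlopen() issue a POST request.
        return urlopen(Request(bare_url, data=form_data))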

Unfortunately, I've seen at least one piece of software that went the other way, issuing GET requests for URLs that appeared only as the targets of POST form actions. Since it tried this inside CSpace, the requests went down in flames, because I'm cautious about anything involving POST (and I get grumpy when things 'rattle the doorknobs').

(The crawler in question was called SBIder, from sitesell.com, and this behavior is one reason it is now listed in our robots.txt.)
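For reference, excluding a specific crawler via robots.txt looks something like the following; the actual entry in our robots.txt may be phrased differently.

    User-agent: SBIder
    Disallow: /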
