2006-04-26
Common web spider mistakes
It turns out that a lot of web spiders crawl my website here, and because I read my web server logs more than the average monkey I've gotten to see a number of web spider mistakes over and over again. These mistakes are so common that it's a rare day when I don't see at least one of them in the logs.
(These are mistakes as opposed to stupid spider tricks because they are so clearly wrong by the basic rules of HTTP.)
So, stupid web spider mistakes:
- a leading slash on the URL in an <a href> means that you should
not prepend the current page's URL to it when following the link.
I am honestly boggled that any software gets this wrong, but
some PHP-based thing appears to.
- trailing slashes on the ends of URLs are not some optional
element that you can omit. They have important semantic
meaning, and the only thing you get if you leave them off
is an HTTP redirect. (In that I am being nice; some places
don't even give you the redirect.)
- although the hostname component of URLs is case-independent,
the rest of the URL is not. Spiders that lower-case URLs here
get 404s, not the pages they are looking for.
- the query parameter '?foo' is not the same as the query
parameter '?foo='. You cannot rewrite the one into the other.
- URLs that spiders give to web servers should not include fragment
identifiers on the end (such as '#comments'). At least not if
your spider wants me to give you any pages.
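As a rough illustration of what getting these right might look like, here is a minimal Python sketch using the standard library's urllib.parse. The function name resolve_link and the example.org URLs are made up for illustration, and the sketch assumes URLs without a 'user:password@' part, since it lower-cases the whole network location.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

def resolve_link(page_url, href):
    """Turn an <a href> value found on page_url into the URL to request.

    Hypothetical helper for illustration; assumes no userinfo in the URL.
    """
    # urljoin applies the standard relative-reference rules, so an href
    # with a leading slash replaces the page's whole path instead of
    # being appended to it.
    absolute = urljoin(page_url, href)
    parts = urlsplit(absolute)
    # Only the hostname is case-independent; the path and the query are
    # passed through untouched, trailing slash and all.
    netloc = parts.netloc.lower()
    # The fragment is for the client only, so it is dropped here and
    # never sent to the web server.
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, ""))

if __name__ == "__main__":
    print(resolve_link("http://Example.ORG/blog/entry", "/About/Me/"))
    print(resolve_link("http://example.org/blog/", "entry?foo#comments"))
    print(resolve_link("http://example.org/blog/", "entry?foo="))
```

Note that the sketch never adds or removes a trailing slash and never touches the query string, so '?foo' and '?foo=' stay distinct.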
In writing this I actually read the HTTP 1.1 RFC and found out that in theory I should be accepting absolute URLs, with the 'http://hostname/' on the front, although equally in theory no one should be sending them (since our web server is not a proxy). All of the fragment identifier requests in the past 28 days are for absolute paths, though, not full absolute URLs, and fragments aren't allowed in either absolute paths or absolute URLs in HTTP.
Reading the RFC also answers the obvious question about absolute URLs: if a request uses an absolute URL, any Host: header is ignored and you take the hostname from the absolute URL (section 5.2). This both simplifies and complicates my life. (Also, missing Host: headers are an error in HTTP 1.1 requests (section 19.6.1.1). Fortunately, it appears that Apache already checks for that for me.)
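The host selection rule can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not how Apache or any real server implements it. The function name effective_host, the ValueError standing in for a 400 response, and the assumption that headers arrive as a dict with lower-cased names are all mine.

```python
from urllib.parse import urlsplit

def effective_host(request_target, headers, http_version="HTTP/1.1"):
    """Pick the host a request is for (hypothetical helper).

    headers is assumed to be a dict keyed by lower-cased header names.
    """
    # A missing Host: header is an error for HTTP 1.1 requests; a real
    # server would answer with 400 Bad Request instead of raising.
    if http_version == "HTTP/1.1" and "host" not in headers:
        raise ValueError("400 Bad Request: missing Host header")
    parts = urlsplit(request_target)
    if parts.scheme and parts.netloc:
        # Absolute URL in the request line: its hostname wins and any
        # Host: header is ignored.
        return parts.hostname
    # Ordinary absolute path: fall back to the Host: header (which may
    # be absent for HTTP 1.0 requests, giving None).
    return headers.get("host")
```

For example, effective_host("http://example.org/a", {"host": "other.example"}) gives "example.org", while effective_host("/a", {"host": "example.org"}) takes the name from the header.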