Wandering Thoughts archives

2006-04-26

Common web spider mistakes

It turns out that a lot of web spiders crawl my website here, and because I read my web server logs more than the average monkey I've gotten to see a number of web spider mistakes over and over again. These mistakes are so common that it's a rare day when I don't see at least one of them in the logs.

(These are mistakes as opposed to stupid spider tricks because they are so clearly wrong by the basic rules of HTTP.)

So, stupid web spider mistakes:

  • a leading slash on the URL in an <a href> means the URL is relative to the root of the site, so you should not prepend the current page's path to it when following the link. I am honestly boggled that any software gets this wrong, but some PHP-based thing appears to.

  • trailing slashes on the ends of URLs are not some optional element that you can omit. They have important semantic meaning, and the only thing you get if you leave them off is an HTTP redirect. (In that I am being nice; some places don't even give you the redirect.)

  • although the hostname component of URLs is case-insensitive, the rest of the URL is not. Spiders that lower-case URLs here get 404s, not the pages they are looking for.

  • the query parameter '?foo' is not the same as the query parameter '?foo='. You cannot rewrite the one into the other.

  • URLs that spiders give to web servers should not include fragment identifiers on the end (such as '#comments'). At least not if your spider wants me to give you any pages.
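Most of these mistakes can be avoided by just using a real URL library instead of hand-rolled string mangling. As a rough sketch of what correct handling looks like, here is how Python's standard urllib.parse behaves on the cases above (the example URLs are made up for illustration):

```python
from urllib.parse import urljoin, urldefrag, urlsplit, urlunsplit

base = "http://example.com/blog/entries/page.html"

# A leading slash means root-relative: resolve against the host root,
# not against the current page's path.
assert urljoin(base, "/archives/") == "http://example.com/archives/"
# Only slash-less relative URLs resolve against the current page's directory.
assert urljoin(base, "other.html") == "http://example.com/blog/entries/other.html"

# Strip the fragment before making the request; the server should never see it.
url, frag = urldefrag("http://example.com/page#comments")
assert url == "http://example.com/page" and frag == "comments"

# Only the hostname is case-insensitive; lower-case it and nothing else.
parts = urlsplit("http://Example.COM/Some/Path?Foo")
canon = urlunsplit(parts._replace(netloc=parts.netloc.lower()))
assert canon == "http://example.com/Some/Path?Foo"

# '?foo' and '?foo=' are different query strings; don't rewrite one into the other.
assert urlsplit("http://example.com/x?foo").query != urlsplit("http://example.com/x?foo=").query
```

(Note that nothing here adds or removes trailing slashes for you; a spider simply has to preserve them as found.)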

In writing this I actually read the HTTP 1.1 RFC and found out that in theory I should be accepting absolute URLs, with the 'http://hostname/' on the front, although equally in theory no one should be sending them (since our web server is not a proxy). All of the fragment identifier requests in the past 28 days are for absolute paths, though, not full absolute URLs, and fragments aren't allowed in either absolute paths or absolute URLs in HTTP.

Reading the RFC also answers the obvious question about absolute URLs: if a request uses an absolute URL, any Host: header is ignored and you take the hostname from the absolute URL (section 5.2). This both simplifies and complicates my life. (Also, missing Host: headers are an error in HTTP 1.1 requests (section 19.6.1.1). Fortunately, it appears that Apache already checks for that for me.)
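The section 5.2 rule can be sketched roughly like this; this is a toy illustration of the precedence order, not real server code, and the function name and example request targets are my own invention:

```python
from typing import Optional
from urllib.parse import urlsplit

def effective_host(request_target: str, host_header: Optional[str]) -> str:
    """Pick the host per RFC 2616 section 5.2: a hostname in an absolute
    request URL wins over any Host: header; otherwise use Host:."""
    parts = urlsplit(request_target)
    if parts.scheme and parts.netloc:
        return parts.netloc  # absolute URL: any Host: header is ignored
    if host_header is None:
        # HTTP/1.1 requests without Host: are an error (section 19.6.1.1).
        raise ValueError("HTTP/1.1 request without a Host: header")
    return host_header

# An absolute request URL overrides the Host: header.
assert effective_host("http://example.com/page", "wrong.example.org") == "example.com"
# A normal absolute-path request takes the hostname from Host:.
assert effective_host("/some/page", "example.com") == "example.com"
```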

web/StupidSpiderMistakes written at 02:04:40


