== Common web spider mistakes

It turns out that a lot of web spiders crawl my website here, and because I read my web server logs more than the average monkey, I've gotten to see a number of web spider mistakes over and over again. These mistakes are so common that it's a rare day when I don't see at least one of them in the logs.

(These are mistakes as opposed to [[stupid spider tricks StupidSpiderTricks]] because they are so clearly *wrong* by the basic rules of HTTP.)

So, stupid web spider mistakes:

* a leading slash on the URL in an '_<a href>_' means that you should not prepend the current page's URL to it when following the link. I am honestly boggled that any software gets this wrong, but some PHP-based thing appears to.
* trailing slashes on the ends of URLs are not some optional element that you can omit. They have important semantic meaning, and the only thing you get if you leave them off is an HTTP redirect. (In that I am being nice; some places don't even give you the redirect.)
* although the hostname component of URLs is case-insensitive, the rest of the URL is not. Spiders that lower-case URLs here get 404s, not the pages they are looking for.
* the query parameter '_?foo_' is *not* the same as the query parameter '_?foo=_'. You cannot rewrite the one into the other.
* URLs that spiders give to web servers should not include fragment identifiers on the end (such as '_#comments_'). At least not if your spider wants me to give you any pages.

In writing this I actually read the [[HTTP 1.1 RFC http://www.w3.org/Protocols/rfc2616/rfc2616.html]] and found out that in theory I should be accepting absolute URLs, with the '!http://hostname/' on the front, although equally in theory no one should be sending them (since our web server is not a proxy). All of the fragment identifier requests in the past 28 days are for absolute paths, though, not full absolute URLs, and fragments aren't allowed in either absolute paths or absolute URLs in HTTP.

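
To make the list of mistakes above concrete, here is a minimal Python sketch of how a spider could turn a link into the URL it actually requests without making any of them. This is only an illustration, not what any particular spider does: the function name _request_url_ and the example.com URLs are made up, and it leans on the standard library's _urllib.parse_ to do the link resolution.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def request_url(base_url, href):
    """Resolve href (as found on the page at base_url) into the URL to fetch.

    request_url is a hypothetical helper name used only for this sketch.
    """
    # urljoin implements relative reference resolution: an href with a
    # leading slash replaces the base path instead of being appended to it.
    absolute = urljoin(base_url, href)
    parts = urlsplit(absolute)

    # Only the hostname is case-insensitive. Lower-case the host (and port)
    # but leave any userinfo, the path, and the query exactly as written.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()

    # The fragment is for the client only; never send it to the server.
    # The path (including any trailing slash) and the raw query string
    # (so '?foo' stays '?foo' and '?foo=' stays '?foo=') are not touched.
    return urlunsplit(parts._replace(netloc=netloc, fragment=""))


if __name__ == "__main__":
    base = "http://Example.COM/Blog/SomePage/"
    for href in ("/about/", "OtherPage?foo", "OtherPage?foo=", "Entry/#comments"):
        print(href, "->", request_url(base, href))
```

Note that the query string is carried through as an opaque string. Parsing it into key/value pairs and re-encoding it is one easy way to accidentally turn '_?foo_' into '_?foo=_'.
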
Reading the RFC also answers the obvious question about absolute URLs: if a request uses an absolute URL, any _Host:_ header is ignored and you take the hostname from the absolute URL ([[section 5.2 http://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html#sec5.2]]). This both simplifies and complicates my life.

(Also, missing _Host:_ headers are an error in HTTP 1.1 requests ([[section 19.6.1.1 http://www.w3.org/Protocols/rfc2616/rfc2616-sec19.html#sec19.6.1.1]]). Fortunately, it appears that Apache already checks for that for me.)
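
The two RFC rules above can be sketched in a few lines of Python. This is a rough illustration of the logic only, not how Apache or my own code does it: the function name _effective_host_ is invented for the example, the headers are assumed to arrive as a plain dict keyed by the exact string "Host", and the hostnames are made up.

```python
from urllib.parse import urlsplit


def effective_host(request_target, headers, http_version="HTTP/1.1"):
    """Return the hostname a request is for, following RFC 2616 section 5.2.

    effective_host is a hypothetical helper; headers is assumed to be a dict
    whose keys have already been normalized so that "Host" matches.
    """
    host_header = headers.get("Host")

    # RFC 2616 section 19.6.1.1: an HTTP/1.1 request without a Host header
    # is an error (the server answers 400 Bad Request).
    if http_version == "HTTP/1.1" and host_header is None:
        raise ValueError("400 Bad Request: HTTP/1.1 request without Host header")

    # If the request uses an absolute URL, the host comes from the URL and
    # any Host header value is ignored.
    parts = urlsplit(request_target)
    if parts.scheme and parts.netloc:
        return parts.netloc.lower()

    # Otherwise (an absolute path such as "/page"), use the Host header.
    # For HTTP/1.0 requests this may legitimately be missing.
    return host_header.lower() if host_header is not None else None


if __name__ == "__main__":
    print(effective_host("http://Example.COM/page", {"Host": "other.example"}))
    print(effective_host("/page", {"Host": "Example.COM"}))
```
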