Common web spider mistakes

It turns out that a lot of web spiders crawl my website here, and because I read my web server logs more than the average monkey I've gotten to see a number of web spider mistakes over and over again. These mistakes are so common that it's a rare day when I don't see at least one of them in the logs. (These are mistakes as opposed to stupid spider tricks because they are so clearly wrong by the basic rules of HTTP.) So, stupid web spider mistakes:
In writing this I actually read the HTTP 1.1 RFC and found out that in theory I should be accepting absolute URLs, with the 'http://hostname/' on the front, although equally in theory no one should be sending them (since our web server is not a proxy). All of the fragment identifier requests in the past 28 days are for absolute paths, though, not full absolute URLs, and fragments aren't allowed in either absolute paths or absolute URLs in HTTP. Reading the RFC also answers the obvious question about absolute URLs: if a request uses an absolute URL, any
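
For anyone who wants to go looking for the same thing in their own logs, here is a minimal sketch of how the request targets discussed above could be sorted out. It only looks at the target string from the request line, says whether it is an absolute path or a full absolute URL, and flags a raw '#' (which means a fragment identifier was sent to the server). The function name `classify_request_target` and the `example.org` hostname are made up for illustration; this is not how the numbers above were actually produced.

```python
from urllib.parse import urlsplit


def classify_request_target(target: str) -> dict:
    """Classify the request target taken from an HTTP request line.

    Hypothetical helper: 'form' is one of 'absolute-path',
    'absolute-url', or 'other'; 'has_fragment' is True if the
    target contains a raw '#'.
    """
    # A raw '#' should never reach the server; clients are supposed
    # to strip the fragment before making the request.
    has_fragment = "#" in target

    if target.startswith("/"):
        form = "absolute-path"
    else:
        parts = urlsplit(target)
        if parts.scheme in ("http", "https") and parts.netloc:
            form = "absolute-url"
        else:
            # Anything else, such as '*' or a bare hostname.
            form = "other"

    return {"form": form, "has_fragment": has_fragment}


if __name__ == "__main__":
    for t in ("/blog/page#comments", "http://example.org/blog/", "*"):
        print(t, classify_request_target(t))
```

Checking for a raw '#' rather than parsing out the fragment is deliberate: a legitimate '#' inside a path or query would have to be sent as '%23', so any literal one in the request line is a spider mistake.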
This is part of CSpace, and is written by ChrisSiebenmann.