robots.txt is a hint and a social contract between sites and web spiders

February 17, 2017

I recently read the Archive Team's Robots.txt is a suicide note (via), which strongly advocates removing your robots.txt. As it happens, I have a somewhat different view (including about its claim that sites don't crash under load any more; we have students who would beg to differ).

The simple way to put it is that the things I add to robots.txt are hints to web spiders. Some of the time they are a hint that crawling the particular URL hierarchy will not be successful anyway, for example because the hierarchy requires authentication that the robot doesn't have. We have inward-facing websites with sections that provide web-based services to local users, and for that matter we have a webmail system. You can try to crawl those URLs all day, but you're not getting anywhere and you never will.
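(As an illustration, the robots.txt entries for this sort of hint might look something like the following; the specific paths here are made up rather than being our actual layout.)

    User-agent: *
    # These areas require authentication that no web spider has, so
    # crawling them gets you nothing but login pages and errors.
    Disallow: /webmail/
    Disallow: /internal-services/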

Some of the time my robots.txt entries are a hint that if you crawl this anyway and I notice, I will use server settings to block your robot from the entire site, including content that I was letting you crawl before then. Presumably you would like to crawl some of the content instead of none of it, but if you feel otherwise, well, crawl away. The same is true of signals like Crawl-Delay; you can decide to ignore these, but if you do, our next line of defense is blocking you entirely. And we will.
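For the curious, here is a minimal sketch of how a polite spider can check both sorts of hints before fetching anything, using Python's standard urllib.robotparser; the user agent name and URLs are made up for illustration.

    from urllib.robotparser import RobotFileParser

    AGENT = "examplebot"    # hypothetical spider name

    # Fetch and parse the site's robots.txt.
    robots = RobotFileParser()
    robots.set_url("https://www.example.com/robots.txt")
    robots.read()

    url = "https://www.example.com/webmail/"
    if robots.can_fetch(AGENT, url):
        # Honour Crawl-Delay if the site asked for one (None if it didn't).
        delay = robots.crawl_delay(AGENT)
        # ... sleep for `delay` seconds between requests, then fetch the URL ...
    else:
        # The site has said not to crawl this; skipping it keeps us on the
        # polite side of the social contract (and off the banned list).
        pass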

(There are other sorts of hints, and for complex URL structures some of them are delivered through nofollow instead of robots.txt. Beyond not irritating me, there are good operational reasons to pay attention to these hints.)

This points to the larger scale view of what robots.txt is, which is a social contract between sites and web spiders. Sites say 'respect these limits and we will (probably) not block you further'. As a direct consequence of this, robots.txt is also one method to see whether a web spider is polite and well behaved or whether it is rude and nasty. A well behaved web spider respects robots.txt; a nasty one does not. Any web spider that is crawling URLs that are blocked in a long-standing robots.txt is not a nice spider, and you can immediately proceed to whatever stronger measures you feel like using against such things (up to and including firewall IP address range bans, if you want).

By the way, it is a feature that robots identify themselves when matching against robots.txt. An honest and polite web spider is in a better position to know what it is than a site that has to look at the User-Agent and other indicators, especially because people do dangerous things with their user-agent strings. If I ban a bad robot via server settings and you claim to be sort of like that bad robot for some reason, I'm probably banning you too as a side effect, and I'm unlikely to care if that's a misfire; by and large it's your problem.
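(If you're wondering what 'server settings' means in practice, one common approach on Apache is roughly the following sketch; the robot name here is hypothetical, and other web servers have their own equivalents.)

    <Location "/">
        # Tag requests from the offending robot by its User-Agent ...
        SetEnvIfNoCase User-Agent "badbot" blocked_spider
        # ... and refuse them while still allowing everyone else.
        <RequireAll>
            Require all granted
            Require not env blocked_spider
        </RequireAll>
    </Location>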

(With all of this said, the Archive Team has a completely sensible reason for ignoring robots.txt and I broadly support them doing so. They will run into various sorts of problems from time to time as a result of this, but they know what they're doing so I'm sure they can sort the problems out.)
