robots.txt is a hint and a social contract between sites and web spiders
I recently read the Archive Team's Robots.txt is a suicide note (via), which strongly advocates removing your robots.txt. As it happens, I have a somewhat different view (including about its claim that sites don't crash under load any more; we have students who beg to differ).
The simple way to put it is that the things I add to robots.txt are hints to web spiders. Some of the time they are a hint that crawling the particular URL hierarchy will not be successful anyway, for example because the hierarchy requires authentication that the robot doesn't have. We have inward-facing websites with sections that provide web-based services to local users, and for that matter we have a webmail system. You can try to crawl those URLs all day, but you're not getting anywhere and you never will.
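As a concrete sketch of that first kind of hint (the paths here are made up for illustration, not our actual URL layout), the entries look something like:

    User-agent: *
    Disallow: /webmail/
    Disallow: /internal/

A spider that honors this simply never wastes its time (or our logs) on URLs that would only ever hand it login pages.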
Some of the time my robots.txt entries are a hint that if you crawl this anyway and I notice, I will use server settings to block your robot from the entire site, including content that I was letting you crawl before then. Presumably you would like to crawl some of the content instead of none of it, but if you feel otherwise, well, crawl away. The same is true of signals like Crawl-Delay; you can decide to ignore these, but if you do, our next line of defense is blocking you entirely. And we will.
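To make the polite side of this concrete, here is a minimal sketch of a spider honoring both robots.txt and Crawl-Delay, using Python's standard urllib.robotparser; the spider name and site here are hypothetical, and a real crawler would obviously do much more:

    import time
    import urllib.request
    import urllib.robotparser

    AGENT = "examplebot"              # hypothetical spider name
    SITE = "https://www.example.org"  # hypothetical target site

    # Fetch and parse the site's robots.txt once up front.
    rp = urllib.robotparser.RobotFileParser(SITE + "/robots.txt")
    rp.read()

    url = SITE + "/some/page"
    if rp.can_fetch(AGENT, url):
        # Honor Crawl-Delay if the site asked for one.
        delay = rp.crawl_delay(AGENT)
        if delay:
            time.sleep(delay)
        req = urllib.request.Request(url, headers={"User-Agent": AGENT})
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
    else:
        # The site said no; a polite spider moves on.
        pass

A spider that skips the can_fetch() check or ignores the delay is exactly the sort of thing that winds up blocked entirely.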
(There are other sorts of hints, and for complex URL structures some hints of all of these sorts are delivered through nofollow. Beyond not irritating me, there are good operational reasons to pay attention to this.)
This points to the larger-scale view of what robots.txt is, which is a social contract between sites and web spiders. Sites say 'respect these limits and we will (probably) not block you further'. As a direct consequence of this, robots.txt is also one method to see whether a web spider is polite and well-behaved or whether it is rude and nasty. A well-behaved web spider respects robots.txt; a nasty one does not. Any web spider that is crawling URLs that are blocked in a long-standing robots.txt is not a nice spider, and you can immediately proceed to whatever stronger measures you feel like using against such things (up to and including firewall IP address range bans, if you want).
By the way, it is a feature that robots self-identify when matching robots.txt. An honest and polite web spider is in a better position to know what it is than a site that has to look at the User-Agent and other indicators, especially because people do dangerous things with their user-agent strings. If I ban a bad robot via server settings and you claim to be sort of like that bad robot for some reason, I'm probably banning you too as a side effect, and I'm unlikely to care if that's a misfire; by and large it's your problem.
(With all of this said, the Archive Team has a completely sensible reason for ignoring robots.txt and I broadly support them doing so. They will run into various sorts of problems from time to time as a result of this, but they know what they're doing so I'm sure they can sort the problems out.)