2013-05-23
Why web robots sending Referer headers is wrong
I've written before on my view that web robots of all sorts should
never send a Referer header. In those entries I mostly said 'don't do
that' without giving a solid philosophical argument about why, so today
I feel like changing that.
(Not that a philosophical argument actually matters. Proper behavior
on the web is defined by social convention, ie by what lots of other
people do and expect, not by arguing with people over what makes
sense. Whether or not you agree with a social convention you break it
at your peril, and today robots not sending Referer headers is a
well established social convention that I will ban you for violating.
And anyways the people who should read this never will.)
There are two philosophical reasons why it's wrong for robots to
send Referer headers. The first is inherent in what the Referer
header means, namely 'I just followed a link from page <X>'. This is a
description of human behavior but not really of robot behavior; almost
no web robot actually traverses the web in that way, finding links and
immediately following them. If you crawl web pages, accumulate links,
and then some time later crawl those links, you are not 'following a
link' in any conventional sense. Worse, what happens if you discover
the same link through multiple source documents? Which document gets
'credit' and appears in Referer?
(Yes, yes, this is not quite the spec definition, which kind of permits the 'I found it here' meaning that robots sometimes use. It is instead the practical definition of the header, as defined by how most everything behaves.)
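To make the 'which document gets credit' problem concrete, here is a minimal sketch of a crawl queue in Python. Everything in it is hypothetical (the Frontier class, the ExampleBot user agent, the example.org URLs); it only illustrates that a queue-and-fetch-later robot typically knows several sources for a link and has no honest single value to put in Referer, so the simple thing is to not send one.

```python
from collections import defaultdict
from urllib.request import Request


class Frontier:
    """Hypothetical crawl queue: remembers every page that linked to a URL."""

    def __init__(self):
        # target URL -> set of source pages where the link was seen
        self.sources = defaultdict(set)

    def add_link(self, source_url, target_url):
        self.sources[target_url].add(source_url)

    def build_request(self, target_url):
        # The fetch happens some time after discovery and there may be many
        # sources, so no Referer header is set. Only an identifying
        # User-Agent (an invented one here) is sent.
        return Request(
            target_url,
            headers={"User-Agent": "ExampleBot/0.1 (+https://example.org/bot)"},
        )


frontier = Frontier()
frontier.add_link("https://a.example/page1", "https://target.example/post")
frontier.add_link("https://b.example/page2", "https://target.example/post")

# Building the request does not touch the network.
req = frontier.build_request("https://target.example/post")
```

Two different pages found the same link, and the request that eventually goes out names neither of them.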
So, you say, you don't care; you want to use Referer as a kind of
'this is what links to you' field for servers. I can summarize a bunch
of problems here by saying that the Referer field is a terrible way
to communicate this information to web operators, fundamentally because
you are trying to use a side effect of HTTP requests to pass on what may
be a huge amount of information. If you actually want to be useful you
should make this information available on your own web site where people
can see and fetch it in bulk.
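As a rough illustration of what 'available in bulk' could look like, here is a sketch that groups a robot's discovered links by the host they point to and renders one host's inbound links as JSON that could be served from the robot operator's own site. The function names, the (source, target) pair format, and the JSON layout are all assumptions made for this example, not any existing service's format.

```python
import json
from collections import defaultdict
from urllib.parse import urlsplit


def links_by_target_host(link_pairs):
    """Group (source_url, target_url) pairs as host -> target -> [sources]."""
    report = defaultdict(lambda: defaultdict(list))
    for source_url, target_url in link_pairs:
        host = urlsplit(target_url).hostname
        report[host][target_url].append(source_url)
    return report


def render_report(report, host):
    """Render one host's inbound links as JSON for bulk download."""
    return json.dumps(report.get(host, {}), indent=2, sort_keys=True)


pairs = [
    ("https://a.example/page1", "https://target.example/post"),
    ("https://b.example/page2", "https://target.example/post"),
    ("https://a.example/page1", "https://other.example/"),
]
report = links_by_target_host(pairs)
print(render_report(report, "target.example"))
```

A site operator who cares can fetch the whole report in one request, instead of piecing it together from Referer headers scattered through their logs.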
Finally, the brutal truth is that 'who links to me' is far less
interesting than 'who is sending human traffic to me (right now)'. By
far the most valuable part of Referer is information on where real
(human) visitors are coming from, to the extent that it's possible
to find this out. Being read by people is
the ultimate purpose of most web pages, which makes the places that are
actually sending us traffic through active links of decided interest to
us. And this sort of human behavior has very little to do with either
robot behavior or what potential links exist out there in the world.
Mingling either your robot's actions or a 'helpful' attempt to tell us
about the latter in with this information is not doing us any favours;
rather the contrary, in fact (this is one large reason that I react
angrily to robots sending Referer).
(There is also the inconvenient fact that once you're operating a decent sized site you're not likely to really care about who links to you because there will be far too many links out there, most of them in increasingly obscure and unimportant places. The links you do care about are exactly the links that send you significant traffic.)
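Here is a sketch of the sort of 'who is sending me human traffic' tally a site operator might run over their logs. The input format (already-parsed pairs of User-Agent and Referer) and the naive BOT_MARKERS substring check are assumptions for illustration; real logs and real robot detection are messier.

```python
from collections import Counter
from urllib.parse import urlsplit

# Naive assumption: robots identify themselves with one of these substrings.
BOT_MARKERS = ("bot", "crawler", "spider")


def human_referrer_hosts(requests):
    """Count Referer hosts over (user_agent, referer) pairs, skipping
    requests with no Referer and requests from self-identified robots."""
    counts = Counter()
    for user_agent, referer in requests:
        if not referer:
            continue
        if any(marker in user_agent.lower() for marker in BOT_MARKERS):
            continue
        host = urlsplit(referer).hostname
        if host:
            counts[host] += 1
    return counts


sample = [
    ("Mozilla/5.0 (X11; Linux x86_64)", "https://news.example/item?id=1"),
    ("Mozilla/5.0 (Windows NT 6.1)", "https://news.example/item?id=1"),
    ("ExampleBot/0.1", "https://spam.example/listing"),
    ("Mozilla/5.0 (Macintosh)", ""),
]
counts = human_referrer_hosts(sample)
```

The filter only works when robots identify themselves. A robot that sends Referer under a browser-like User-Agent lands straight in these counts, which is exactly the pollution being complained about.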