Why web robots sending Referer headers is wrong

May 23, 2013

I've written before on my view that web robots of all sorts should never send a Referer header. In those entries I mostly said 'don't do that' without giving a solid philosophical argument about why, so today I feel like changing that.

(Not that a philosophical argument actually matters. Proper behavior on the web is defined by social convention, i.e. by what lots of other people do and expect, not by arguing with people over what makes sense. Whether or not you agree with a social convention, you break it at your peril, and today robots not sending Referer headers is a well-established social convention that I will ban you for violating. And anyway, the people who should read this never will.)

There are two philosophical reasons why it's wrong for robots to send Referer headers. The first is inherent in what the Referer header means, namely 'I just followed a link from page <X>'. This is a description of human behavior but not really of robot behavior; almost no web robot actually traverses the web in that way, finding links and immediately following them. If you crawl web pages, accumulate links, and then some time later crawl those links, you are not 'following a link' in any conventional sense. Worse, what happens if you discover the same link through multiple source documents? Which document gets 'credit' and appears in Referer?

(Yes, yes, this is not quite the spec definition, which kind of permits the 'I found it here' meaning that robots sometimes use. It is instead the practical definition of the header, as defined by how most everything behaves.)
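To make the 'accumulate links, crawl later' pattern concrete, here is a minimal sketch of a crawl frontier. It is an illustration under my own assumptions, not code from any real crawler: the Frontier class, the ExampleBot User-Agent, and all of the example URLs are made up. The point is that by fetch time a URL may have several source pages and none of them is 'the' page the robot followed a link from, so the honest request carries no Referer at all.

```python
from collections import defaultdict, deque
from urllib.request import Request


class Frontier:
    """Hypothetical crawl frontier: remembers every page that linked to a URL."""

    def __init__(self):
        self.sources = defaultdict(set)  # url -> set of pages that link to it
        self.queue = deque()             # URLs waiting to be fetched later

    def add_link(self, source_page, url):
        # Queue each URL only once, however many pages link to it.
        if url not in self.sources:
            self.queue.append(url)
        self.sources[url].add(source_page)

    def next_request(self):
        # Build the request some time after discovery. There is no single
        # 'page we came from', so no Referer header is set.
        url = self.queue.popleft()
        return Request(
            url,
            headers={"User-Agent": "ExampleBot/0.1 (+https://crawler.example/about)"},
        )


frontier = Frontier()
frontier.add_link("https://a.example/page1", "https://target.example/post")
frontier.add_link("https://b.example/page2", "https://target.example/post")

req = frontier.next_request()
print(req.has_header("Referer"))                                # False
print(sorted(frontier.sources["https://target.example/post"]))  # both source pages
```

The request object is only built here, not sent, so this runs without touching the network.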

So, you say, you don't care; you want to use Referer as a kind of 'this is what links to you' field for servers. I can summarize a bunch of problems here by saying that the Referer field is a terrible way to communicate this information to web operators, fundamentally because you are trying to use a side effect of HTTP requests to pass on what may be a huge amount of information. If you actually want to be useful you should make this information available on your own web site where people can see and fetch it in bulk.
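As a sketch of what 'available in bulk' could look like, a robot operator might publish the links they have found as a plain CSV file that site operators can download. The function name, the column names, and the example URLs here are all hypothetical; this is one plausible shape for such an export, not an established format.

```python
import csv
import io


def export_inbound_links(link_pairs):
    """Render (target_url, source_url) pairs as CSV text, deduplicated and sorted."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["target_url", "source_url"])  # hypothetical column names
    for target, source in sorted(set(link_pairs)):
        writer.writerow([target, source])
    return buf.getvalue()


report = export_inbound_links([
    ("https://target.example/post", "https://b.example/page2"),
    ("https://target.example/post", "https://a.example/page1"),
    ("https://target.example/post", "https://a.example/page1"),  # duplicate
])
print(report)
```

Unlike a Referer header, a file like this can carry every known source for a URL at once, and people can fetch it when they actually want it.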

Finally, the brutal truth is that 'who links to me' is far less interesting than 'who is sending human traffic to me (right now)'. By far the most valuable part of Referer is information on where real (human) visitors are coming from, to the extent that it's possible to find this out. Being read by people is the ultimate purpose of most web pages, which makes the places that are sources of traffic and active links of decided interest to us. And this sort of human behavior has very little to do with either robot behavior or what potential links exist out there in the world. Mixing either your robot's actions or a 'helpful' attempt to tell us about the latter into our Referer data is not doing us any favours; rather the contrary, in fact (this is one large reason that I react angrily to robots sending Referer).

(There is also the inconvenient fact that once you're operating a decent-sized site you're not likely to really care about who links to you, because there will be far too many links out there, most of them in increasingly obscure and unimportant places. The links you do care about are exactly the links that send you significant traffic.)
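For illustration, here is roughly how a site operator might count where human traffic comes from. This is a sketch under loud assumptions: the BOT_MARKERS list is a crude made-up heuristic, the input is simplified to (User-Agent, Referer) pairs instead of real log lines, and the sample data is invented.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumption: robots identify themselves with one of these substrings.
BOT_MARKERS = ("bot", "crawler", "spider")


def referrer_hosts(requests):
    """Count Referer hostnames for requests that do not look like robots."""
    counts = Counter()
    for user_agent, referer in requests:
        if not referer:
            continue  # no Referer sent, nothing to count
        if any(marker in user_agent.lower() for marker in BOT_MARKERS):
            continue  # self-identified robot; not human traffic
        counts[urlsplit(referer).hostname] += 1
    return counts


sample = [
    ("Mozilla/5.0 (X11; Linux x86_64) Firefox/21.0", "https://news.example/item?id=1"),
    ("Mozilla/5.0 (X11; Linux x86_64) Firefox/21.0", "https://news.example/item?id=1"),
    ("ExampleBot/0.1", "https://spam.example/links"),
    ("Mozilla/5.0 (X11; Linux x86_64) Firefox/21.0", ""),
]
print(referrer_hosts(sample))
```

Note that the filter only works when a robot admits to being one. A robot that sends a browser-like User-Agent along with a Referer ends up counted as a human visitor, which is exactly the kind of pollution that makes this data less useful.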

Written on 23 May 2013.
