Diffbot's bad Referer header
Today a web spider called 'Diffbot' (run by diffbot.com) made a whole
bunch of requests here, all of which failed. They failed because, just
as it has repeatedly done in the past, it made them all with a Referer
header of 'http://news.google.com/' and this behavior long ago led me
to ban it entirely from here.
There are a number of things wrong with this header. The first is that,
to steal from the old Trix commercials, 'silly robot, the Referer
header is for humans'. I've written about this before at some length, and
sending a Referer header from a spider here is generally a good way to get
your spider banned.
(I have a philosophical ramble about why this is the correct view, but it's going in another entry.)
The second is that, of course, this Referer value is a flaming lie in two
different ways. Diffbot in no way, shape, or form traveled from
news.google.com to the whole collection of URLs here that it attempted
to crawl with that Referer header, and on top of that, news.google.com
does not link to here at all. Diffbot made up the header from whole
cloth. I react very badly to web spiders that lie to me at the best of
times (even if they aren't spraying junk over my referer logs).
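As an illustration only, the first objection (spiders should not send a Referer at all) is easy to check mechanically. What follows is a minimal Python sketch under stated assumptions, not a description of what actually runs here: the names SPIDER_UA_MARKERS, looks_like_spider and should_reject are hypothetical, and recognizing spiders by substrings of the User-Agent is a crude stand-in for whatever identification a real site uses.

```python
from typing import Optional

# Hypothetical markers: User-Agent substrings treated as "this is a spider".
# A real site would use its own, more careful, identification.
SPIDER_UA_MARKERS = ("bot", "spider", "crawler")


def looks_like_spider(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent contains one of the assumed markers."""
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in SPIDER_UA_MARKERS)


def should_reject(user_agent: Optional[str], referer: Optional[str]) -> bool:
    """Reject a request from an apparent spider that sends any Referer.

    This does not try to verify whether the Referer is truthful; under the
    'Referer is for humans' rule, a spider sending one at all is enough.
    """
    has_referer = bool((referer or "").strip())
    return looks_like_spider(user_agent) and has_referer
```

A request with a made-up User-Agent such as 'ExampleBot/1.0' and a Referer of 'http://news.google.com/' would be rejected by this sketch, while the same spider with no Referer, or an ordinary browser with one, would be let through.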
Diffbot and its operators may or may not be legitimate, or at least honest about what they're doing; I have no particular opinions on that. But they are unquestionably operating a web spider that routinely lies. I have no idea why and really, I don't care; I was doing them a favour by letting them crawl me and I can and will withdraw that favour if they irritate me.
(See also my technical requirements for web spiders and my standards for responsible spider behavior.)
(No, I haven't mailed Diffbot's operators about this behavior. Are you kidding? I'm neither crazy nor stupid. On today's Internet, mailing people about issues is for people that you actually trust.)