May 21, 2013

Today a web spider called 'Diffbot' (run by diffbot.com) made a whole bunch of requests here, all of which failed. They failed because, just as it has repeatedly done in the past, it made them all with a Referer header of 'http://news.google.com/', and this behavior long ago led me to ban it entirely from here.

There are a number of things wrong with this header. The first is that, to steal from the old Trix commercials, 'silly robot, the Referer header is for humans'. I've written about this before at some length, and a spider sending Referer headers here is generally a good way to get that spider banned.

(I have a philosophical ramble about why this is the correct view, but it's going in another entry.)

The second is that, of course, this Referer value is a flaming lie in two different ways. Diffbot in no way, shape, or form traveled from news.google.com to the whole collection of URLs here that it attempted to crawl with that Referer header, and on top of that, news.google.com does not link to here at all. Diffbot made up the header from whole cloth. I react very badly to web spiders that lie to me at the best of times (even if they aren't spraying junk over my referer logs).
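(For illustration only, here is a minimal sketch of the general kind of check that catches this. It is not the actual rule set in use here; the `ROBOT_TOKENS` list, the function names, and the 'ExampleBot' User-Agent are all made up for the example. The idea is simply that a client which identifies itself as a robot and still sends a Referer header gets refused.)

```python
# Hypothetical substrings that mark a self-identified crawler.
# These are assumptions for the example, not a real ban list.
ROBOT_TOKENS = ("bot", "spider", "crawler")


def looks_like_robot(user_agent: str) -> bool:
    """Return True if the User-Agent contains any of the robot tokens."""
    ua = user_agent.lower()
    return any(token in ua for token in ROBOT_TOKENS)


def should_reject(headers: dict) -> bool:
    """Reject a self-identified robot that sends any Referer at all.

    Header names are compared case-insensitively, since HTTP header
    field names are case-insensitive.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    user_agent = lowered.get("user-agent", "")
    referer = lowered.get("referer", "").strip()
    return looks_like_robot(user_agent) and bool(referer)


if __name__ == "__main__":
    # A made-up crawler claiming to have come from news.google.com.
    request = {
        "User-Agent": "ExampleBot/1.0 (+http://crawler.example/)",
        "Referer": "http://news.google.com/",
    }
    print(should_reject(request))
```

(A check like this only looks at whether a robot sent a Referer at all. It makes no attempt to work out whether the claimed referring page really links here, which would be a much harder problem.)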

Diffbot and its operators may or may not be legitimate, or at least honest about what they're doing; I have no particular opinions on that. But they are unquestionably operating a web spider that routinely lies. I have no idea why and really, I don't care; I was doing them a favour by letting them crawl me and I can and will withdraw that favour if they irritate me.

(See also my technical requirements for web spiders and my standards for responsible spider behavior.)

(No, I haven't mailed Diffbot's operators about this behavior. Are you kidding? I'm neither crazy nor stupid. On today's Internet, mailing people about issues is for people that you actually trust.)

Written on 21 May 2013.
Last modified: Tue May 21 23:20:49 2013
