Googlebot and Feedfetcher are still aggressively grabbing syndication feeds

July 4, 2015

Somewhat more than a year ago I wrote about how I'd detected Googlebot aggressively crawling my syndication feeds, despite them being marked as 'stay away'. At the time I was contacted by someone from Google about this and forwarded various information about it.

Well, you can probably guess what happened next: nothing. It is now more than a year later and Googlebot is still determinedly pounding away at my syndication feed. It made 25 requests for it yesterday, all of which got 403s as a result of me blocking it back then. That is typical; Googlebot is still trying on the order of 25 times a day despite getting 403s on every request for this URL for literally more than a year.

(At least it seems to be down to only trying to fetch one feed URL.)

Also, since I was looking anyway: what is now more than a year and a half ago, I discovered that Google Feedfetcher was still fetching feeds, and as a result I blocked it. Well, that's still happening too. Based on the last 30 days or so, Google Feedfetcher is making anywhere between four and ten attempts a day. And yes, that's despite getting 403s for more than a year and a half. Apparently those don't really discourage Google's crawling activities if Google really wants your data.
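(For anyone who wants similar numbers from their own logs, here is a minimal Python sketch. It assumes the web server writes the usual 'combined' log format and that the crawlers can be picked out by the substrings `googlebot` and `feedfetcher` in the User-Agent. Both are assumptions, `count_crawler_hits` is a made-up name, and this is not a description of how the counts above were actually produced.)

```python
import re
from collections import Counter

# Assumed Apache/nginx "combined" log format; adjust for other formats.
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Assumed User-Agent substrings, matched case-insensitively.
CRAWLERS = {"Googlebot": "googlebot", "Feedfetcher": "feedfetcher"}


def count_crawler_hits(lines):
    """Count requests per (day, crawler, HTTP status) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # skip lines that are not in the assumed format
        agent = m.group("agent").lower()
        for name, needle in CRAWLERS.items():
            if needle in agent:
                counts[(m.group("day"), name, m.group("status"))] += 1
    return counts
```

Feeding it an open log file (or `sys.stdin`) and printing the sorted items gives a per-day table of how many requests each crawler made and what status codes it got back.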

I'd like to say that I'm surprised, but I'm not in the least. Google long ago stopped caring about being a good Internet citizen, regardless of what its propaganda may say. These days the only reason to tolerate it and its behavior is because you have no choice.

(As far as I can tell it remains the 800 pound gorilla of search traffic, although various things make it much harder for me to tell these days.)

Sidebar: The grumpy crazy idea of useless random content

If I was a real crazy person, it would be awfully tempting to divert Google's feed requests to something that fed them an endless or at least very large reply. It would probably want to be machine-generated, valid Atom feed entries full of more or less random content. There are of course all sorts of tricks that could be played here, like embedding honeypot URLs that point to a special web server and seeing if Google shows up to crawl them.

I don't care enough to do this, though. I have other fish to fry in my life, even if this stuff makes me very grumpy when I wind up looking at it.
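(For the curious, the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the word list, and the `http://honeypot.example/` base URL are all invented for the example, the output is meant to be well-formed Atom but has not been run through a feed validator, and nothing here is wired up to a web server.)

```python
import random
import uuid
from datetime import datetime, timezone
from xml.sax.saxutils import escape, quoteattr

# Hypothetical vocabulary for the filler text.
WORDS = ["feed", "crawler", "entry", "random", "syndication", "gorilla", "wiki"]


def _now():
    """Current UTC time in the RFC 3339 form that Atom expects."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def random_text(rng, n_words):
    return " ".join(rng.choice(WORDS) for _ in range(n_words))


def atom_entry(rng, honeypot_base):
    """Build one Atom entry with random text and a unique honeypot link."""
    entry_id = uuid.UUID(int=rng.getrandbits(128), version=4).urn
    link = honeypot_base + "%016x" % rng.getrandbits(64)
    return (
        "  <entry>\n"
        f"    <title>{escape(random_text(rng, 4))}</title>\n"
        f"    <id>{entry_id}</id>\n"
        f"    <updated>{_now()}</updated>\n"
        f"    <link rel=\"alternate\" href={quoteattr(link)}/>\n"
        f"    <content type=\"text\">{escape(random_text(rng, 40))}</content>\n"
        "  </entry>\n"
    )


def atom_feed_chunks(rng, feed_url, honeypot_base, max_entries=None):
    """Yield an Atom feed piece by piece; never ends if max_entries is None."""
    yield (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        '<feed xmlns="http://www.w3.org/2005/Atom">\n'
        "  <title>Useless random content</title>\n"
        f"  <id>{escape(feed_url)}</id>\n"
        f"  <updated>{_now()}</updated>\n"
        "  <author><name>nobody</name></author>\n"
    )
    produced = 0
    while max_entries is None or produced < max_entries:
        yield atom_entry(rng, honeypot_base)
        produced += 1
    yield "</feed>\n"
```

Because it is a generator, a server could stream the chunks for as long as the client keeps reading. The honeypot part works because each entry links to a unique URL that appears nowhere else, so any later request for one of those URLs can only have come from someone who read the fake feed.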

Written on 04 July 2015.

Last modified: Sat Jul 4 00:56:23 2015