Googlebot and Feedfetcher are still aggressively grabbing syndication feeds

July 4, 2015

Somewhat more than a year ago I wrote about how I'd detected Googlebot aggressively crawling my syndication feeds, despite them being marked as 'stay away'. At the time I was contacted by someone from Google about this and forwarded various information about it.

Well, you can probably guess what happened next: nothing. It is now more than a year later and Googlebot is still determinedly pounding away at my syndication feed. Yesterday it made 25 requests for it, all of which got 403s as a result of me blocking it back then. That's its normal rate; Googlebot is still trying on the order of 25 times a day, despite having gotten 403s on every request for this URL for literally more than a year.

(At least it seems to be down to only trying to fetch one feed URL.)

Also, since I was looking: what is now more than a year and a half ago, I discovered that Google Feedfetcher was still fetching feeds, and as a result I blocked it. Well, that's still happening too. Based on the last 30 days or so, Google Feedfetcher is making anywhere between four and ten attempts a day. And yes, that's despite getting 403s for more than a year and a half. Apparently 403s don't really discourage Google's crawling activities if Google really wants your data.
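For anyone who wants to check their own logs for the same pattern, here is a minimal sketch of how such per-day counts could be produced. It is not how the numbers above were obtained. It assumes an Apache-style "combined" access log format, and the User-Agent substrings in CRAWLERS and the function name count_blocked are illustrative choices, not anything taken from the logs discussed here.

```python
import re
from collections import Counter

# Assumed Apache/nginx "combined" log format; adjust for other formats.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Substrings to look for in the User-Agent header (an assumption;
# check what actually shows up in your own logs).
CRAWLERS = ("Googlebot", "Feedfetcher-Google")


def count_blocked(lines, status="403"):
    """Count requests per (day, crawler) that got the given HTTP status."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m is None or m.group("status") != status:
            continue  # unparseable line, or not a blocked request
        for name in CRAWLERS:
            if name in m.group("agent"):
                counts[(m.group("day"), name)] += 1
                break
    return counts
```

A typical use would be to open the access log, pass the file object to count_blocked(), and print the sorted items of the result. The substring match is case-sensitive, so agents that merely claim to be "like FeedFetcher-Google" are not counted.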

I'd like to say that I'm surprised, but I'm not in the least. Google long ago stopped caring about being a good Internet citizen, regardless of what its propaganda may say. These days the only reason to tolerate it and its behavior is that you have no choice.

(As far as I can tell it remains the 800 pound gorilla of search traffic, although various things make it much harder for me to tell these days.)

Sidebar: The grumpy crazy idea of useless random content

If I were a real crazy person, it would be awfully tempting to divert Google's feed requests to something that fed them an endless or at least very large reply. It would probably want to be machine-generated valid Atom feed entries full of more or less random content. There are of course all sorts of tricks that could be played here, like embedding honeypot URLs on a special web server and seeing if Google shows up to crawl them.
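Purely to illustrate the idea, here is a rough sketch of what generating such a reply might look like. Everything in it is an assumption made for the example: the word list, the function names and the choice to emit a large but finite feed (so that it is still well-formed XML) rather than a truly endless one. It does not attempt the honeypot URL trick.

```python
import random
import uuid
from datetime import datetime, timezone
from xml.sax.saxutils import escape

# Hypothetical vocabulary for the junk content.
WORDS = "feed crawl robot entry random noise atom fetch reply endless".split()


def random_text(rng, n_words):
    """Return n_words randomly chosen words joined by spaces."""
    return " ".join(rng.choice(WORDS) for _ in range(n_words))


def random_id(rng):
    """Return a random urn:uuid identifier for a feed or entry."""
    return "urn:uuid:%s" % uuid.UUID(int=rng.getrandbits(128), version=4)


def junk_atom_feed(n_entries, seed=None):
    """Yield an Atom feed piece by piece, with n_entries junk entries.

    Because this is a generator, memory use stays constant no matter
    how large n_entries is.
    """
    rng = random.Random(seed)
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    yield '<?xml version="1.0" encoding="utf-8"?>\n'
    yield '<feed xmlns="http://www.w3.org/2005/Atom">\n'
    yield "<title>%s</title>\n" % escape(random_text(rng, 3))
    yield "<id>%s</id>\n" % random_id(rng)
    yield "<updated>%s</updated>\n" % now
    yield "<author><name>nobody</name></author>\n"
    for _ in range(n_entries):
        yield (
            "<entry><title>%s</title><id>%s</id><updated>%s</updated>"
            '<content type="text">%s</content></entry>\n'
            % (
                escape(random_text(rng, 4)),
                random_id(rng),
                now,
                escape(random_text(rng, 50)),
            )
        )
    yield "</feed>\n"
```

The pieces would be written to the response one at a time, so even a call like junk_atom_feed(10**7) never has to hold the whole feed in memory.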

I don't care enough to do this, though. I have other fish to fry in my life, even if this stuff makes me very grumpy when I wind up looking at it.


Comments on this page:

By Twirrim at 2015-07-04 12:20:27:

Depending on how much you care about Google you could automatically take the origin IP address and drop it into an ipset. Then hook that into a REJECT or DROP rule. The less you care about Google, the wider the subnet you block, starting with a /24 and working up to a /8 :)
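(Editorial sketch, not part of the comment: one way the subnet-widening step might be done. The set name "feedblock" and the function names are hypothetical, and the setup commands in the comments are an assumption about a typical ipset and iptables configuration.)

```python
import ipaddress

# One-time setup, assumed to be done by hand outside this script:
#   ipset create feedblock hash:net
#   iptables -I INPUT -m set --match-set feedblock src -j DROP


def widen(ip, prefix_len):
    """Return the network of the given prefix length that contains ip."""
    return ipaddress.ip_network("%s/%d" % (ip, prefix_len), strict=False)


def ipset_command(ip, prefix_len=24, set_name="feedblock"):
    """Build the ipset command that would block ip's surrounding subnet."""
    return "ipset add %s %s" % (set_name, widen(ip, prefix_len))
```

For example, ipset_command("192.0.2.10") gives "ipset add feedblock 192.0.2.0/24", while passing prefix_len=8 gives "ipset add feedblock 192.0.0.0/8".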

By Twirrim at 2015-07-04 12:58:08:

Taking a look through my own logs, looks like the web browser "Let's pretend we're something we're not" infection has spread:

"Feedly/1.0 (+http://www.feedly.com/fetcher.html; like FeedFetcher-Google)"

I did somewhat hope we were past that. I'm not sure how many sites even pay that much attention to the user agent string any more.

It does look like I'm seeing "Tiny Tiny RSS" from one IP address bounce off my RSS feed a remarkable amount, roughly every 10-15 minutes. Note that I'm behind Cloudflare, though, and they might actually be hiding requests from me.

Written on 04 July 2015.

Last modified: Sat Jul 4 00:56:23 2015