Wandering Thoughts archives

2014-03-02

Googlebot is now aggressively crawling syndication feeds

I'm not sure how long this has been going on (I only noticed it recently), but Googlebot, Google's search crawler, is now aggressively crawling syndication feeds. By 'aggressively crawling' I mean two things. First, it is fetching the feeds multiple times a day; one of my feeds was fetched 46 times in one 24-hour period. Second, and worse, it's not using conditional GET.

I've written before about why web spiders should not crawl syndication feeds, and I still believe everything I wrote back then (even though I've significantly reduced the number of feeds I advertise since those days). My feed URLs are all marked 'nofollow', a declaration that Googlebot generally respects. And even if Google were going to crawl syndication feeds, the minimum standard is implementing conditional GET instead of repeatedly spamming fetch requests; the latter is the kind of thing that gets you banned here.

I might very reluctantly accept Googlebot crawling a few syndication feed URLs if it properly implemented conditional GET. Then it might be a reasonable way to find updated content (although Googlebot accesses my sitemap much less frequently), and I'd passively go along with the 800-pound gorilla of search traffic. But without conditional GET, it's my strong opinion that this is abuse, plain and simple, and I have no interest in cooperating.
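(As an illustration of how little work conditional GET takes on the client side, here's a minimal sketch of a feed fetcher doing it with Python's standard urllib; the feed URL is made up, and a real fetcher would persist the ETag and Last-Modified values it gets back between runs.)

    import urllib.request
    import urllib.error

    FEED_URL = "https://example.org/blog/atom.xml"   # hypothetical feed URL

    def fetch_feed(etag=None, last_modified=None):
        # Send the validators saved from the previous successful fetch.
        req = urllib.request.Request(FEED_URL)
        if etag:
            req.add_header("If-None-Match", etag)
        if last_modified:
            req.add_header("If-Modified-Since", last_modified)
        try:
            with urllib.request.urlopen(req) as resp:
                # 200: the feed changed; remember the new validators.
                return (resp.read(), resp.headers.get("ETag"),
                        resp.headers.get("Last-Modified"))
        except urllib.error.HTTPError as err:
            if err.code == 304:
                # 304 Not Modified: nothing to transfer, reuse the cached copy.
                return None, etag, last_modified
            raise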

So, in short: I suggest that you check your syndication feed logs to see if Googlebot is pounding on them too and, if it is, block it from accessing those URLs. I doubt Google is going to change its behavior any time soon or even notice, but at least you can avoid donating your site's resources to an abusive crawler.
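(If you want to do the checking part programmatically, the following rough sketch counts Googlebot requests per feed URL in an Apache-style combined access log. The log path and the feed URL prefix are assumptions you'll need to adjust for your own site.)

    import collections
    import re

    LOGFILE = "/var/log/apache2/access.log"   # hypothetical log location
    FEED_PREFIX = "/blog/atom/"               # hypothetical feed URL prefix

    # Pull the request target out of a combined-format log line:
    # ... "GET /path HTTP/1.1" status size "referer" "user-agent"
    request_re = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[^"]*"')

    counts = collections.Counter()
    with open(LOGFILE, encoding="utf-8", errors="replace") as log:
        for line in log:
            if "Googlebot" not in line:
                continue
            m = request_re.search(line)
            if m and m.group(1).startswith(FEED_PREFIX):
                counts[m.group(1)] += 1

    for url, hits in counts.most_common():
        print(f"{hits:6d}  {url}")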

(As I expected, Googlebot is paying absolutely no attention to days of 403 responses on the feed URLs it's trying to fetch. It keeps on trying to fetch them at great volume, to the tune of 245 requests so far today for 11 different URLs.)

Sidebar: Some more details

First, this really is Googlebot; it comes from Google's IP address ranges and from specific IPs with crawl-*.googlebot.com reverse DNS, such as 66.249.66.130 (the usual double DNS check for verifying this is sketched at the end of this sidebar).

Second, in the past Googlebot has shown signs of supporting conditional GET on syndication feeds. I have historical logs that show Googlebot getting 304s on syndication feed URLs.

Third, based on historical logs I have for my personal website, this appears to have started happening there around January 13th. There are sporadic requests for feed URLs before then, but January 13th is when things light up with multiple requests a day.
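(Here is a small Python sketch of that double DNS check: reverse-resolve the IP, make sure the resulting name is under googlebot.com, then forward-resolve that name and confirm it maps back to the same IP. Accepting names under google.com as well is an assumption on my part.)

    import socket

    def is_googlebot(ip):
        try:
            host, _aliases, _addrs = socket.gethostbyaddr(ip)
        except socket.herror:
            return False
        # Assumption: accept both googlebot.com and google.com hostnames.
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            _name, _aliases, addrs = socket.gethostbyname_ex(host)
        except socket.gaierror:
            return False
        return ip in addrs

    print(is_googlebot("66.249.66.130"))   # the crawler IP seen in my logs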

web/GooglebotCrawlingFeeds written at 23:33:34

Cool URL fragments don't change either

I was all set to write an entry about how one limitation of dealing with site changes (from one domain to another, from HTTP to HTTPS, or just URL restructuring) via HTTP redirects was that URL fragments fell off during the redirections. Then I decided to actually check the end URLs I was getting and discovered that I was wrong. Browsers do preserve URL fragments during redirection (although you may not see this in the address bar if the full URL is long and the fragment is cut off). What was really going on in my case is that the site I was dealing with has violated a sub-rule of 'Cool URLs don't change'.

Simply stated, the sub-rule is 'URL fragments are part of the URL'. Let me rephrase that:

If you really care about cool URLs, you can't change any HTML anchors once you create them.

The name of the anchor must remain the same and the meaning (the place it links to) must also stay. This is actually a really high bar, probably an implausibly high one, since HTML anchors are often associated with very specific information that can easily become invalid or otherwise go away (or simply be broken out to a separate page when it becomes too big).

Note that this implies that simply numbering your HTML anchors in sequential order is a terrible thing to do unless you can guarantee that you're not going to introduce or remove any sections, subsections, and so on. It's much better to give them some sort of name. Effectively, the anchor name should be treated as a unique and stable permanent identifier. Again, this is a pretty high bar and is probably going to cause you heartburn if you try to really carry it out.

This somewhat tilts me towards the view that HTML anchors should be avoided. On the other hand, it's often easier to read one large page with lots of information (exactly the situation that calls for HTML anchors and an index at the top) than to keep clicking through a lot of small pages. Today's side moral is that web page design can be hard.

(I'd say that the right answer is small pages with some JavaScript so that one page seamlessly transitions to the next one as you read without you having to do anything, but even that's not a complete solution since you don't get things like 'search in (full) page'.)

I suppose what I really wish is that web servers got URL fragments in the URL of the HTTP request but normally ignored them. Then they could pay attention to the fragment identifier when doing redirects and do the right thing if they had any opinions about it. But this is a wish that's a couple of decades too late; I might as well wish for pervasive encryption while I'm at it.
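(To make the underlying problem concrete: the fragment is split off on the client side and never makes it into the HTTP request at all, which a quick bit of Python with urllib.parse shows; the URL here is made up.)

    from urllib.parse import urlsplit

    url = "https://example.org/notes/big-page#section-3"   # hypothetical URL
    parts = urlsplit(url)
    print(parts.path)       # '/notes/big-page'  -- all the server is asked for
    print(parts.fragment)   # 'section-3'        -- kept by the browser only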

web/CoolUrlFragments written at 01:59:45

