Why web spiders should not crawl syndication feeds

May 29, 2008

On the surface, crawling syndication feeds looks like an attractive idea for web spider operators (although I am not convinced that the metadata they get is on the whole any better than the metadata on web pages). But as things are today, it is a terrible idea and is highly likely to provoke bad reactions if attempted.

The big problem is that right now, if you turned a web spider loose on syndication feeds it would pull far too many of them. This is because people (and websites) have lots of feeds that are either empty or contain overlapping content, but a web spider has no way to tell this beforehand. Pulling them anyway is bad, because it puts a pointless burden on web sites: the spider gets nothing new out of it, but the web site is forced to generate and send the data regardless. And this is not just a theoretical issue, as feed 'over-pulling' has affected actual people with real websites.

Or in short, there is currently no good way to do automated discovery of the syndication feeds that people actually want spiders to pull. Since there are clearly lots of feeds that are pointless to pull, and since the current default is not to pull feeds, changing the default to 'we pull feeds unless you tell us not to' is going to get bad reactions.

(I suspect the reaction of most people would be 'we refuse to mark up our websites so that you'll stop abusing us', followed by strategic additions to robots.txt.)

An associated issue is that repeatedly pulling syndication feeds comes with additional requirements if you want to avoid being antisocial (such as honoring HTTP conditional GET and polling at a reasonable rate), and web spiders have traditionally not done very well at following these requirements. Widespread repeated crawling of syndication feeds would make an already irritating situation even more painful for web site operators, especially since getting well behaved web spiders is hard enough as it is.
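(The core of those requirements is HTTP conditional GET: a polite feed fetcher remembers the ETag and Last-Modified values from its previous fetch and sends them back, so the server can answer 304 Not Modified instead of regenerating and resending the whole feed. A minimal sketch of the bookkeeping, assuming a fetcher that keeps a small per-feed state dictionary; the function names here are illustrative, not from any particular library:)

```python
def conditional_headers(state):
    """Build request headers for a polite conditional feed fetch,
    using validators saved from the previous response (if any)."""
    headers = {"User-Agent": "example-feed-fetcher/0.1"}
    if state.get("etag"):
        headers["If-None-Match"] = state["etag"]
    if state.get("last_modified"):
        headers["If-Modified-Since"] = state["last_modified"]
    return headers

def update_state(state, status, response_headers):
    """Update the saved validators after a fetch."""
    # On 304 Not Modified the server sent no new body; keep the old
    # validators and reuse them on the next poll.
    if status == 304:
        return state
    # Otherwise remember the new validators for the next poll.
    return {
        "etag": response_headers.get("ETag"),
        "last_modified": response_headers.get("Last-Modified"),
    }
```

(A fetcher that skips this bookkeeping forces the site to regenerate the full feed on every poll, which is exactly the burden described above.)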


Comments on this page:

From 207.108.209.160 at 2008-05-29 14:14:39:

If a set of crawlers defaults to not hitting your feeds, you can then specify that they should crawl one with an XML Sitemap - http://www.sitemaps.org/. Most of the major search engines honor them and will check for ones listed in your robots.txt. Unfortunately I don't think there's a way to get a well-behaved robot to ignore everything but a specific set of URLs without a lot of robots.txt entries covering specific files and folders; there's no obvious exception mechanism. The only big search engine that seems to be borderline-badly behaved is Ask.com: despite a robots.txt disallowing a large set of now-bad URLs, they're continuing to try to hit them, apparently in an attempt to verify that they're gone.

- Kate
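(As an illustration of the mechanism the comment describes, a robots.txt that blocks feed URLs while advertising a sitemap might look like the following; the paths and the sitemap URL are hypothetical:)

```
User-agent: *
Disallow: /blog/feeds/

Sitemap: http://www.example.com/sitemap.xml
```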



Last modified: Thu May 29 00:10:48 2008