Wandering Thoughts archives


Why web spiders should not crawl syndication feeds

On the surface, crawling syndication feeds looks like an attractive idea for web spider operators (although I am not convinced that the metadata they get is on the whole any better than the metadata on web pages). But as things are today, it is a terrible idea and is highly likely to provoke bad reactions if attempted.

The big problem is that right now, if you turned a web spider loose on syndication feeds it would pull far too many of them. This is because people (and web sites) have lots of feeds that are either empty or contain overlapping content, and a web spider has no way to tell this before it fetches them. Pulling anyway is bad, because it puts a pointless burden on web sites: the spider gets nothing new out of it, but the web site is still forced to generate and send the data. And this is not just a theoretical issue, as feed 'over-pulling' has affected actual people with real web sites.

Or in short, there is currently no good way to do automated discovery of the syndication feeds that people actually want spiders to pull. Since there are clearly lots of feeds that are pointless to pull, and since the current default is not to pull feeds, changing the default to 'we pull feeds unless you tell us not to' is going to get bad reactions.

(I suspect the reaction of most people would be 'we refuse to mark up our websites so that you'll stop abusing us', followed by strategic additions to robots.txt.)
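As a rough illustration of what such a robots.txt addition means for a spider, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt content, the /feeds/ path, the example.org URLs and the 'ExampleSpider' user agent are all made up for the example; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a site might add to keep spiders off its feeds.
ROBOTS_TXT = """\
User-agent: *
Disallow: /feeds/
"""

def may_fetch(robots_txt, user_agent, url):
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

feed_allowed = may_fetch(ROBOTS_TXT, "ExampleSpider", "http://example.org/feeds/atom.xml")
page_allowed = may_fetch(ROBOTS_TXT, "ExampleSpider", "http://example.org/blog/some-entry")
```

A spider that honors this would skip everything under /feeds/ while still crawling ordinary pages, which is roughly the outcome the web site operator is after.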

An associated issue is that repeatedly pulling syndication feeds has additional requirements in order to not be antisocial, and web spiders have traditionally not done very well at following these requirements. Widespread repeated crawling of syndication feeds would make this even more irritating and painful for web site operators than it already is (especially since getting well-behaved web spiders is hard enough as it is).
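The entry does not spell out these requirements here. One that is commonly cited for feed readers is HTTP conditional GET: remember the ETag and Last-Modified values from the last fetch and send them back, so the server can answer '304 Not Modified' instead of regenerating and resending an unchanged feed. The following is only a sketch under that assumption; the function names and the opener parameter are hypothetical, and a real fetcher would also need sensible polling intervals and error handling.

```python
import urllib.error
import urllib.request

def conditional_headers(etag=None, last_modified=None):
    """Build request headers from the validators saved on the previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

def fetch_feed(url, etag=None, last_modified=None, opener=urllib.request.urlopen):
    """Fetch a feed, returning (body, etag, last_modified).

    body is None when the server says the feed has not changed (HTTP 304).
    opener is injectable only so the function can be exercised without a network.
    """
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified))
    try:
        with opener(request) as response:
            return (response.read(),
                    response.headers.get("ETag"),
                    response.headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Unchanged: keep the validators we already have.
            return None, etag, last_modified
        raise
```

The caller stores the returned ETag and Last-Modified values and passes them in on the next poll; a None body means there is nothing new to process.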

web/WhyNoFeedCrawling written at 00:10:48


