The limits of web spider tolerance

January 20, 2006

I have an important message for web spider operators: our generosity is not unlimited. In fact, it's probably running out.

There are a lot of web spiders out there, and many of these spiders don't seem to be offering anything for free to the public. When you crawl to build a private index, you're building a business in part off our resources, which you are using for free, and there is very little in this for us. To put it plainly, such spider operators are parasites that are counting on us to not really notice their spider-bites.

Like most websites, we've got a thick skin and large reserves of generosity. But those reserves are not unlimited, and for some people they have already run out. Moreover, I believe that being a parasite is not a viable long-term strategy (and it's certainly not a good way to make people like you).

If you are considering a parasitic spider business today, do the honest and simple thing: buy access to Alexa's data. (If you can't afford this, how on earth are you going to afford the infrastructure to do decent web crawling?)

If you believe you have a non-parasitic spider business, you'd better have a clear and compelling explanation of what's in it for us. What do we, or the general public, get out of letting you consume our resources?

(For a hair-raising list of web spiders and their apparent purposes, see Edith Frost's Spot the Bot entry.)



Last modified: Fri Jan 20 00:57:41 2006
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.