Wandering Thoughts archives


The limits of web spider tolerance

I have an important message for web spider operators: our generosity is not unlimited. In fact, it's probably running out.

There are a lot of web spiders out there, and many of them don't seem to offer anything back to the public. When you crawl to build a private index, you're building a business partly on our resources, which you are using for free, and there is very little in it for us. To put it plainly, such spider operators are parasites that are counting on us to not really notice their spider-bites.

Like most websites, we've got a thick skin and large reserves of generosity. But it's not unlimited, and it's already worn out for some people. Moreover, I believe that being a parasite is not a good way to be viable in the long term (and it's certainly not a good way to make people like you).
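(When that generosity runs out, the first and simplest response is a robots.txt rule. As a minimal sketch, assuming a hypothetical crawler that identifies itself as "ExampleBot":

```
# Hypothetical crawler name for illustration; real
# user-agent strings vary by spider.
User-agent: ExampleBot
Disallow: /

# Everyone else remains welcome.
User-agent: *
Disallow:
```

Of course, this only works against spiders polite enough to honor robots.txt; the truly parasitic ones get blocked at the server instead.)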

If you are considering a parasitic spider business today, do the honest and simple thing: buy access to Alexa's data. (If you can't afford this, how on earth are you going to afford the infrastructure to do decent web crawling?)

If you believe you have a non-parasitic spider business, you'd better have a clear and compelling explanation of what's in it for us. What do we, or the general public, get out of letting you consume our resources?

(For a hair-raising list of web spiders and their apparent purposes, see Edith Frost's Spot the Bot entry.)

web/SpiderToleranceLimits written at 00:57:41

