Wandering Thoughts archives


One reason that it is so hard to challenge Google

Chris Linfoot has been reacting to Cuil, and in the process he shows one somewhat unexpected reason that it is so hard for a new search engine to challenge Google today. Not only do you have to build the necessary software and infrastructure, but it is crucial that your web spider be completely, utterly well behaved.

If your spider is not completely well behaved, people will notice, talk about it, and then block you. This leaves you with less data to build your search results on, which likely worsens the quality of those results, and that is a killer; if you cannot produce search results at least as good as Google's, no one is going to be very interested in you.

(That the sort of people who notice badly behaved spiders are the sort of technical people who might otherwise be your early adopters probably doesn't help, either. I know that I was predisposed against Cuil because of this issue.)

Now, I'd certainly hope that anyone who aspires to challenge Google can write a well behaved spider (and I'm not particularly sad that having one is a strong requirement). But bugs can and do happen, including in spiders, and this issue basically means that you need to have your spider more or less perfect all the time (or at least all the time that you are letting it crawl the net). I hope that you have a lot of internal tests for it, because you are going to need them.

To add to your challenges, I rather suspect that people don't often go back to reconsider robots.txt entries; once you're in, you're staying in unless something major happens. So a transient spider issue that you fixed in a week may have lingering effects for years (especially if 'block spider <X>, it is badly behaved' becomes a bit of folklore).

(Renaming your spider won't help, because then people will start jumping to the conclusion that you're trying to get around their robots.txt blocks, which is going to worsen your reputation all that much faster.)
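To make the mechanics concrete, here is a minimal sketch of how such a lingering block looks from the spider's side, using Python's standard urllib.robotparser. The spider names 'ExampleBot' and 'RenamedBot', the robots.txt contents, and the example.com URLs are all made up for illustration; real sites will have their own rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of the sort a site might leave in place for years
# after a spider's bug was fixed. "ExampleBot" is a made-up spider name.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` is allowed to fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The blocked spider is shut out of the entire site.
blocked = may_fetch(ROBOTS_TXT, "ExampleBot", "https://example.com/page.html")

# The same crawler under a new (hypothetical) name falls through to the
# wildcard rules, which is exactly why a rename looks like evasion.
renamed = may_fetch(ROBOTS_TXT, "RenamedBot", "https://example.com/page.html")
renamed_private = may_fetch(ROBOTS_TXT, "RenamedBot", "https://example.com/private/x")

print(blocked, renamed, renamed_private)
```

Since the block is keyed on nothing but the User-agent name, the entry keeps working for as long as nobody edits the file, whether or not the spider still deserves it.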

As a side note: it is hopefully obvious that it doesn't matter if Google's web spider is sometimes badly behaved, because people will cut Google slack that they will not cut anyone else as a natural consequence of Google's current dominance.

web/SpiderBehaviorChallenge written at 00:55:38

