2007-11-26
The two things I can mean by 'web spider'
My problem with the term 'web spider' is that I wind up using it to mean two related but different things, mostly because I don't know of a good term for the second one.
One sense of the term is that a web spider is any reasonably autonomous program that navigates around the web, poking web URLs, regardless of what it is doing. If you automatically crawl a site and there is not a human sitting there supervising you, you're a web spider.
(By this definition, not all things that automatically fetch web pages are web spiders; the Google Web Accelerator is not, for example, since it does not crawl things autonomously.)
The other sense of the term is that a web spider is an autonomous program that is non-malicious, honest, and well behaved while it crawls the web. A web spider respects robots.txt, for example; in fact, part of the consensus definition of a web spider is that it does so. You could say that the core of this sort of web spider is that it wants to be socially acceptable; it wants to be a legitimate web spider.
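(As an illustration of the 'well behaved' part, here is a minimal sketch of what respecting robots.txt can look like in practice. This is just Python's standard urllib.robotparser module; the user agent name and example URL are made up for illustration.)

    # Sketch: a polite spider checks robots.txt before fetching a URL.
    from urllib import robotparser
    from urllib.parse import urlparse, urlunparse

    USER_AGENT = "ExampleSpider/1.0"   # hypothetical crawler name

    def allowed_to_fetch(url):
        """Return True if the site's robots.txt permits USER_AGENT to fetch url."""
        parts = urlparse(url)
        robots_url = urlunparse((parts.scheme, parts.netloc, "/robots.txt", "", "", ""))
        rp = robotparser.RobotFileParser()
        rp.set_url(robots_url)
        rp.read()               # fetch and parse the site's robots.txt
        return rp.can_fetch(USER_AGENT, url)

    # Example check against a hypothetical URL:
    print(allowed_to_fetch("https://example.com/some/page"))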
My problem with using 'web spider' in the first sense is that I feel it includes too much to be useful, because it includes all sorts of malicious programs, things that have no interest in trying to be socially acceptable (in many cases because their goals are not considered socially acceptable, such as looking for things to spam). As a result, I tend to use 'web spider' in the second sense a fair bit without footnoting it, such as when I talked about my minimum standards for web spiders.
(There is no point in talking about how we want the whole of the first sense to behave, because the first sense includes programs that aren't going to behave no matter what we want.)
In other words, I mostly consider real web spiders to be programs that fall into the second sense of the term, and then I just drop the 'real' qualifier a lot.