2007-11-26
The two things I can mean by 'web spider'
My problem with the term 'web spider' is that I wind up using it to mean two related but different things, mostly because I don't know of a good term for the second one.
One sense of the term is that a web spider is any reasonably autonomous program that navigates around the web, poking web URLs, regardless of what it is doing. If you automatically crawl a site and there is not a human sitting there supervising you, you're a web spider.
(By this definition, not all things that automatically fetch web pages are web spiders; the Google Web Accelerator is not, for example, since it does not crawl things autonomously.)
The other sense of the term is that a web spider is an autonomous program that is non-malicious, honest, and well behaved while it crawls the web. A web spider respects robots.txt, for example; in fact, part of the consensus definition of a web spider is that it does so. You could say that the core of this sort of web spider is that it wants to be socially acceptable; it wants to be a legitimate web spider.
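(As a rough illustration of the robots.txt part, here is a minimal Python sketch of what 'respects robots.txt' comes down to in code, using the standard library's urllib.robotparser; the user agent string and URLs are just placeholders, not anything real.)

    # A minimal sketch of respecting robots.txt with Python's standard
    # urllib.robotparser; the user agent and URLs are placeholders.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    url = "https://example.com/some/page"
    if rp.can_fetch("ExampleSpider/1.0", url):
        print("allowed to crawl", url)
    else:
        print("disallowed; a polite spider just skips it")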
My problem with using 'web spider' in the first sense is that it generally includes too much to be useful, because it takes in all sorts of malicious programs, things that have no interest in trying to be socially acceptable (in many cases because their goals are not considered socially acceptable, such as looking for things to spam). As a result, I tend to use 'web spider' in the second sense a fair bit without footnoting it, such as when I talked about my minimum standards for web spiders.
(There is no point in talking about how we want the whole of the first sense to behave, because the first sense includes programs that aren't going to.)
In other words, I mostly consider real web spiders to be programs that fall into the second sense of the term, and then I just drop the 'real' qualifier a lot.
2007-11-23
My web spider technical requirements
I have been looking at the web server logs lately, which is usually an activity that's calculated to put me into a grumpy mood. As a result of this, I have decided to write down in one place my technical requirements for web spiders, the direct things that I expect them to do and not do.
(This excludes issues of basic functionality, like correctly parsing HTML, resolving relative links, and making correct HTTP requests, and basic proper web spider behavior like respecting robots.txt.)
So, my minimum technical requirements for spiders:
- obey nofollow. This is not optional.
- do not send Referer: headers, especially ones chosen at random.
- do not issue GET requests for URLs you've only seen as POST targets.
- do not crawl binary files repeatedly, especially when they are large.
- do not crawl clearly marked syndication feeds, especially repeatedly.
(What I consider a binary file instead of web content is fuzzy, but my minimal list is things like ISO images, tar files, RPMs, and audio and video files. I do not care if you want to build an MP3 or ISO search engine; such files are too large to be pulled automatically.)
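(To make the list a bit more concrete, here is a rough Python sketch of the kind of pre-fetch checks it implies; the suffix lists and function names are my own illustrations, not anyone's actual crawler code.)

    # A rough sketch of the pre-fetch checks the list above implies;
    # the suffix lists and names are illustrative only, not exhaustive.
    from urllib.parse import urlsplit

    BINARY_SUFFIXES = (".iso", ".tar", ".tar.gz", ".tgz", ".rpm",
                       ".mp3", ".ogg", ".avi", ".mkv")
    FEED_SUFFIXES = (".rss", ".atom")

    def should_fetch(url, rel=None, seen_as_post_target=False):
        """Return True only if a polite spider should GET this URL."""
        if rel is not None and "nofollow" in rel.split():
            return False              # obey rel="nofollow" on the link
        if seen_as_post_target:
            return False              # never turn a POST target into a GET
        path = urlsplit(url).path.lower()
        if path.endswith(BINARY_SUFFIXES):
            return False              # skip large binary files entirely
        if path.endswith(FEED_SUFFIXES):
            return False              # skip marked syndication feeds
        return True

    def request_headers(user_agent):
        # identify yourself honestly and send no Referer header at all,
        # rather than making one up
        return {"User-Agent": user_agent}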
Not following this list means that I will block your spider out of hand the moment I spot you in my logs, and violating any of these sticks out like a sore thumb when I actually look. Following it doesn't mean that I'll actually like your spider, but that's another entry.
(So many spiders do not obey nofollow that I usually only do partial blocks for that.)
Disclaimer: I reserve the right to amend this list as more things occur to me, probably as the result of seeing yet more badly behaved web spiders.
2007-11-01
Dynamic rendering versus static rendering for websites
One of the long-standing splits for dynamic websites is between dynamic content rendering and static content rendering (where you use tricks like 404 handlers to generate a page's content only once, write it to a static file, and then afterwards just serve the static file).
(This assumes that static rendering is viable; for heavily personalized sites it may be more trouble than it's worth, if you are already effectively never generating the same page twice anyways.)
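(For concreteness, here is a minimal sketch of the 404-handler trick in Python; render_page() and the paths are hypothetical stand-ins for whatever your site actually uses to generate a page's HTML.)

    # A minimal sketch of the 404-handler approach to static rendering.
    # render_page() and STATIC_ROOT are hypothetical stand-ins.
    import os

    STATIC_ROOT = "/var/www/rendered"

    def render_page(url_path):
        # stand-in for whatever actually generates the page's HTML
        return "<html><body>rendered for %s</body></html>" % url_path

    def handle_missing(url_path):
        """Called from the web server's 404 handler when no static file exists."""
        html = render_page(url_path)          # generate the content once
        target = os.path.join(STATIC_ROOT, url_path.lstrip("/"), "index.html")
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "w") as f:
            f.write(html)                     # future requests are served from disk
        # serve the freshly rendered page for this one request; after this
        # the web server finds the file itself and never calls us again.
        return html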
Fundamentally the choice is a tradeoff of programmer time versus performance. A fully dynamic system behaves worse, requiring more resources to handle the same volume, but you don't have to deal with cache invalidation (and its close cousin, concurrent cache refill). Often this tradeoff is worth it, especially if you can shim in a simple caching layer to improve performance if you turn out to need it; most websites will never be hit with all that high a load, especially a load high enough that programmer time is cheaper than more hardware.
(The problem of cache invalidation in static rendering is that you need to keep track of all of the automatically generated static pages that your update may invalidate. For some websites, this is pretty easy; for others, it may be quite difficult. This may argue that websites, like programs, benefit from avoiding too many interconnections between their bits.)
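(As a sketch of the bookkeeping involved, assuming you can list which source items each page used when you rendered it; the names and data structures here are just illustrations.)

    # A sketch of the bookkeeping that cache invalidation requires for
    # static rendering; names and structures are illustrative only.
    import os
    from collections import defaultdict

    # source item -> set of rendered static files that used it
    dependents = defaultdict(set)

    def record_render(static_path, sources):
        """Remember which source items a just-rendered page depended on."""
        for src in sources:
            dependents[src].add(static_path)

    def invalidate(changed_source):
        """Remove every rendered page the changed item fed into."""
        for static_path in dependents.pop(changed_source, set()):
            try:
                os.unlink(static_path)    # the next request re-renders it
            except FileNotFoundError:
                pass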
Although I like the simplicity of dynamic rendering for my own work, I find the static rendering approach neat and cool, and in some ways it feels like the right thing to do (especially the extreme versions that pre-render your entire website). Plus, of course, it's got great performance.
(And it doesn't feel like cheating the way other sorts of caching sometimes do, partly because at the extreme it's no more a cache than make is. If anything, doing dynamic rendering of largely static websites is what feels like cheating, in that if I was smart and clever enough I could do it all with make and a static rendering engine; I do it dynamically because it's easier and I'm lazy.)