A belated realization about web spiders and your page cache
Like a lot of other web applications, DWiki has various sorts of caching. One of its caching mechanisms is a simple brute force cache for full pages, intended to deal with Slashdot effect situations; if a page has taken 'too long' to generate it's put into the cache, and then further requests for it are served straight from the cache for a short interval.
Just today, I realized that much of what was getting put into the page cache was actually being inserted pointlessly.
Like almost any blog, WanderingThoughts has a lot of virtual pages. This means it has a lot of URLs for web spiders to explore. Because it has so many URLs for web spiders to walk compared to actual content, a significant amount of my total traffic is web spiders trying to explore everything that they can find. Even a vaguely competent web spider is basically never going to re-crawl the same URL within a few seconds or minutes, i.e. within the time interval where my simple page cache will do any good. The result is then straightforward: adding pages that spiders request to the page cache is pointless, because they will never be hit again, or at least not before their cache entry has expired.
Keeping spiders from contaminating your page cache is relatively simple. Because the largest contamination comes from the most active web spiders, you don't have to hunt down every spider active on your site; all you have to do is look at your user agent logs and then make your cache insertion code pass over requests from the most active crawlers that you see. Generally they will jump right out at you.
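A filter like that could look something like this sketch. The list of crawler signatures is hypothetical; in practice you would fill it in from whatever dominates your own user agent logs.

```python
# Hypothetical User-Agent substrings for the heaviest crawlers seen in
# the logs; these particular names are just illustrative examples.
SPIDER_SIGNATURES = ("Googlebot", "bingbot", "YandexBot")

def should_cache(user_agent):
    """Return True if a response for this User-Agent should go into the
    page cache, passing over requests from known heavy crawlers."""
    if not user_agent:
        return True  # no User-Agent header; assume it's cacheable
    ua = user_agent.lower()
    return not any(sig.lower() in ua for sig in SPIDER_SIGNATURES)
```

Note that this only skips cache *insertion*; spider requests can still be served from entries that real visitors caused to be cached, which is exactly what you want.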
(Extending this to caches of page components is much more chancy because the possibility of cross-page reuse is much higher.)