2005-11-15
The scope of the peril of having a highly dynamic web site
In ADynamicSitePeril, I wrote about how dynamically generating various aspects of this blog means that WanderingThoughts has a lot of Atom feeds. Since MSN Search's aggressive fetching of several hundred of those feeds has been on my mind recently, I thought it would help to put some concrete numbers on this.
WanderingThoughts is built on top of a directory hierarchy, where each entry is a file and categories are subdirectories. Right now (before I post this entry), it has 10 directories (nine 'categories' and the top level), 186 entries, and four administrative files (two entry indexes, the recent comments index, and the sidebar).
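As a rough illustration of where those baseline numbers come from, here is a minimal sketch that tallies a hierarchy laid out this way. It is not the blog's actual code: the function name `count_blog_tree` and the idea of recognizing administrative files by name are assumptions made for the example.

```python
import os

def count_blog_tree(root, admin_files=frozenset()):
    """Count directories, entries, and administrative files under root.

    Assumes every file is an entry unless its name appears in admin_files
    (a hypothetical way of identifying the administrative files).
    """
    dirs = entries = admin = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        dirs += 1  # the top level counts as a directory too
        for name in filenames:
            if name in admin_files:
                admin += 1
            else:
                entries += 1
    return dirs, entries, admin
```

Given the right set of administrative file names, running this over the real tree should presumably report the 10 directories, 186 entries, and four administrative files mentioned above.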
A web spider that doesn't crawl through links marked 'nofollow' will see 375 virtual directories (each with an Atom feed), far more than the 10 real ones. A web spider that crawls through all links will see 964 virtual directories, with the additional ones coming mostly from crawling all of the range-based 'previous N entries' links.
(It's a popular belief that marking links 'nofollow' means spiders will never crawl through them. Technically this is false; the original description just calls for the link to give no credit to the target. In practice, all of the common web spiders simply don't follow 'nofollow' links, so you can abuse 'nofollow' to steer spiders away from parts of a site.)
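To make the two spider behaviours concrete, here is a small sketch of link extraction that either skips or follows 'nofollow' links. It uses Python's standard `html.parser`; the names `LinkCollector` and `crawlable_links` are made up for the example, and this is not how any particular real spider is implemented.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, optionally skipping rel="nofollow"."""

    def __init__(self, follow_nofollow=False):
        super().__init__()
        self.follow_nofollow = follow_nofollow
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel is a space-separated token list; a bare 'rel' has value None.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens and not self.follow_nofollow:
            return
        self.links.append(href)

def crawlable_links(html, follow_nofollow=False):
    """Return the links a spider with the given policy would queue."""
    parser = LinkCollector(follow_nofollow)
    parser.feed(html)
    parser.close()
    return parser.links
```

A spider using the default policy corresponds to the 375-directory view of the site; one calling `crawlable_links(page, follow_nofollow=True)` corresponds to the 964-directory view.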
Over the course of the last 28 days, MSNBot fetched 365 different Atom feeds in WanderingThoughts, with the most recently created feed dating from November 11th. That is only 10 short of the 375 non-nofollow virtual directories that exist now. Since each entry typically creates two new virtual directories, much of that gap is plausibly down to entries posted since then, which means MSNBot came very close to finding every non-nofollow virtual directory that existed as of last Friday.
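The arithmetic behind 'very close' can be spelled out in a few lines. The three constants come from the figures in this entry; the number of entries posted since November 11th is not stated here, so the last value is only what the gap would imply if the two-directories-per-entry rule of thumb held exactly.

```python
NON_NOFOLLOW_DIRS = 375  # virtual directories visible without following 'nofollow'
FEEDS_FETCHED = 365      # distinct Atom feeds MSNBot fetched in 28 days
DIRS_PER_ENTRY = 2       # typical new virtual directories per entry

def coverage(fetched, total):
    """Fraction of the visible feeds that were fetched."""
    return fetched / total

gap = NON_NOFOLLOW_DIRS - FEEDS_FETCHED
entries_for_gap = gap / DIRS_PER_ENTRY  # implied, not a measured count

print(f"coverage: {coverage(FEEDS_FETCHED, NON_NOFOLLOW_DIRS):.1%}")  # 97.3%
print(f"gap: {gap} directories, about {entries_for_gap:g} entries' worth")
```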