The scope of the peril of having a highly dynamic web site

November 15, 2005

In ADynamicSitePeril, I wrote about how dynamically generating various aspects of this blog means that WanderingThoughts has a lot of Atom feeds. Since MSN Search's aggressive fetching of several hundred of those feeds has been on my mind recently, I thought it would help to put some concrete numbers on this.

WanderingThoughts is built on top of a directory hierarchy, where each entry is a file and categories are subdirectories. Right now (before I post this entry), it has 10 directories (nine 'categories' and the top level), 186 entries, and four administrative files (two entry indexes, the recent comments index, and the sidebar).
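As a rough illustration of where numbers like these come from, here is a minimal sketch that counts the real directories and files under a blog root. The function name is my own invention for this example, and it simply counts every file; it makes no attempt to tell entries apart from administrative files.

```python
import os

def count_hierarchy(root):
    """Return (directories, files) under root.

    The root itself counts as one directory, matching the
    'nine categories and the top level' way of counting.
    """
    n_dirs = 0
    n_files = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        n_dirs += 1                # one directory per os.walk() step
        n_files += len(filenames)  # entries plus administrative files
    return n_dirs, n_files
```

Run against a hierarchy like the one described above, this should report 10 directories and 190 files (186 entries plus the four administrative files), assuming nothing else is stored there.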

A web spider that doesn't crawl through links marked 'nofollow' will see 375 virtual directories (each with an Atom feed). A web spider that crawls through all links will see 964 virtual directories, with the additional ones coming mostly from crawling all of the range-based 'previous N entries' links.

(It's a popular belief that marking links 'nofollow' means spiders will never crawl through them. Technically this is false; the original description just calls for the link to give no credit to its target. In practice, all of the common web spiders simply don't follow 'nofollow' links, so you can abuse 'nofollow' to keep spiders out of parts of your site.)
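To make the difference between the two kinds of spider concrete, here is a toy sketch of a crawl over an in-memory link graph. The URLs and the graph itself are made up for illustration; this is not how any real spider (or this blog's software) works, just a breadth-first walk that optionally skips 'nofollow' links.

```python
from collections import deque

def reachable(links, start, honor_nofollow=True):
    """Return the set of pages a spider starting at `start` would see.

    `links` maps a page to a list of (target, is_nofollow) pairs.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target, is_nofollow in links.get(page, []):
            if is_nofollow and honor_nofollow:
                continue  # a polite spider skips this link entirely
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical miniature site: one category plus range-based links
# that are all marked nofollow.
links = {
    "/blog/": [("/blog/python/", False), ("/blog/range/1-10/", True)],
    "/blog/python/": [("/blog/", False)],
    "/blog/range/1-10/": [("/blog/range/11-20/", True)],
}

polite = reachable(links, "/blog/")
greedy = reachable(links, "/blog/", honor_nofollow=False)
print(len(polite), len(greedy))
```

In this toy graph the polite spider sees two pages and the greedy one sees four; the real site shows the same effect on a larger scale (375 versus 964).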

Over the course of the last 28 days, MSNBot fetched 365 different Atom feeds in WanderingThoughts, with the most recently created feed dating from November 11th. Each new entry typically creates two new virtual directories, so the entries posted since then account for most of the gap between 365 and today's 375. In other words, MSNBot came very close to finding every non-nofollow virtual directory that existed as of last Friday.
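For the curious, the arithmetic behind 'very close' can be sketched as follows. It uses only the figures quoted in this entry; note that 375 is today's count while MSNBot's newest feed dates from November 11th, so the real coverage as of that day would be somewhat higher than this computes.

```python
polite_dirs = 375    # virtual directories visible without following nofollow links
all_dirs = 964       # virtual directories visible when following every link
msnbot_feeds = 365   # distinct Atom feeds MSNBot fetched in the last 28 days

coverage = msnbot_feeds / polite_dirs
missed = polite_dirs - msnbot_feeds
nofollow_only = all_dirs - polite_dirs

print(f"MSNBot coverage: {coverage:.1%} ({missed} directories missed)")
print(f"Directories reachable only via nofollow links: {nofollow_only}")
```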

Written on 15 November 2005.

Last modified: Tue Nov 15 01:15:42 2005