A peril of having a highly dynamic web site

September 26, 2005

WanderingThoughts, this blog, is built on top of DWiki, my sprawling wiki-oid program. DWiki is layered on top of a normal Unix directory hierarchy and works by offering different ways of looking at it (called 'views' in DWiki terminology); all of DWiki's support for blogs is implemented with general features that can be used on any directory. The only difference between WanderingThoughts and, say, the DWiki help area is that WanderingThoughts is set up to default to the 'blog' view and has some template skinning to add the sidebar.

(This way I could tell myself I was just writing some small additional features for my existing program, instead of yet another blogging system.)

Other things are also done as general features. The calendar and range-based blog navigation is built from 'virtual directories' that can be applied to any real directory. Atom syndication feeds are just another view of a directory hierarchy, any directory hierarchy. Because these general features compose, it's trivial to do things like get an Atom feed of the five most recently changed pages in all of CSpace; just tack on '/latest/5/' to the root CSpace URL of /~cks/space/, then add '?atom' to select the Atom syndication view, and it all works.
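As a rough illustration of that composition, here is a minimal Python sketch that builds such a URL. The helper name latest_atom_url is hypothetical and is not part of DWiki; only the '/latest/N/' virtual directory and the '?atom' view selector come from the description above.

```python
from urllib.parse import urlsplit, urlunsplit


def latest_atom_url(root_url, count):
    """Return the Atom view of the `count` most recently changed pages.

    Hypothetical helper: it only glues together the two URL pieces
    described in the text, a '/latest/N/' virtual directory and the
    '?atom' view selector.
    """
    parts = urlsplit(root_url)
    # Append the virtual directory to whatever real directory we were given.
    path = parts.path.rstrip("/") + f"/latest/{count}/"
    # The view is chosen by a bare query string with no '=value' part.
    return urlunsplit((parts.scheme, parts.netloc, path, "atom", ""))


print(latest_atom_url("/~cks/space/", 5))
```

The same function applies to any directory under the root, which is the point of the generality: nothing in it is specific to the blog.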

All of this is very general and dynamic (since everything is generated on the fly). And therein lies our peril, right at the intersection of all of these dynamic website features. Follow along:

  1. Every directory has an Atom feed.
  2. Blog calendar navigation creates lots of links to lots of (virtual) directories.
  3. Every page with a regular Atom feed has automatic feed discovery enabled, because this is the friendly thing to do.

Between the top-level directory and the category subdirectories, multiplied by the day, month, and year pages for every day with posts, WanderingThoughts probably has thousands of (virtual) subdirectories. Each of these directories has its own Atom syndication feed, each of which can be autodiscovered by anything that crawls CSpace through those handy links.
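To see how the multiplication gets out of hand, here is a back-of-the-envelope sketch in Python. It assumes (this is an assumption, not a description of DWiki's internals) that every real directory gets one day, one month, and one year page for each distinct day, month, and year that has posts. The dates and directory count in the example are made up for illustration, and the result is an upper bound, since not every category has posts on every day.

```python
from datetime import date


def count_feed_directories(post_dates, n_real_directories):
    """Upper bound on directories (real plus calendar) that carry a feed."""
    days = set(post_dates)
    months = {(d.year, d.month) for d in days}
    years = {d.year for d in days}
    # Each real directory contributes itself plus its calendar pages.
    per_directory = 1 + len(days) + len(months) + len(years)
    return n_real_directories * per_directory


# Illustrative numbers only: three posting days across two months,
# and four real directories (top level plus three categories).
dates = [date(2005, 9, 24), date(2005, 9, 26), date(2005, 8, 1)]
print(count_feed_directories(dates, 4))
```

With a few hundred posting days and a handful of categories, the same arithmetic lands in the thousands.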

Boy, I hope that any crawlers doing that are smart enough to realize they have a bunch of duplicate feeds.
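What "smart enough" might look like on the crawler side is sketched below in Python: remember every entry ID seen so far, and treat a feed as redundant if it offers nothing new. This is only a sketch under assumptions. It assumes Atom 1.0 feeds using the http://www.w3.org/2005/Atom namespace, the function names are hypothetical, and it is not a description of how any real crawler behaves.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"  # assumes Atom 1.0 feeds


def entry_ids(atom_xml):
    """Return the set of <id> values of all entries in an Atom document."""
    root = ET.fromstring(atom_xml)
    ids = set()
    for entry in root.iter(ATOM_NS + "entry"):
        entry_id = (entry.findtext(ATOM_NS + "id") or "").strip()
        if entry_id:
            ids.add(entry_id)
    return ids


def redundant_feeds(feeds):
    """Given {feed URL: feed XML}, list feeds that add no unseen entries.

    Feeds are examined in dictionary order, so the result depends on
    which feed the crawler happened to fetch first.
    """
    seen = set()
    redundant = []
    for url, xml_text in feeds.items():
        ids = entry_ids(xml_text)
        if ids and ids <= seen:
            redundant.append(url)
        seen |= ids
    return redundant
```

Note the catch: the crawler still has to fetch each feed at least once before it can tell that the feed is redundant.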

Boy, am I an optimist.

(DWiki can't mark the calendar navigation links 'nofollow', because I want web spiders to follow them to find older blog entries. How else are they going to do it? (Web spiders not infrequently shy away from links with '?' or other URL parameters in them, which makes me nervous about counting on the 'See As Normal' link to lead spiders to plain directory traversal.))

Written on 26 September 2005.

Last modified: Mon Sep 26 01:23:02 2005