The drawback of having a dynamic site with a lot of URLs on today's web

February 5, 2020

Wandering Thoughts, this blog, is dynamically generated from an underlying set of entries (more or less). As is common with dynamically generated sites, which often have a different URL structure than static sites, it has a lot of automatically generated URLs that provide various ways of viewing and accessing the underlying entries, and of course it creates links to those URLs in various places. Once upon a time this was generally fine and I didn't think much about it. These days my attitude has changed and I'm increasingly thinking about how to reduce the number of these automatically generated links (and perhaps to remove some of the URLs themselves).

The issue is that on the modern web, everything gets visited (even things behind links flagged as nofollow, although I wish otherwise). This includes automatically generated pages that are either pointless duplication or actively useless, for example because they don't have any actual content. If you have a highly dynamic web site that generates a lot of these URLs and links to them, sooner or later you'll waste time generating and serving these pages to robots that don't really care.

All of that sounds nicely abstract, so let me be concrete. Every entry on Wandering Thoughts can potentially have comments. Long ago I decided that I should provide syndication feeds for comments (as well as entries), and that every page should link to its relevant syndication feeds at the bottom (pages that are directories have syndication feeds for both entries and comments). Plenty of my entries don't have any comments, so their comment syndication feeds are empty, and even for entries that do have comments the feed is almost entirely static (since new comments are rare, especially on old entries). But because there are links to all of these comment syndication feeds, they get periodically fetched by crawlers.

(To put concrete numbers on this, in the past ten days over 2300 different entries here have had their comment syndication feeds retrieved, a number of them repeatedly. Some of the fetching is from web spiders that admit it, some of it may be from people clicking on links out of curiosity, and some of it is certainly from people and software that are cloaking their real activity. Spread over ten days, this is not a gigantic amount of traffic.)
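
(For anyone who wants to get a similar count for their own site, here is a rough sketch of one way to do it from web server access logs. It is not the script I used. It assumes Common Log Format style request lines and assumes that comment feeds are requested as the entry's URL with a '?atomcomments' suffix; the FEED_MARKER value and the function name are illustrative, so adjust them to your own setup.)

```python
import re
from collections import Counter

# Assumption: a comment feed is requested as "<entry path>?atomcomments".
# Change this marker to match however your own site names these feeds.
FEED_MARKER = "?atomcomments"

# Pulls the URL out of the quoted request in a Common Log Format line,
# e.g. "GET /blog/some/entry?atomcomments HTTP/1.1".
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')


def comment_feed_fetches(log_lines, marker=FEED_MARKER):
    """Return a Counter mapping entry path -> number of comment feed fetches."""
    counts = Counter()
    for line in log_lines:
        m = REQUEST_RE.search(line)
        if not m:
            continue  # not a GET/HEAD request line we recognize
        url = m.group(1)
        if url.endswith(marker):
            # Strip the marker so fetches are grouped by entry.
            counts[url[: -len(marker)]] += 1
    return counts
```

The number of distinct entries is then len() of the result, and the entries fetched repeatedly are the ones with a count above 1. This makes no attempt to separate honest crawlers from cloaked ones; that needs the User-Agent and source IP as well.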

Once upon a time I would have had no qualms about exposing all of these comment syndication feed links as the right thing to do even in the face of this. These days I'm not so sure. I'm not going to make the feeds themselves go away, but I am considering not putting in the links in at least some cases (for example, if there are no comments on the entry). That would at least reduce how many visible links Wandering Thoughts generates, and somewhat reduce the pointless crawling and indexing that people are doing.
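
(As a minimal sketch of the sort of rule I have in mind, and not DWiki's actual code: the Page class, its fields, and the function name here are all hypothetical. It also assumes that directory pages keep their comment feed links, which is a choice I haven't actually settled on.)

```python
from dataclasses import dataclass


@dataclass
class Page:
    # Hypothetical stand-in for whatever the wiki knows about a page.
    path: str
    is_directory: bool
    comment_count: int


def show_comment_feed_link(page):
    """Decide whether to render a link to the page's comment feed.

    The feed itself still exists either way; this only controls whether
    the page advertises it with a visible link.
    """
    if page.is_directory:
        # Assumption: directory-level comment feeds stay linked.
        return True
    # Entries with no comments have an empty feed, so don't link to it.
    return page.comment_count > 0
```

Because only the link goes away, anyone who already subscribes to a comment feed or constructs the URL by hand still gets it. The link would reappear as soon as an entry got its first comment.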

(I've actually been quietly reducing the number of syndication feeds that Wandering Thoughts exposes for some time, but the previous cases have been easy ones. Interested parties can see the 'atomfeed-virt-only-*' settings in DWiki's configuration file.)

Last modified: Wed Feb 5 22:59:33 2020