Web spiders (or people) can invent unfortunate URLs for your website

February 2, 2025

Let's start with my Fediverse post:

Today in "spiders on the Internet do crazy things": my techblog lets you ask for a range of entries. Normally the range that people ask for is, say, ten entries (the default, which is what you normally get links for). Some deranged spider out there decided to ask for a thousand entries at once and my blog engine sighed, rolled up its sleeves, and delivered (slowly and at large volume).

In related news, my blog engine can now restrict how large a range people can ask for (although it's a hack).

DWiki is the general wiki engine that creates Wandering Thoughts. As part of its generality, it has a feature that shows a range of 'pages' (in Wandering Thoughts these are entries, in general these are files in a directory tree), through what I call virtual directories. As is usual with these things, the range of entries (pages, files) that you're asking for is specified in the URL, with syntax like '<whatever>/range/20-30'.
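As an illustration of the URL syntax (this is not DWiki's actual code), here is a minimal Python sketch of pulling a range out of a request path. The `RANGE_RE` pattern and the `parse_range` name are hypothetical, and the sketch assumes both ends of the range are plain non-negative integers.

```python
import re

# Hypothetical pattern for paths ending in '/range/N-M'; DWiki's real
# URL handling is not shown in this entry and may differ.
RANGE_RE = re.compile(r"/range/(\d+)-(\d+)/?$")


def parse_range(path):
    """Return (start, end) for a path ending in /range/N-M, else None."""
    m = RANGE_RE.search(path)
    if m is None:
        return None
    start, end = int(m.group(1)), int(m.group(2))
    # Treat a backwards range as invalid (an assumption of this sketch).
    if end < start:
        return None
    return start, end
```

Note that nothing in this parsing step limits how far apart the two numbers can be, which is how a hand-built URL can ask for far more than the linked ranges do.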

If you visit the blog front page or similar things, the obvious and discoverable range links you get are for ten entries. In some situations you can get links for slightly bigger ranges, but not substantially larger ones. However, the engine didn't particularly restrict the size of these ranges, so if you wanted to create URLs by hand you could ask for very large ranges.

Today, I discovered that two IPs had asked for 1000-entry ranges, and the blog engine provided them. Based on some additional log information, it looks like it's not the first time that giant ranges have been requested. One of those IPs was an AWS IP, for which my default assumption is that this is a web spider of some sort. Even if it's not a conventional web spider, I doubt anyone is asking for a thousand entries at once with the plan of reading them all; that's a huge amount of text, so it's most likely being done to harvest a lot of my entries at once for some purpose.

(Partly because of that and partly because it puts a big load on DWiki, I've now hacked in the feature mentioned above to restrict how large a range you can request. Because it's a hack, too-large ranges get HTTP 404 responses instead of something more useful.)
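The general shape of such a restriction can be sketched in a few lines of Python. This is only an illustration, not the actual DWiki change: the `MAX_RANGE_ENTRIES` value and the `range_status` function are hypothetical, and measuring the range size as `end - start` is an assumption.

```python
from http import HTTPStatus

# Hypothetical limit; the real limit in DWiki is not stated in this entry.
MAX_RANGE_ENTRIES = 100


def range_status(start, end, max_entries=MAX_RANGE_ENTRIES):
    """Return the HTTP status to use for a request for entries start..end."""
    # The size is taken as end - start here, which is an assumption;
    # whether the endpoints are inclusive is not covered in the entry.
    if end < start or end - start > max_entries:
        # Too-large (or nonsensical) ranges are simply treated as not found.
        return HTTPStatus.NOT_FOUND
    return HTTPStatus.OK
```

With this sketch, `range_status(20, 30)` is allowed while `range_status(1, 1001)` gets a 404. A more helpful response might redirect to a smaller range, but that takes more than a quick hack.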

Sidebar: on the "virtual directories" name and feature

All of DWiki's blog parts are alternate views of a directory hierarchy full of files, where each file is a 'page' and in the context of Wandering Thoughts, almost all pages are blog entries (on the web, the 'See as Normal' link at the bottom will show you the actual directory view of something). A 'virtual directory' is a virtual version of the underlying real directory or directory hierarchy that only shows some pages, for example pages from 2025 or a range of pages based on how recent they are.
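Conceptually, a virtual directory is a filter applied to the real set of pages. The following Python sketch shows that idea under stated assumptions: the `Page` type and both functions are hypothetical, and DWiki's real data structures (which work on files in a directory tree) are not shown in this entry.

```python
from datetime import date
from typing import NamedTuple


class Page(NamedTuple):
    # Hypothetical stand-in for a file in the directory hierarchy.
    path: str
    written: date


def pages_in_year(pages, year):
    """A 'by year' view: only the pages written in the given year."""
    return [p for p in pages if p.written.year == year]


def most_recent(pages, start, end):
    """A 'range' view: pages start..end, counting the newest page as 1.

    Treating the range as 1-based and inclusive is an assumption.
    """
    newest_first = sorted(pages, key=lambda p: p.written, reverse=True)
    return newest_first[max(start - 1, 0):end]
```

Either view returns a subset of the same underlying pages, which is why a large enough range amounts to asking for a big chunk of the whole directory at once.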

All of this is a collection of hacks built on top of other hacks, because that's what happens when you start with a file-based wiki engine and decide you can make it be a blog too with only a few little extra features (as a spoiler, it did not wind up requiring only a few extra things). For example, you might wonder how the blog's front page winds up being viewed as a chronological blog instead of a directory, and the answer is a hack.


Comments on this page:

Hi Chris,

Where is the front page? (I know that this question may sound stupid and entitled.) The page with the lowest human mental overhead that I found was "__IndexChron" as opposed to the alphabetical index. It’s easy to see new posts and also easy to browse the list for some older posts that might be worth re-reading. With the chronological listing it seems to yield a more ergonomic mental map.

Is there a way to add something like an abbreviated chronological index (10 entries) to keep server load down for readers like me? (Again, please, ignore in case this is fantastically entitled.)

Thank you, for sharing your time.

By cks at 2025-02-03 00:37:46:

What I consider Wandering Thoughts' front page is here; it presents the usual reverse chronological listing of the most recent few entries, but with full text. There's currently no 'N most recent entries with only titles', in the style of __IndexChron, although you can get a by-year view (which is basically always going to present titles only) or a by-month view (which usually will, except at the start of the month where there's only a few entries). I'm unlikely to add one for various reasons, including that it really belongs as part of a wider rethink of the traditional blog navigation (also, also, among others).

Written on 02 February 2025.

Last modified: Sun Feb 2 19:55:57 2025