The unreasonable effectiveness of web crawlers

December 2, 2014

I have a few test copies of Wandering Thoughts and all of CSpace sitting around here and there; I use them for things like trying out new CSS and other layout stuff, playing with code changes, testing the full weight of CSpace in different web environments, and so on. As it happens, one of those copies sometimes exists on my personal domain. I don't link to these copies from anywhere, of course, as they're test things and I access them from direct URLs. So you can imagine my surprise when one day I discovered that Googlebot and several other crawlers were rummaging through that copy on my personal domain. Of course I became very curious about how they could possibly have found it.

The answer turned out to be that lurking in the DWiki install on my personal domain was a single stray file copied over from CSpace that had a link to '/~cks/'. This link had been there for years, but for all of those years it had led to nothing, until I brought up the test install and left it there. Crawlers had been trying the link all that time and getting 404s on it, but within a few days of the link starting to work, Googlebot tried the URL again, found a page there, and started crawling through the links it found (another crawler also showed up). It was crawling quite enthusiastically, too, and not at all slowly.

(Fortunately I noticed almost immediately and turned the whole thing off again. This was mostly luck; I was watching the logs because I'd been doing some experimentation, so I actually noticed the explosion in traffic volume. Normally I don't look at the logs there for long periods of time.)
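
(If you don't want to rely on luck, a small script can do the watching. What follows is only a rough sketch, not what I actually ran: it assumes the web server writes Apache-style 'combined' format logs, and the function name and the list of user-agent markers are made up for illustration. It counts requests per user agent for anything that looks like a crawler.)

```python
import re
from collections import Counter

# Matches the tail of a combined-format log line:
#   "request" status size "referer" "user-agent"
# Assumption: the server uses the common 'combined' log format.
LOG_TAIL = re.compile(r'"[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"\s*$')


def count_crawler_hits(lines, markers=("bot", "crawler", "spider")):
    """Count requests per user agent for agents that look like crawlers.

    `markers` is a hypothetical list of substrings; real crawlers do not
    always identify themselves this way.
    """
    hits = Counter()
    for line in lines:
        m = LOG_TAIL.search(line)
        if m is None:
            # Not a combined-format line; skip it rather than guess.
            continue
        agent = m.group(2)
        if any(mark in agent.lower() for mark in markers):
            hits[agent] += 1
    return hits
```

(You would feed it the lines of an access log, for example an open file object, and look at hits.most_common() for anything unexpected.)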

What this has shown me rather vividly is that web crawlers are unreasonably effective. If there's a link to something lurking somewhere, no matter how obscure, they're likely to find it, follow it, and crawl everything behind it. Of course I already knew this in theory, since there have been all sorts of stories over the years of search engines indexing things that no one expected them to stumble over (or to stumble over that fast), but it's one thing to read the stories and another thing to have it happen to you.

(The next time around I'll try to remember to put access restrictions up for whatever I'm testing. And to do it before I bring up the test setup.)
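
As an illustration of the sort of restriction I mean, here is a minimal sketch of a WSGI wrapper that only lets in clients from a list of allowed networks and gives everyone else (crawlers included) a 403. This is an assumption-laden example and not DWiki's actual code: it assumes the test copy is served as a WSGI application, and the names restrict_to_networks and demo_app are invented for the example. In practice the same effect is often easier to get from the web server's own access controls.

```python
import ipaddress


def restrict_to_networks(app, allowed):
    """Wrap a WSGI app so only clients in `allowed` networks reach it.

    `allowed` is a list of network strings such as "127.0.0.0/8".
    """
    networks = [ipaddress.ip_network(n) for n in allowed]

    def wrapped(environ, start_response):
        try:
            client = ipaddress.ip_address(environ.get("REMOTE_ADDR", ""))
        except ValueError:
            # Missing or unparseable address: treat as not allowed.
            client = None
        if client is not None and any(client in net for net in networks):
            return app(environ, start_response)
        body = b"Forbidden\n"
        start_response("403 Forbidden", [
            ("Content-Type", "text/plain"),
            ("Content-Length", str(len(body))),
        ])
        return [body]

    return wrapped


def demo_app(environ, start_response):
    """Hypothetical stand-in for the real test application."""
    body = b"test copy\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


# Only requests from the local machine get through to the test copy.
application = restrict_to_networks(demo_app, ["127.0.0.0/8", "::1/128"])
```

Note that REMOTE_ADDR is only meaningful if the application sees the real client address; behind a reverse proxy it will be the proxy's address, and the check would have to be done at the proxy instead.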

Written on 02 December 2014.