Blindly trying to copy a web site these days is hazardous

August 24, 2016

The other day, someone pointed a piece of software called HTTrack at Wandering Thoughts. HTTrack is a piece of free software that makes offline copies of things, so I presume that this person for some reason wanted this. I don't think it went as they intended and wanted.

The basic numbers are there in the logs. Over the course of a bit over 18 hours, they made 72,393 requests and received just over 193 MBytes of data. Needless to say, Wandering Thoughts does not have that many actual content pages; at the moment there are a bit over 6400 pages that my sitemap generation code considers to be 'real', some of them with partially duplicated content. How did 6400 pages turn into 72,000? Through what I call 'virtual directories', where various sorts of range based and date based views and so on are layered on top of an underlying directory structure. These dynamic pages multiply like weeds.

(I'm reasonably sure that 72,000 URLs doesn't cover them all by now, although I could be wrong. The crawl does seem to have gotten every real page, so maybe it actually got absolutely everything.)

Dynamic views of things are not exactly uncommon in modern software, and that means that blindly trying to copy a web site is very hazardous to your bandwidth and disk space (and it is likely to irritate the target a lot). You can no longer point a simple crawler (HTTrack included) at a site or a URL hierarchy and say 'follow every link', because it's very likely that you're not going to achieve your goals. Even if you do get 'everything', you're going to wind up with a sprawling mess that has tons of duplicated content.

(Of course HTTrack doesn't respect nofollow, and it also lies in its User-Agent by claiming to be running on Windows 98. For these and other reasons, I've now set things up so that it will be refused service on future visits. In fact I'm in a sufficiently grumpy mood that anything claiming to still be using Windows 98 is now banned, at least temporarily. If people are going to lie in their User-Agent, please make it more plausible. In fact, according to the SSL Server Test, Windows 98 machines can't even establish a TLS connection to this server. Well, I'm assuming that based on the fact that Windows XP fails, as the SSL Server Test doesn't explicitly cover Windows 98.)

PS: DWiki and this host didn't even notice the load from the HTTrack copy. We found out about it more or less by coincidence; a university traffic monitoring system noticed a suspiciously high number of sessions from a single remote IP to the server and sent in a report.

Written on 24 August 2016.
« more, less, and a story of typical Unix fossilization
The single editor myth(ology) »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Wed Aug 24 22:18:27 2016
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.