Blindly trying to copy a web site these days is hazardous
The other day, someone pointed a piece of software called HTTrack at Wandering Thoughts. HTTrack is free software that makes offline copies of web sites, so I presume this person wanted an offline copy for some reason. I don't think it went the way they intended and wanted.
The basic numbers are there in the logs. Over the course of a bit more than 18 hours, they made 72,393 requests and received just over 193 MBytes of data. Needless to say, Wandering Thoughts does not have that many actual content pages; at the moment there are a bit over 6,400 pages that my sitemap generation code considers to be 'real', some of them with partially duplicated content. How did 6,400 pages turn into 72,000? Through what I call 'virtual directories', where various sorts of range-based and date-based views and so on are layered on top of an underlying directory structure. These dynamic pages multiply like weeds.
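As a rough illustration of the multiplication (with invented URL names, not DWiki's actual scheme), here is a sketch of how layering a few date and range views over each real page balloons the URL count:

```python
# Hypothetical sketch: how date- and range-based 'virtual directory'
# views can multiply a modest number of real pages into many more
# crawlable URLs. The view names are made up for illustration.
from datetime import date

# ten 'real' entries, one per day
real_pages = [
    ("2017/06/{:02d}/entry-{}".format(d, d), date(2017, 6, d))
    for d in range(1, 11)
]

urls = set()
for path, when in real_pages:
    urls.add("/blog/" + path)                                   # the real page
    urls.add("/blog/{}/".format(when.year))                     # yearly index view
    urls.add("/blog/{}/{:02d}/".format(when.year, when.month))  # monthly view
    # range views: e.g. 'the next N entries starting here'
    for n in (5, 10, 20):
        urls.add("/blog/range/{}/{}".format(path, n))

print(len(real_pages), len(urls))  # → 10 42
```

Even this toy version more than quadruples the URL count, and a real site has many more such views; a crawler that blindly follows every link visits all of them.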
(I'm reasonably sure that 72,000 URLs doesn't cover them all by now, although I could be wrong. The crawl does seem to have gotten every real page, so maybe it actually got absolutely everything.)
Dynamic views of things are not exactly uncommon in modern software, and that means that blindly trying to copy a web site is very hazardous to your bandwidth and disk space (and it is likely to irritate the target a lot). You can no longer point a simple crawler (HTTrack included) at a site or a URL hierarchy and say 'follow every link', because it's very likely that you're not going to achieve your goals. Even if you do get 'everything', you're going to wind up with a sprawling mess that has tons of duplicated content.
(Of course HTTrack doesn't respect nofollow, and it also lies in its User-Agent by claiming to be running on Windows 98. For these and other reasons, I've now set things up so that it will be refused service on future visits. In fact I'm in a sufficiently grumpy mood that anything claiming to still be using Windows 98 is now banned, at least temporarily. If people are going to lie in their User-Agent, please make it more plausible. In fact, according to the SSL Server Test, Windows 98 machines can't even establish a TLS connection to this server. Well, I'm assuming that based on the fact that Windows XP fails, as the SSL Server Test doesn't explicitly cover Windows 98.)
PS: DWiki and this host didn't even notice the load from the HTTrack copy. We found out about it more or less by coincidence; a university traffic monitoring system noticed a suspiciously high number of sessions from a single remote IP to the server and sent in a report.