Blindly trying to copy a web site these days is hazardous

August 24, 2016

The other day, someone pointed a piece of software called HTTrack at Wandering Thoughts. HTTrack is a piece of free software that makes offline copies of things, so I presume that this person wanted an offline copy of this blog for some reason. I don't think it went as they intended and wanted.

The basic numbers are there in the logs. Over the course of a bit over 18 hours, they made 72,393 requests and received just over 193 MBytes of data. Needless to say, Wandering Thoughts does not have that many actual content pages; at the moment there are a bit over 6400 pages that my sitemap generation code considers to be 'real', some of them with partially duplicated content. How did 6400 pages turn into 72,000? Through what I call 'virtual directories', where various sorts of range-based and date-based views and so on are layered on top of an underlying directory structure. These dynamic pages multiply like weeds.

(I'm reasonably sure that 72,000 URLs doesn't cover them all by now, although I could be wrong. The crawl does seem to have gotten every real page, so maybe it actually got absolutely everything.)
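To make the multiplication concrete, here's a rough sketch. The paths and counts are hypothetical stand-ins, not DWiki's actual URL scheme, but they show how date and range views layered over the same entries fan out into far more crawlable URLs than there are real pages:

    from datetime import date, timedelta
    import itertools

    # Purely hypothetical numbers and paths: pretend one entry was posted
    # per day for roughly eleven years.
    start = date(2005, 8, 24)
    days = [start + timedelta(days=i) for i in range(11 * 365)]

    urls = set()
    for d in days:
        # The single 'real' content page for that day's entry.
        urls.add(f"/blog/{d:%Y/%m/%d}/entry")
        # Date-based index views layered on top of the same content.
        urls.add(f"/blog/{d:%Y}/")
        urls.add(f"/blog/{d:%Y/%m}/")
        urls.add(f"/blog/{d:%Y/%m/%d}/")

    # Range-based paging views ('the next 20 entries', etc.) multiply things
    # further; model them crudely as ten extra pages per monthly index.
    months = {f"/blog/{d:%Y/%m}/" for d in days}
    urls |= {f"{m}range/{i}/" for m, i in itertools.product(months, range(10))}

    print(len(days), "entries ->", len(urls), "distinct crawlable URLs")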

Dynamic views of things are not exactly uncommon in modern software, and that means that blindly trying to copy a web site is very hazardous to your bandwidth and disk space (and it is likely to irritate the target a lot). You can no longer point a simple crawler (HTTrack included) at a site or a URL hierarchy and say 'follow every link', because it's very likely that you're not going to achieve your goals. Even if you do get 'everything', you're going to wind up with a sprawling mess that has tons of duplicated content.

(Of course HTTrack doesn't respect nofollow, and it also lies in its User-Agent by claiming to be running on Windows 98. For these and other reasons, I've now set things up so that it will be refused service on future visits. In fact I'm in a sufficiently grumpy mood that anything claiming to still be using Windows 98 is now banned, at least temporarily. If people are going to lie in their User-Agent, please make it more plausible. In fact, according to the SSL Server Test, Windows 98 machines can't even establish a TLS connection to this server. Well, I'm assuming that based on the fact that Windows XP fails, as the SSL Server Test doesn't explicitly cover Windows 98.)
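For illustration only, a User-Agent based refusal of this sort could be done with something like the following WSGI-style sketch; this is not how DWiki or this server actually does it, and the banned fragments are just examples:

    # Hypothetical User-Agent banning as WSGI middleware; the fragment list
    # and the mechanism are illustrative, not this site's actual setup.
    BANNED_UA_FRAGMENTS = ("HTTrack", "Windows 98")

    def ua_ban_middleware(app):
        def wrapped(environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if any(frag in ua for frag in BANNED_UA_FRAGMENTS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Access denied.\n"]
            return app(environ, start_response)
        return wrapped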

PS: DWiki and this host didn't even notice the load from the HTTrack copy. We found out about it more or less by coincidence; a university traffic monitoring system noticed a suspiciously high number of sessions from a single remote IP to the server and sent in a report.


Comments on this page:

By Aneurin Price at 2016-08-25 17:38:47:

It's clearly somebody with a poor or intermittent internet connection trying to get reference material available offline, who's deliberately kept their requests to one per second so as to be certain that they're not causing anyone any problems.

In the unlikely event that I noticed this on the websites I administer (in between the flood of bogus connections trying to exploit SSL bugs), I can't imagine myself being even slightly annoyed, so I'm curious as to why this makes you so angry.

By cks at 2016-08-25 23:56:31:

To be clear, I'm not angry, just vaguely irritated, and there are a few reasons for that. The first is that despite not having any real impact on the system, this was a drastic increase in traffic; those 72,000 requests were by themselves a bit over three times the usual daily volume. And almost all of them were useless requests and thus useless traffic, simply fetching tens of thousands of variations of index pages.

Another is that anyone who wants to get all of the content here can do so very easily if they pay even the littlest bit of attention. There is a 'Full index of entries' link in the sidebar; point your crawler at that page, tell it to follow one level of links, and you're almost completely done. With a little more work you'll have all of the comments as well. If someone is interested in the contents of Wandering Thoughts but can't be bothered to go to even that level of effort and attention, I am not very interested in helping give them a ton of index pages. People deploying HTTrack against here are very likely to be doing so blindly, so.
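A minimal sketch of that approach, using only the Python standard library; the index URL is a placeholder rather than the real sidebar link, and a real run should of course be politely rate-limited:

    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkCollector(HTMLParser):
        """Collect the href of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)

    # Placeholder; in practice this would be the 'Full index of entries'
    # URL from the sidebar.
    INDEX_URL = "https://example.org/blog/full-index/"

    with urllib.request.urlopen(INDEX_URL) as resp:
        collector = LinkCollector()
        collector.feed(resp.read().decode("utf-8", "replace"))

    # One level of links from the index is essentially every real entry,
    # with none of the date/range index views.
    for href in collector.links:
        entry_url = urljoin(INDEX_URL, href)
        # ... fetch and save entry_url here ...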

(I also have long-standing views that web spiders had better respect nofollow. HTTrack is a web spider (and explicitly checks robots.txt), but it violates this.)

By Aneurin Price at 2016-08-26 09:14:39:

Yeah, that makes sense.

By the way, not that you have any reason to care, but it amused me for a few minutes: I tried accessing https://utcc.utoronto.ca/ in Windows 98. The result is that IE 5 (which it comes with out of the box) throws the expected SSL error, but Opera 9.64 (the last version that officially supported Win 98) works just fine.

So now you know :-).

To be fair to the HTTrack developers, there are an annoying number of lazy admins who will ban well-behaved scrapers simply because it's easier to set over-broad robots.txt rules and string-match honest User-Agents than to analyze logs for bad behaviour... even if, like me, you go above and beyond (writing a custom scraper) to ensure you download only what you need, cache more heavily (only ever one HTTP request for a given URL, even if that causes staleness issues), and download fewer supplementary resources (none) than real browsers.
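A simplified sketch of that 'only ever one HTTP request for a given URL' caching (illustrative only, not the actual scraper):

    import urllib.request

    _cache = {}  # url -> body bytes; never refetched, even if that means staleness

    def fetch_once(url, user_agent="Mozilla/5.0 (placeholder)"):
        """Fetch a URL at most once per run; later calls reuse the cached body."""
        if url not in _cache:
            req = urllib.request.Request(url, headers={"User-Agent": user_agent})
            with urllib.request.urlopen(req) as resp:
                _cache[url] = resp.read()
        return _cache[url]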

In one case (Fanfiction.net), they are so anti-bot that, for a long time, they set a blanket robots.txt ban up to and including GoogleBot.

I went so far as to write a Python helper module to automate the process of harvesting identifying headers like User-Agent from the default browser on the system and waiting a random amount of time between requests (currently an implementation of how the wget manpage describes --random-wait, but it's more about abstracting it behind an API so I'm free to respond to "real humans aren't that random" statistical analysis attacks if necessary).
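A simplified sketch of that --random-wait behaviour, which the wget manpage describes as waiting between 0.5 and 1.5 times the base interval (illustrative only, not the actual helper module):

    import random
    import time

    def random_wait(base_seconds=1.0):
        """Sleep for 0.5x to 1.5x the base interval, wget --random-wait style."""
        time.sleep(random.uniform(0.5, 1.5) * base_seconds)

    # Typical use between requests in a polite scraper:
    #   for url in urls_to_fetch:
    #       page = fetch(url)      # hypothetical fetch function
    #       random_wait(1.0)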

(If they want to ban me for automating drudge-work like polling image board threads for updates once a day or generating an LRF eBook from a fanfic before I run out the door, they can try to pick me out among the much less polite behaviour of actual web browsers. Hell, if they target my bots for being too polite, I'll start downloading CSS, scripts, images, and fonts that will be discarded in my HTML post-processing, just to muddy the waters.)

I'm sure you will disagree with me, as you have a strong opinion on nofollow, but I disagree with you and I will explain why.

The universally accepted way to forbid exploration by automated programs is robots.txt.

By trying to use nofollow to control it, you are trying to game the system.

First, it was intended for use only by search engines, not general scrapers.

Second, it was intended to fight spam posted on blogs and wikis for SEO purposes, never to control robots.

Third, you can only control nofollow on your own pages, so your method has an intrinsic flaw, because other pages can link to them.

The fundamental reason you chose to game the system is that you chose a URL scheme that is not compatible with robots.txt, which predates the creation of your site. Moreover, you could switch your URL mapping to a robots.txt-compatible mapping, with proper redirections to keep existing URLs working. Your attitude really feels to me like "I'm right and the whole external world is wrong".
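For reference, this is the mechanism being pointed at: a crawler that honours robots.txt checks each URL against the site's published rules before fetching it, for example with Python's standard urllib.robotparser (the site below is a placeholder):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.org/robots.txt")  # placeholder site
    rp.read()

    # robots.txt is checked per URL by the crawler itself; nofollow, by
    # contrast, is a per-link hint embedded in individual pages.
    if rp.can_fetch("MyCrawler/1.0", "https://example.org/blog/2016/08/"):
        print("allowed to fetch")
    else:
        print("disallowed by robots.txt")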

