Wget is not welcome here any more (sort of)

December 3, 2018

Today, someone at a large chipmaker that will go unnamed decided (or apparently decided) that they would like their own archived copy of Wandering Thoughts. So they did what one does here; they got out wget, pointed it at the front page of the blog, and let it go. I was lucky in a way; they started this at 18:05 EST and I coincidentally looked at my logs around 19:25, at which point they had already made around 3,000 requests because that's what wget does when you turn it loose. This is not the first time that people have had the bright idea to just turn to wget to copy part or all of Wandering Thoughts (someone else did it in early October, for example), and it will not be the last time. However, it will be the last time they're going to be even partially successful, because I've now blocked wget's default User-Agent.
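
As a rough illustration, and not necessarily the exact rule in use here, one minimal way to do this sort of block with Apache's mod_rewrite in a .htaccess file is:

    # Refuse any request whose User-Agent starts with "Wget/", which is
    # what wget sends by default; [F] makes Apache answer 403 Forbidden.
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} ^Wget/ [NC]
    RewriteRule ^ - [F]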

I'm not doing this because I'm under any illusion that it will stop people from grabbing a copy of Wandering Thoughts, and in fact I don't care if people do that; if nothing else, there are plenty of alternatives to wget (starting with, say, curl). I'm doing this because wget's spidering options are dangerous by default. If you do the simplest, most obvious thing with wget, you flood your target site and perhaps even spill over from it onto other sites. And, to be clear and in line with my general views, these unfortunate results aren't the fault of the people using wget; they're following the obvious path of least resistance, and it's not their fault that the obvious path turns out to be a bad idea.

(I could hope that someday wget will change its defaults so that they're not dangerous, but given the discussion in its manual about options like --random-wait, I am not going to hold my breath on that one.)

Wget is a power tool without adequate safeguards for today's web, so if you are going to use it on Wandering Thoughts, all I can do is force you to at least slow down, go out of your way a little bit, and perhaps think about what you're doing. This doesn't guarantee that people who want to use wget on Wandering Thoughts will actually set it up right so that it behaves well, but there is now at least a chance. And if they configure wget so that it works but don't make it behave well, I'm going to feel much less charitable about the situation; these people will have chosen to deliberately climb over a fence, even if it is a low fence.
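
Just to illustrate how far from the defaults a well-behaved run is, a wget invocation that at least tries to be polite might look something like this (the flags are real wget options, but the specific values and the URL are placeholders, not a recommendation):

    # Mirror one area while staying below the starting URL, pausing between
    # requests, capping the transfer rate, and identifying who is fetching.
    wget --mirror --no-parent \
         --wait=5 --random-wait \
         --limit-rate=100k \
         --user-agent="example-archiver/0.1 (contact: you@example.com)" \
         https://example.org/blog/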

As a side note, one reason that I'm willing to do this at all is that I've checked the logs here going back a reasonable amount of time and found basically no non-spidering use of wget. There is a trace amount of it, and I'm sorry for the people behind it, but: please just switch to curl.

(I've considered making my wget block send a redirect to a page that explains the situation, but that would take more energy and more wrestling with Apache .htaccess than I currently have. Perhaps if it comes up a lot.)

PS: The people responsible for the October incident actually emailed me and were quite apologetic about how their wget usage had gotten away from them. That it did get away from them despite them trying to do a reasonable job shows just how sharp-edged a tool wget can be.

PPS: I'm somewhat goring my own ox with this, because I have a set of little wget-based tools and now I'm going to have to figure out what I want to do with them to keep them working on here.
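
Presumably the simplest fix is to have those tools send a non-default User-Agent, which wget supports through its -U/--user-agent option; the agent string below is just an illustration:

    # Identify the tool explicitly instead of using the default "Wget/..." string.
    wget --user-agent="some-local-tool/1.0" https://example.org/some/page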


Comments on this page:

From 8.25.197.27 at 2018-12-04 15:43:52:

What about something like this:

    User-agent: wget
    Crawl-Delay: 10

in your robots.txt file.

By Anonymous at 2018-12-05 08:57:53:

If someone did want to use wget to mirror your site, this would be a good place for you to suggest the options you would like them to use.

I agree with your speed bump.

I have long advocated to friends and colleagues that they should close, and lock, every gate, even if it's a lowly 3-foot-tall picket fence.

It's enough to provide a clear demarcation that someone must make a small effort to step over, a distinction that is quite important in a number of circumstances.

By cks at 2018-12-17 11:12:30:

Belatedly: a crawl delay wouldn't fix things, because fast fetching is only one of wget's problems. I also believe that for most people, any substantial crawl delay is equivalent to disabling wget; people mostly don't want to be sitting there for days waiting for wget to finish.

As far as suggesting wget options, that would require research work and that's up to the people who want to use wget on my site. I don't, so they get to do the work. If I actively wanted people to make copies of here I would have suggested ways of doing so, but as it is I am merely passively okay with it.
