Good web scraping is not just about avoiding load

January 8, 2022

I'll start with a little thing from Twitter:

@thatcks: Whelp, something calling itself the ArchiveTeam's ArchiveBot just got itself thoroughly banned from my techblog by having a crawling rate measured in requests a second. Over 27,000 requests today so far. That's not how you do it, people. (Let's see if it notices the 403s.)

@<redacted>: I mean, that would be rude in 1999 but in 2022 my watch could serve triple digits per second of a mostly-plaintext blog without breaking a sweat

One of my opinions here is that good web scraping is not just about avoiding load on the target. Ultimately, good web scraping is about being polite. One of the things that's definitely impolite is overloading the target; harming a scraping target is not a good thing. But another thing that's impolite, at least in my view (and my view is what matters for Wandering Thoughts), is simply being too large a source of requests and traffic. And 27,000 requests from a single source is at least one order of magnitude larger than I normally see, and the single largest regular source is itself an unreasonable one.

(As I noted on Twitter, another impolite thing is scraping URLs that people marked in various ways as not to be scraped. If a web scraper obsessively visits every 'write a comment' page here and so on, it's going to get blocked.)
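One mechanical way for a scraper to honor "don't scrape this" markers is to consult robots.txt before fetching anything. A minimal sketch using Python's standard library (the rules, bot name, and URLs here are hypothetical, purely for illustration):

```python
from urllib import robotparser

# Parse an in-memory robots.txt with hypothetical rules; a real scraper
# would fetch and parse https://example.com/robots.txt instead.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /writecomment/",  # e.g. 'write a comment' pages
])

print(rp.can_fetch("ExampleBot", "https://example.com/blog/entry"))      # True
print(rp.can_fetch("ExampleBot", "https://example.com/writecomment/1"))  # False
```

This only covers robots.txt; honoring nofollow markers and staying out of pages a site clearly doesn't want crawled takes more care than any single library call.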

Web robots and web scrapers often don't care about being polite. Politeness isn't a technical issue or a required technical standard, the way properly using HTTP is; politeness is generally optional. But so is allowing web scrapers to scrape things at all. The more polite a web scraper attempts to be, the better the odds that site operators won't object. Conversely, the higher volume, more obvious, and more intrusive the web scraping is, the more likely people are to have bad reactions.

Of course this raises the question of what exactly is polite behavior. The honest answer is that while I have my own opinions, I don't think there's any Internet consensus, so if you genuinely want to have a polite web scraper, you're going to have to try to work that out for yourself. Expect it to take some work.
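While there's no consensus to point at, one ingredient almost everyone would agree on is pacing your requests to any single host. Here's a small sketch of such a rate limiter in Python (the five-second figure is an assumption of mine, not any kind of standard; tune it to the site):

```python
import time

class HostRateLimiter:
    """Enforce a minimum interval between successive requests to one host."""

    def __init__(self, min_interval=5.0):
        # Five seconds between hits is an assumed 'polite for a small blog'
        # pace, not an Internet standard.
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = time.monotonic()
        remaining = self.min_interval - (now - self._last)
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()
```

A scraper would call `wait()` before each fetch from a given host; everything beyond that (backing off on errors, per-site tuning, stopping when asked) is where the real judgment calls live.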

The obvious additional thing to note is that even today, there are plenty of sites that cannot sustain multiple requests a second from a web scraper (much less from the web scraper and other traffic combined). This is unfortunate, and people write screeds about how it doesn't have to be this way (I've dabbled in this), but the reality of the Internet is that there are plenty of inefficient slow sites, especially small ones like blogs. Throwing that kind of rapid scraping at those sites is more than impolite, it's damaging, and it doesn't really help if you slow down only after you detect that you're doing damage.

Comments on this page:

By James (trs80) at 2022-01-09 11:26:41:

ArchiveTeam generally kicks into gear when sites are going away, so ArchiveBot is designed to be as quick as possible, since scrapes often have to be completed in a matter of days or even hours; that obviously makes it quite impolite. The question is why your site got targeted in the first place. Looking at the active projects, the only plausible one is the rather generic URLs project, which looks like it's been running for a year or so.

I have run ArchiveBot instances in the past, but that was quite some years ago.

By Nick at 2022-01-10 06:06:26:

Seems inevitable. But it's an interesting perspective; I never thought of the additional load that servers are burdened with by these pesky little robotic Imperial Empire Mouse droids of the internet.

That given, as long as they work toward a productive end and don't fail at that purpose, I'm not going to throw a tiff just yet. (Google Search Console threw some complete nonsense usability warnings at me a few times. Kinda pissed me off. Their machines screwing me over because they neglected to program them correctly!)

I have handled archivebot crawls myself, and I typically run those when a site is about to go away or is of particular interest.

In fact, one of the reactions I heard to this article is "we should crawl this site, it looks interesting!"

We do have rate limiting in place. The defaults may seem aggressive for a small site, but the stated rate (27,000/day, or about 0.3 hits/s) does not seem very high. If you can't sustain a hit per second, I have to wonder what kind of bandwidth and hardware you're running...

But yes, when I do a run, I typically nurse it by hand for a few minutes to see how it goes, and if the server seems to be struggling, I turn on the rate limiting, which can be fine-tuned along the way. Returning a 403, BTW, is less obvious than a 5xx error message, but we typically do get the message.

But of course there's so much shit on fire on the internet that sometimes we might forget to be "polite", as you say. I compare this to EMT work: you try to be polite most of the time, but when someone's unconscious, you're not going to ask "I'm sorry sir, you seem to be having a heart attack, would you mind us taking you to the nearest hospital so we can save your life?". You just do it.

And when you do that work long enough, you just start caring less and less. You just say "hi, sorry" and do your thing. It sucks, but then there's work to be done, and sites to be saved.

By cks at 2022-01-10 14:03:24:

For genuine Archive Team crawls of sites going away, I agree that low rate limits aren't desired. If something is shutting down in a week, you want to get as much as you can in the time you have left, and you don't necessarily want to respect 'please don't crawl this' markers. I don't know if this was a genuine Archive Team crawl of here (anyone can fake a user agent), and in any case here isn't going away so I feel no urgency to let the Archive Team crawl it, especially unrestrictedly.

(For sites that the Archive Team merely considers interesting, I strongly feel that the crawler should respect things like sitemaps, nofollow markers, and robots.txt. Anything that respects those won't crawl 27,000 URLs here.)

Those 27,000 requests did not happen over 24 hours; we roll logs at midnight local time and I noticed the volume before noon. When I looked at Apache timestamps, it appeared that the high rate was 4 requests in a single second, and it didn't always do that. This is not enough load to cause actual problems for Wandering Thoughts, or I might have noticed earlier; it's simply enough load to be impolite, in my view.

(It might well be enough volume to cause problems to other sites. For instance, Charlie Stross had an entry reach the top of Hacker News recently, and had to add a comment to the top of it to the effect that the site was loading very slowly because it was on an ancient Athlon.)


Last modified: Sat Jan 8 23:32:06 2022