Good web scraping is not just about avoiding load

January 8, 2022

I'll start with a little thing from Twitter:

@thatcks: Whelp, something calling itself the ArchiveTeam's ArchiveBot just got itself thoroughly banned from my techblog by having a crawling rate measured in requests a second. Over 27,000 requests today so far. That's not how you do it, people. (Let's see if it notices the 403s.)

@<redacted>: I mean, that would be rude in 1999 but in 2022 my watch could serve triple digits per second of a mostly-plaintext blog without breaking a sweat

One of my opinions here is that good web scraping is not just about avoiding load on the target. Ultimately, good web scraping is about being polite. One thing that's definitely impolite is overloading the target; harming a scraping target is not a good thing. But another thing that's impolite, at least in my view (and my view is what matters for Wandering Thoughts), is simply being too large a source of requests and traffic. And 27,000 requests from a single source is at least an order of magnitude more than I normally see, and the single largest regular source is itself an unreasonable one.

(As I noted on Twitter, another impolite thing is scraping URLs that people marked in various ways as not to be scraped. If a web scraper obsessively visits every 'write a comment' page here and so on, it's going to get blocked.)
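The standard mechanism for marking URLs as off limits to scrapers is robots.txt, and Python's standard library can check it directly. This is a minimal sketch; the user agent name, site, and Disallow rule are hypothetical examples, and a real scraper would fetch the site's actual robots.txt with set_url() and read() rather than parsing an inline one.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# In a real scraper you would call rp.set_url("https://example.com/robots.txt")
# and rp.read(); here we parse an inline example so the sketch is self-contained.
rp.parse([
    "User-agent: *",
    "Disallow: /writecomment/",
])

# A polite scraper checks every URL before fetching it.
rp.can_fetch("MyScraper/1.0", "https://example.com/writecomment/entry")  # disallowed
rp.can_fetch("MyScraper/1.0", "https://example.com/blog/entry")          # allowed
```

Pages like 'write a comment' forms are exactly the sort of thing sites exclude this way, which is why a scraper that ignores robots.txt and visits them all stands out so badly.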

Web robots and web scrapers often don't care about being polite. Politeness isn't a technical issue or a required technical standard, the way properly using HTTP is; politeness is generally optional. But so is allowing web scrapers to scrape things at all. The more polite a web scraper tries to be, the better the odds that site operators won't object. Conversely, the higher-volume, more obvious, and more intrusive the scraping is, the more likely people are to react badly.

Of course this raises the question of what exactly is polite behavior. The honest answer is that while I have my own opinions, I don't think there's any Internet consensus, so if you genuinely want to have a polite web scraper, you're going to have to try to work that out for yourself. Expect it to take some work.
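Whatever your definition of polite ends up being, one concrete ingredient is self-imposed rate limiting: spacing requests out by a fixed interval regardless of how fast the site could answer. Here is a minimal sketch of that idea; the class name and the ten-second interval are my own illustrative choices, not any established standard.

```python
import time

class PoliteFetcher:
    """Sketch of self-imposed rate limiting: allow at most one request
    every min_interval seconds, no matter how fast the site responds."""

    def __init__(self, min_interval=10.0):
        self.min_interval = min_interval
        self._last = 0.0  # monotonic timestamp of the previous request

    def wait_turn(self):
        # Sleep off whatever remains of the minimum interval, then
        # record when this request slot was taken.
        remaining = self.min_interval - (time.monotonic() - self._last)
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()

# Usage: call fetcher.wait_turn() before each HTTP request, and send a
# User-Agent header that identifies the scraper and gives contact information.
```

For perspective, at one request every ten seconds a scraper running flat out for a full day makes only 8,640 requests, still well under the 27,000 that prompted this entry.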

The obvious additional thing to note is that even today, there are plenty of sites that cannot sustain multiple requests a second from a web scraper (much less from the web scraper on top of their other traffic). This is unfortunate, and people write screeds about how it doesn't have to be this way (I've dabbled in this myself), but the reality of the Internet is that there are plenty of inefficient, slow sites, especially small ones like blogs. Throwing that kind of rapid scraping at those sites is more than impolite, it's damaging, and it doesn't really help if you only slow down after you detect that you're doing damage.



Last modified: Sat Jan 8 23:32:06 2022