Wandering Thoughts archives


A bad web scraper operating out of OVH IP address space

I'll start with my tweet:

I've now escalated to blocking entire OVH /16s to deal with the referer-forging web scraper that keeps hitting my techblog from OVH network space; they keep moving around too much for /24s.

I have strong views on forged HTTP referers, largely because I look at my Referer logs regularly and bogus entries destroy the usefulness of those logs. Making my logs noisy or useless is a fast and reliable way to get me to block sources from Wandering Thoughts. This particular web scraper hit a trifecta of things that annoy me about forged referers: the referers were bogus (they were for URLs that don't link to here), they were generated by a robot instead of a person, and they happened at volume.

The specific Referer URLs varied, but when I looked at them they were all for the kind of thing that might plausibly link to here; they were all real sites and often for recent blog entries (for example, one Referer URL used today was this openssl.org entry). Some of the Referers have utm_* query parameters that point to Feedburner, suggesting that they came from mining syndication feeds. This made the forged Referers more irritating, because even in small volume I couldn't dismiss them out of hand as completely implausible.

(Openssl.org is highly unlikely to link to here, but other places used as Referers were more possible.)

The target URLs here varied, but whatever software is doing this appears to be repeatedly scraping only a few pages instead of trying to spider around Wandering Thoughts. At the moment it appears to mostly be trying to scrape my recent entries, although I haven't done particularly extensive analysis. The claimed user agents vary fairly widely and cover a variety of browsers and especially of operating systems; today a single IP address claimed to be a Mac (running two different OS X versions), a Windows machine with Chrome 49, and a Linux machine (with equally implausible Chrome versions).
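One way to spot this sort of thing in logs is to group requests by source IP and count how many distinct user agents each IP claimed. The following is only a minimal sketch, not the analysis actually done here; the function names, the threshold of three, and the sample IPs and user agent labels are all made up for illustration, and it assumes you have already parsed your access log into (IP, user agent) pairs.

```python
from collections import defaultdict


def user_agents_by_ip(entries):
    """Map each client IP to the set of distinct User-Agent strings it sent."""
    agents = defaultdict(set)
    for ip, user_agent in entries:
        agents[ip].add(user_agent)
    return agents


def suspicious_ips(entries, threshold=3):
    """Return the IPs that claimed at least `threshold` different user agents."""
    by_ip = user_agents_by_ip(entries)
    return sorted(ip for ip, uas in by_ip.items() if len(uas) >= threshold)


# Hypothetical, already-parsed log entries: (client IP, claimed User-Agent).
# The IPs are private-range placeholders and the agents are simplified labels.
sample = [
    ("10.1.2.3", "Chrome on OS X (version A)"),
    ("10.1.2.3", "Chrome on OS X (version B)"),
    ("10.1.2.3", "Chrome 49 on Windows"),
    ("10.1.2.3", "Chrome on Linux"),
    ("10.9.9.9", "Firefox on Linux"),
    ("10.9.9.9", "Firefox on Linux"),
]

flagged = suspicious_ips(sample)
print(flagged)
```

A threshold like this is a judgment call; a shared NAT gateway can legitimately show several user agents from one IP, so this is a hint to look closer, not proof of anything.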

The specific IP addresses involved vary but they've all come from various portions of OVH network space. Initially there were few enough /24s involved in each particular OVH area that I blocked them by /24, but that stopped being enough earlier this week (when I made my tweet) and I escalated to blocking entire OVH /16s, which I will continue to do as needed. Although this web scraper operates from multiple IP addresses, it appears to add new subnets only somewhat occasionally; my initial set of /24 blocks lasted for a fair while before it started getting through with new sources. So far this web scraper has not appeared anywhere outside of OVH, and with its Referer forging behavior I would definitely notice if it did.

(I've considered trying to block only OVH requests with Referer headers in order to be a little more specific, but doing that with Apache's mod_rewrite appears likely to be annoying and it mostly wouldn't help any actual people, because their web browser would normally send Referer headers too. If there are other legitimate web spiders operating from OVH network space, well, I suggest that they relocate.)
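To make the two blocking policies concrete, here is a small Python sketch of the decision logic. It is not the Apache configuration in use here; the network list uses private-range placeholders rather than real OVH ranges, and `should_block` and its `require_referer` flag are hypothetical names chosen for the example.

```python
import ipaddress

# Placeholder /16s standing in for whatever ranges are actually blocked.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("10.1.0.0/16"),
    ipaddress.ip_network("10.2.0.0/16"),
]


def in_blocked_network(ip):
    """True if the address falls inside any of the blocked networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETWORKS)


def should_block(ip, referer=None, require_referer=False):
    """Decide whether to reject a request.

    With require_referer=False, everything from a blocked network is
    rejected. With require_referer=True, only requests that also carry
    a non-empty Referer header are rejected (the narrower variant).
    """
    if not in_blocked_network(ip):
        return False
    if require_referer:
        return bool(referer)
    return True


print(should_block("10.1.200.7"))
print(should_block("10.1.200.7", referer=None, require_referer=True))
```

The narrower variant lets through requests without a Referer, which is why it does little for real people: an ordinary browser following a link would send the header and be blocked anyway.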

I haven't even considered sending any report about this to OVH. Among many other issues, I doubt OVH would consider this a reason to terminate a paying customer (or to pressure a customer to terminate a sub-customer). This web scraper does not appear to be attacking me, merely sending web requests that I happen not to like.

(By 'today' I mean Saturday, which is logical today for me as I write this even if the clock has rolled past midnight.)

Sidebar: Source count information

Today saw 159 requests from 31 different IP addresses spread across 18 different /24s (and 10 different /16s). The most prolific IPs were the following:


None of these seem to be on any prominent DNS blocklists (not that I really know what's a prominent DNS blocklist any more, but they're certainly not on the SBL, unlike some people who keep trying).
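Counts like the ones at the start of this sidebar (requests, distinct IPs, distinct /24s and /16s) are straightforward to produce with Python's `ipaddress` module. This is a sketch under some assumptions: the input is a list of IPv4 source addresses, one per request, already extracted from the log; the `summarize` function and the sample addresses are invented for the example and are not the real data.

```python
import ipaddress
from collections import Counter


def summarize(ips, top_n=3):
    """Summarize request sources; assumes IPv4 addresses, one per request."""
    per_ip = Counter(ips)
    # strict=False masks off the host bits, giving the containing network.
    nets24 = {ipaddress.ip_network(f"{ip}/24", strict=False) for ip in per_ip}
    nets16 = {ipaddress.ip_network(f"{ip}/16", strict=False) for ip in per_ip}
    return {
        "requests": sum(per_ip.values()),
        "ips": len(per_ip),
        "slash24s": len(nets24),
        "slash16s": len(nets16),
        "top": per_ip.most_common(top_n),
    }


# Placeholder source addresses, one entry per request.
sample_ips = ["10.1.2.3", "10.1.2.3", "10.1.2.9", "10.1.7.1", "10.2.0.5"]

summary = summarize(sample_ips)
print(summary)
```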

web/OVHBadWebScraper written at 01:49:40
