An aggressive, stealthy web spider operating from Microsoft IP space
For a few days I'd been noticing anomalies in some metrics surrounding Wandering Thoughts, but nothing stood out as particularly wrong, and my usual habit of looking at the top IP addresses requesting URLs from here didn't turn up anything. Then today I randomly wound up looking at the user-agents of things making requests here and found something unpleasant under the rock I'd just turned over:
Today I discovered that there appears to be a large scale stealth web crawler operating out of Microsoft IP space, forging the user-agent of a legitimate (if outdated) browser: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15'. Current status: working out how to block this in Apache .htaccess.
By the time I noticed it today, this spider had made over 25,000 requests in somewhat over twelve hours, or at least that many with that specific user agent (with all the volume, it's hard to tell whether it also used other ones). It made these requests from over 5,800 different IPs; over 600 of those IPs are on the Spamhaus SBL CSS, and one of them is in SBL 545445 (a /32 phish server). All of these IP addresses are in various networks in Microsoft's AS 8075, and of course none of them have reverse DNS. As you can tell from the large number of IPs, most of them made only a few requests each, and even the most active made no more than 20 (today, by the time I cut them off). That volume level will fly under the radar of anyone's per-IP ratelimiting.
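The per-UA counts above can be pulled out of a standard Apache log with a short pipeline. A minimal sketch, assuming the common "combined" log format (where the user-agent is the last double-quoted field) and a log file named access_log; the two sample lines are made up so the sketch runs standalone:

```shell
# The exact forged UA string seen in my logs.
UA='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15'

# Two invented sample lines standing in for a real combined-format log,
# so this sketch is self-contained; point at your real log instead.
cat > access_log <<EOF
40.77.1.2 - - [10/Dec/2022:08:00:01 -0500] "GET /blog/ HTTP/1.1" 200 5120 "-" "$UA"
40.77.9.9 - - [10/Dec/2022:08:00:02 -0500] "GET /blog/x HTTP/1.1" 200 4096 "-" "$UA"
EOF

# Total requests with that exact UA (fixed-string match on the quoted field):
grep -cF "\"$UA\"" access_log

# Number of distinct client IPs making those requests (IP is field 1):
grep -F "\"$UA\"" access_log | awk '{print $1}' | sort -u | wc -l
```

Dividing the first number by the second gives the requests-per-IP figure that makes this spider so hard to catch with per-IP limits.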
(Another person reported a similar experience including the low volume per IP. Also, I assume that there is some Microsoft cloud feature for changing your outgoing IP all the time that this spider is exploiting, as opposed to the spider operator having that many virtual machines churning away in Microsoft's cloud.)
This spider seems to have shown up only about five or six days ago. Before then this user agent had no particular prominence in my logs, but in the past couple of days it's gone up to almost 50,000 requests a day. At that request volume, most of what it's doing is spidering or re-spidering uselessly duplicated content; Wandering Thoughts doesn't have that many unique pages.
This user agent is for Safari 15.1, which was released more than a year ago (apparently on October 27th, 2021, or maybe a few days before), and as such is rather out of date by now. Safari on macOS is up to Safari 16, and Safari 15 was (eventually) updated to 15.6.1. I don't know why this spider picked such an out-of-date user agent to forge, but it's convenient; any actual person still running Safari 15.1 needs to update it anyway to pick up security fixes.
(For the moment, the best I could do with my eccentric setup here was to block anyone using the user agent. Blocking by IP address range is annoying, seeing as today's lot of IP addresses are spread over 20 /16s.)
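For what it's worth, a user-agent block like this can be done in a few lines of .htaccess. A minimal sketch, assuming mod_rewrite is available and .htaccess overrides are allowed; matching on the distinctive version tokens rather than the whole string keeps the rule readable:

```apache
# Sketch: refuse requests carrying the forged Safari 15.1 user-agent.
# This also blocks any real Safari 15.1 users, which (as noted above)
# is acceptable since they should be updating anyway.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "Version/15\.1 Safari/605\.1\.15"
RewriteRule ^ - [F]
```

The [F] flag returns a 403 without serving any content, which is cheap for the server even at 50,000 requests a day.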
Sidebar: On the forging of user agents
On the Fediverse, I was asked if it wasn't the case that all user-agent strings were forged in some sense, since these days they're mostly a statement of compatibility. My off-the-cuff answer encapsulates something that I want to repeat here:
There is a widespread de facto standard that spiders, crawlers, and other automated agents must report themselves in their user-agent instead of pretending to be browsers.
To put it one way, humans may impersonate each other, but machines do not get to impersonate humans. Machines that try to are immediately assumed to be up to no good, and there are ample historical reasons for making that assumption.
(See also my views on what your User-Agent header should include.)
The other thing about this is that compatibility is a matter for browsers, not spiders. If your spider claims to be 'compatible' with Googlebot, what you're really asking for is any special treatment people give Googlebot.
(Sometimes this backfires, if people are refusing things to Googlebot.)