2025-02-16
The HTTP status codes of responses from about 21 hours of traffic to here
You may have heard that there are a lot of crawlers out there these days, many of them apparently harvesting training data for LLMs. Recently I've been getting stricter about access to this blog, so for my own interest I'm going to show statistics on what HTTP status codes all of the requests here received over the past 21 hours or so. I think this is about typical, although there may be more blocked things than usual.
I'll start with the overall numbers for all requests:
22792  403  [45%]
 9207  304  [18.3%]
 9055  200  [17.9%]
 8641  429  [17.1%]
  518  301
   58  400
   33  404
    2  206
    1  302
HTTP 403 is the error code that blocked requests get; I'm not sure what's producing the HTTP 400s. The two HTTP 206s were from LinkedIn's bot against a recent entry and completely puzzle me. Some of the blocked accesses are from major web crawlers requesting things that they shouldn't (Bing is a special repeat offender here), but many of them are not. Between the HTTP 403s and the HTTP 429s, about 62% of all requests were rejected and only about 36% got a useful reply.
(With less thorough and active blocks, that would be a lot more traffic for Wandering Thoughts to handle.)
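If you're curious how to pull numbers like these out of a web server log, a few lines of Python will do it. This is a minimal sketch that assumes an Apache-style combined log format and a hypothetical 'access.log' file name; it's not necessarily how I actually generate these numbers.

    from collections import Counter

    # Tally HTTP status codes from an Apache combined-format access log.
    # In that format, the status code is the first field after the quoted
    # request line. 'access.log' is a hypothetical file name.
    counts = Counter()
    with open("access.log") as log:
        for line in log:
            try:
                status = line.split('"')[2].split()[0]
            except IndexError:
                continue  # skip malformed lines
            counts[status] += 1

    total = sum(counts.values())
    for status, n in counts.most_common():
        print(f"{n:6d}  {status}  [{100 * n / total:.1f}%]")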
The picture for syndication feeds is rather different, as you might expect, but not quite as different as I'd like:
9136  304  [39.5%]
8641  429  [37.4%]
3614  403  [15.6%]
1663  200  [ 7.2%]
  19  301
Some of those rejections are for major web crawlers, and almost a thousand are for a pair of prolific, high-volume repeat request sources, but a lot of them aren't. Feed requests account for 23073 of the 50307 total requests, or about 45%. To me this feels quite low for anything plausibly originating from humans; most of the time I expect feed requests to significantly outnumber actual people visiting.
(In terms of my syndication feed rate limiting, there were 19440 'real' syndication feed requests (84% of the total attempts), and 44.4% of them were rate-limited. That's actually a lower level of rate limiting than I expected; possibly various feed fetchers have actually noticed it and reduced their attempt frequency. Another 46.9% made successful conditional GET requests (ones that got an HTTP 304 response), and 8.5% actually fetched feed data.)
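As an illustration of what the HTTP 304 and HTTP 429 numbers mean from the client side, here is a sketch of a well-behaved feed fetcher, using Python's standard urllib and a made-up feed URL. It sends the validators from its last successful fetch so the server can answer with a cheap HTTP 304, and it backs off when it gets an HTTP 429.

    import urllib.request
    import urllib.error

    # Hypothetical feed URL; any Atom or RSS feed works the same way.
    FEED_URL = "https://example.org/blog/?atom"

    def fetch_feed(etag=None, last_modified=None):
        req = urllib.request.Request(FEED_URL)
        # Send the validators from the previous successful fetch so the
        # server can reply with a cheap 304 Not Modified.
        if etag:
            req.add_header("If-None-Match", etag)
        if last_modified:
            req.add_header("If-Modified-Since", last_modified)
        try:
            with urllib.request.urlopen(req) as resp:
                # HTTP 200: new feed data; remember the new validators.
                return (resp.read(), resp.headers.get("ETag"),
                        resp.headers.get("Last-Modified"))
        except urllib.error.HTTPError as e:
            if e.code == 304:
                # Feed unchanged; keep the old validators and move on.
                return None, etag, last_modified
            if e.code == 429:
                # Rate limited; a polite fetcher waits at least the
                # Retry-After interval (often a count of seconds, though
                # it can also be an HTTP date) before trying again.
                print("rate limited, retry after",
                      e.headers.get("Retry-After", "unknown"))
                return None, etag, last_modified
            raise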
DWiki, the wiki engine behind the blog, has a concept of alternate 'views' of pages. Syndication feeds are alternate views, but so are a bunch of other things. Excluding syndication feeds, the picture for requests of alternate views of pages is:
5499  403
 510  200
  39  301
   3  304
The most blocked alternate views are:
1589  ?writecomment
1336  ?normal
1309  ?source
 917  ?showcomments
(The most successfully requested view is '?showcomments', which isn't really a surprise to me; I expect search engines to look through that, for one.)
If I look only at plain requests, not requests for syndication feeds or alternate views, I see:
13679  403  [64.5%]
 6882  200  [32.4%]
  460  301
   68  304
   58  400
   33  404
    2  206
    1  302
This means the traffic breaks down into 21183 normal requests (42%), 23073 feed requests (45%), and a remainder of requests for alternate views, almost all of which were rejected.
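To make the arithmetic explicit, here's how that breakdown works out from the totals above; the remainder of 6051 alternate-view requests matches the sum of the alternate-view status counts earlier.

    # Request category totals, taken from the numbers in this entry.
    total = 50307
    feeds = 23073             # all syndication feed requests
    normal = 21183            # plain page requests
    altviews = total - feeds - normal   # 6051, matching the earlier sum

    for label, n in [("normal", normal), ("feeds", feeds),
                     ("alt views", altviews)]:
        print(f"{label:9s}  {n:5d}  [{100 * n / total:.1f}%]")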
Out of the HTTP 403 rejections across all requests, the 'sources' break down something like this:
7116  Forged Chrome/129.0.0.0 User-Agent
1451  Bingbot
1173  Forged Chrome/121.0.0.0 User-Agent
 930  PerplexityBot ('AI' LLM data crawler)
 915  Blocked sources using a 'Go-http-client/1.1' User-Agent
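My actual blocks live in my web server's configuration and are more involved than this (for example, the Go-http-client block only applies to certain sources), but to give a concrete if purely illustrative idea of the sort of User-Agent matching involved:

    import re

    # Illustrative User-Agent patterns only; see the update below about
    # the 'Chrome/129.0.0.0' matches in particular.
    BLOCKED_UA_PATTERNS = [
        re.compile(r"Chrome/129\.0\.0\.0"),
        re.compile(r"Chrome/121\.0\.0\.0"),
        re.compile(r"PerplexityBot"),
        re.compile(r"^Go-http-client/1\.1$"),
    ]

    def is_blocked_ua(user_agent: str) -> bool:
        return any(p.search(user_agent) for p in BLOCKED_UA_PATTERNS)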
Those HTTP 403 rejections came from 12619 different IP addresses, in contrast to the successful requests (HTTP 2xx and 3xx status codes), which came from 18783 different IP addresses. After looking into the ASN breakdown of those IPs, I've decided that I can't write anything about them with confidence. It's possible that part of what is going on is that I have mis-firing blocking rules; alternately, I'm being hit from a big network of compromised machines being used as proxies, perhaps the same network that is the Chrome/129.0.0.0 source. However, some of the ASNs that show up prominently are definitely ones I recognize from other contexts, such as attempted comment spam.
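Counting distinct IP addresses per outcome is the same sort of small log-processing exercise as the earlier status tally, under the same combined-log-format and 'access.log' assumptions:

    from collections import defaultdict

    # Collect the set of client IPs behind each outcome: HTTP 403s
    # versus successful (2xx and 3xx) responses. The client IP is the
    # first field of a combined-format log line.
    ips_by_outcome = defaultdict(set)
    with open("access.log") as log:
        for line in log:
            try:
                ip = line.split()[0]
                status = line.split('"')[2].split()[0]
            except IndexError:
                continue
            if status == "403":
                ips_by_outcome["403"].add(ip)
            elif status[0] in "23":
                ips_by_outcome["2xx/3xx"].add(ip)

    for outcome, ips in sorted(ips_by_outcome.items()):
        print(f"{outcome}: {len(ips)} distinct IPs")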
Update: Well, that was a learning experience about actual browser User-Agents. Those 'Chrome/129.0.0.0' User-Agents may well not have been forged after all (although people really should be running more current versions of Chrome). I apologize to the people using real current Chrome versions who were temporarily unable to read the blog because of my overly-aggressive blocks.