Wandering Thoughts archives

2025-02-16

The HTTP status codes of responses from about 21 hours of traffic to here

You may have heard that there are a lot of crawlers out there these days, many of them apparently harvesting training data for LLMs. Recently I've been getting more strict about access to this blog, so for my own interest I'm going to show statistics on what HTTP status codes all of the requests to here got over the past 21 hours or so. I think this is about typical, although there may be more blocked requests than usual.

I'll start with the overall numbers for all requests:

 22792 403      [45%]
  9207 304      [18.3%]
  9055 200      [17.9%]
  8641 429      [17.1%]
   518 301
    58 400
    33 404
     2 206
     1 302

HTTP 403 is the error code that people get on blocked access; I'm not sure what's producing the HTTP 400s. The two HTTP 206s were from LinkedIn's bot against a recent entry and completely puzzle me. Some of the blocked access is major web crawlers requesting things that they shouldn't (Bing is a special repeat offender here), but many of them are not. Between HTTP 403s and HTTP 429s, 62% or so of the requests overall were rejected and only 36% got a useful reply.

(With less thorough and active blocks, that would be a lot more traffic for Wandering Thoughts to handle.)
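(This isn't how I actually generate these numbers, but a minimal sketch of tallying status codes from a Combined Log Format access log might look like the following; the field layout it assumes is the standard Apache-style one.)

```python
from collections import Counter

def status_counts(log_lines):
    """Tally HTTP status codes from Combined Log Format lines.

    Assumes the status code is the first field after the quoted
    request, as in:
    1.2.3.4 - - [16/Feb/2025:00:00:00 -0500] "GET / HTTP/1.1" 200 1234 "-" "UA"
    """
    counts = Counter()
    for line in log_lines:
        # Split on the closing quote of the request so spaces inside
        # the request line itself don't throw off the field count.
        try:
            after_req = line.split('" ', 1)[1]
            status = after_req.split()[0]
        except IndexError:
            continue
        if status.isdigit():
            counts[status] += 1
    return counts

def with_percentages(counts):
    """Yield (count, status, percent) rows sorted by count, descending."""
    total = sum(counts.values())
    for status, n in counts.most_common():
        yield n, status, 100.0 * n / total
```

Printing `with_percentages()` output with a couple of format widths gets you tables like the ones here.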

The picture for syndication feeds is rather different, as you might expect, but not quite as different as I'd like:

  9136 304    [39.5%]
  8641 429    [37.4%]
  3614 403    [15.6%]
  1663 200    [ 7.2%]
    19 301

Some of those rejections are for major web crawlers and almost a thousand are for a pair of prolific, high-volume repeat request sources, but a lot of them aren't. Feed requests account for 23073 requests out of a total of 50307, or about 45% of the requests. To me this feels quite low for anything plausibly originated from humans; most of the time I expect feed requests to significantly outnumber actual people visiting.

(In terms of my syndication feed rate limiting, there were 19440 'real' syndication feed requests (84% of the total attempts), and out of them 44.4% were rate-limited. That's actually a lower level of rate limiting than I expected; possibly various feed fetchers have actually noticed it and reduced their attempt frequency. 46.9% made successful conditional GET requests (ones that got an HTTP 304 response) and 8.5% actually fetched feed data.)
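(For anyone unfamiliar with conditional GETs: the server compares the validators the client sends against the feed's current ones and answers 304 instead of 200 when nothing has changed. This is not DWiki's actual code, just a minimal sketch of that decision; per RFC 9110, If-None-Match takes precedence over If-Modified-Since, and the exact-string date comparison here is a common simplification.)

```python
def conditional_get_status(request_headers, current_etag, current_last_modified):
    """Decide 200 vs 304 for a GET, given the client's cache validators.

    A hypothetical sketch, not DWiki's real logic. `current_last_modified`
    is an HTTP-date string, compared for exact equality for simplicity.
    """
    inm = request_headers.get('If-None-Match')
    if inm is not None:
        # A matching ETag (or '*') means the client's copy is current.
        etags = {t.strip() for t in inm.split(',')}
        if '*' in etags or current_etag in etags:
            return 304
        return 200
    ims = request_headers.get('If-Modified-Since')
    if ims is not None and ims == current_last_modified:
        return 304
    return 200
```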

DWiki, the wiki engine behind the blog, has a concept of alternate 'views' of pages. Syndication feeds are alternate views, but so are a bunch of other things. Excluding syndication feeds, the picture for requests of alternate views of pages is:

  5499 403
   510 200
    39 301
     3 304

The most blocked alternate views are:

  1589 ?writecomment
  1336 ?normal
  1309 ?source
   917 ?showcomments

(The most successfully requested view is '?showcomments', which isn't really a surprise to me; I expect search engines to look through that, for one.)
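(Since views are selected with a query string, splitting traffic into feeds, alternate views, and plain requests is mostly a matter of looking at the first query parameter. A sketch of that classification, using the view names from the table above; treating '?atom' as the feed view is my assumption here, for illustration.)

```python
from urllib.parse import urlsplit

# View names taken from the table above; '?atom' as the syndication
# feed view is an assumption for this sketch.
FEED_VIEWS = {'atom'}
ALT_VIEWS = {'writecomment', 'normal', 'source', 'showcomments'}

def classify_request(url):
    """Classify a request URL as 'feed', 'altview', or 'plain'."""
    query = urlsplit(url).query
    if not query:
        return 'plain'
    # The view name is the first query parameter's key.
    view = query.split('&', 1)[0].split('=', 1)[0]
    if view in FEED_VIEWS:
        return 'feed'
    if view in ALT_VIEWS:
        return 'altview'
    return 'plain'
```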

If I look only at plain requests, not requests for syndication feeds or alternate views, I see:

 13679 403   [64.5%]
  6882 200   [32.4%]
   460 301
    68 304
    58 400
    33 404
     2 206
     1 302

This means the overall breakdown of traffic is 21183 plain requests (42%), 23073 feed requests (45%), and the remainder requests for alternate views, almost all of which were rejected.

Out of the HTTP 403 rejections across all requests, the 'sources' break down something like this:

  7116 Forged Chrome/129.0.0.0 User-Agent
  1451 Bingbot
  1173 Forged Chrome/121.0.0.0 User-Agent
   930 PerplexityBot ('AI' LLM data crawler)
   915 Blocked sources using a 'Go-http-client/1.1' User-Agent

Those HTTP 403 rejections came from 12619 different IP addresses, in contrast to the successful requests (HTTP 2xx and 3xx codes), which came from 18783 different IP addresses. After looking into the ASN breakdown of those IPs, I've decided that I can't write anything about them with confidence. It's possible that part of what is going on is that I have mis-firing blocking rules; alternately, I may be being hit from a big network of compromised machines being used as proxies, perhaps the same network that is the Chrome/129.0.0.0 source. However, some of the ASNs that show up frequently are definitely ones I recognize from other contexts, such as attempted comment spam.
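(Counting distinct IPs per outcome is straightforward once you have (ip, status) pairs out of the log; a small sketch, using the same 2xx/3xx-versus-everything-else grouping as above.)

```python
from collections import defaultdict

def unique_ips_by_outcome(records):
    """Count distinct client IPs per outcome class.

    `records` is an iterable of (ip, status) pairs. Statuses in the
    2xx/3xx range count as 'ok', everything else as 'rejected' -- a
    simplification of the grouping used in the text.
    """
    ips = defaultdict(set)
    for ip, status in records:
        outcome = 'ok' if 200 <= status < 400 else 'rejected'
        ips[outcome].add(ip)
    return {outcome: len(addrs) for outcome, addrs in ips.items()}
```

Note that an IP can land in both classes (blocked on one request, successful on another), which is why the two sets aren't disjoint.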

Update: Well, that was a learning experience about actual browser User-Agents. Those 'Chrome/129.0.0.0' User-Agents may well not have been so forged (although people really should be running more current versions of Chrome). I apologize to the people using real current Chrome versions who were temporarily unable to read the blog because of my overly aggressive blocks.

web/HTTPStatusCodesHere-2025-02-16 written at 23:06:54;

