The HTTP status codes of responses from about 22 hours of traffic to here (part 2)

May 2, 2025

A few months ago, I wrote an entry about this topic, because I'd started putting in some blocks against crawlers, including things that claimed to be old versions of browsers, and I'd also started rate-limiting syndication feed fetching. Unfortunately, my rules at the time were flawed, rejecting a lot of people that I actually wanted to accept. So here are some revised numbers from today, a day when my logs suggest that I've seen what I'd call broadly typical traffic and traffic levels.

I'll start with the overall numbers (for HTTP status codes) for all requests:

  10592 403		[26.6%]
   9872 304		[24.8%]
   9388 429		[23.6%]
   8037 200		[20.2%]
   1629 302		[ 4.1%]
    114 301
     47 404
      2 400
      2 206

This is a much more balanced picture of activity than the last time around, with a lot less of the overall traffic being HTTP 403s. The HTTP 403s are from aggressive blocks, the HTTP 304s and HTTP 429s are mostly from syndication feed fetchers, and the HTTP 302s are mostly from things with various flaws that I redirect to informative static pages instead of giving HTTP 403s. The two HTTP 206s were from Facebook's 'externalhit' agent on a recent entry. A disturbing number of the HTTP 403s were from Bing's crawler, and almost 500 of them were from something claiming to be an Akkoma Fediverse server. 8.5% of the HTTP 403s were from something using Go's default User-Agent string.
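
(As an aside, breakdowns like this are easy to pull out of an access log. Here's a rough Python sketch of the idea, assuming an Apache-style 'combined' log format; it's an illustration, not the actual tooling behind these numbers.)

  import collections
  import sys

  # Count HTTP status codes in a 'combined' format access log and print
  # them with percentages, roughly like the table above. The log format
  # is an assumption, not necessarily what this server uses.
  counts = collections.Counter()
  with open(sys.argv[1]) as logf:
      for line in logf:
          # In combined format the status code is the first field after
          # the quoted request: ... "GET /blog/ HTTP/1.1" 403 ...
          try:
              status = line.split('"')[2].split()[0]
          except IndexError:
              continue
          counts[status] += 1

  total = sum(counts.values())
  for status, n in counts.most_common():
      print(f"{n:8d} {status}\t[{100 * n / total:4.1f}%]")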

The most popular User-Agent strings today for successful requests (of anything) were for versions of NetNewsWire, FreshRSS, and Miniflux, then Googlebot and Applebot, and then Chrome 130 on 'Windows NT 10'. Although I haven't checked, I assume that all of the first three were for syndication feeds specifically, with few or no fetches of other things. Meanwhile, Googlebot and Applebot can only fetch regular pages; they're blocked from syndication feeds.
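
(The shape of such a feed block is simple enough; here's a minimal Python sketch of the idea. The feed path prefixes and the list of blocked crawlers are illustrative assumptions, not my actual rules.)

  # Block specific crawlers from syndication feed URLs while still
  # letting them fetch regular pages. The paths and User-Agent
  # substrings here are made-up examples.
  FEED_PREFIXES = ("/blog/atom/", "/blog/rss/")      # hypothetical feed URLs
  BLOCKED_FEED_CRAWLERS = ("Googlebot", "Applebot")

  def feed_fetch_allowed(path: str, user_agent: str) -> bool:
      """Return False if this request should get an HTTP 403."""
      if not path.startswith(FEED_PREFIXES):
          return True   # not a feed; crawlers may fetch regular pages
      return not any(bot in user_agent for bot in BLOCKED_FEED_CRAWLERS)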

The picture for syndication feeds looks like this:

   9923 304		[42%]
   9535 429		[40%]
   1984 403		[ 8.5%]
   1600 200		[ 6.8%]
    301 302
     34 301
      1 404

On the one hand it's nice that 42% of syndication feed fetches successfully did a conditional GET. On the other hand, it's not nice that 40% of them got rate-limited, or that there were clearly more explicitly blocked requests than there were HTTP 200 responses. On the sort of good side, 37% of the blocked feed fetches were from one IP that's using "Go-http-client/1.1" as its User-Agent (and which accounts for 80% of the blocks of that User-Agent). This time around, about 58% of the requests were for my syndication feed, which is better than it was before but still not great.
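
(For illustration, the server side of 'conditional GET plus rate limiting' can be sketched like this in Python. The minimum interval and the per-IP bookkeeping are assumptions for the sketch, not my actual implementation.)

  import time

  MIN_FETCH_INTERVAL = 45 * 60   # assumed: 45 minutes between fetches
  last_fetch = {}                # client IP -> time of last answered fetch

  def feed_response(ip, if_none_match, feed_etag, feed_body):
      """Return (status, headers, body) for a feed request."""
      now = time.time()
      since = now - last_fetch.get(ip, 0)
      if since < MIN_FETCH_INTERVAL:
          # Too soon since the last fetch: rate-limit, and say when
          # it's polite to come back.
          retry = int(MIN_FETCH_INTERVAL - since)
          return 429, {"Retry-After": str(retry)}, b""
      last_fetch[ip] = now
      if if_none_match == feed_etag:
          # The client already has this version; the conditional GET
          # succeeds with no body.
          return 304, {"ETag": feed_etag}, b""
      return 200, {"ETag": feed_etag}, feed_body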

These days, if certain problems are detected in a request, I redirect the request to a static page about the problem. This gives me some indication of how often these issues are detected, although crawlers may be re-visiting the pages on their own (I can't tell). Today's breakdown of this is roughly:

   78%  too-old browser
   13%  too generic a User-Agent
    9%  unexpectedly using HTTP/1.0

There were slightly more HTTP 302 responses from requests to here than there were requests for these static pages, so I suspect that not everything that gets these redirects actually follows them (or at least that some things don't bother re-fetching the static page every time).
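
(For the curious, the detection side of this is conceptually something like the following Python sketch. The version cutoff, the User-Agent strings, and the static page URLs are all made-up examples, not my real checks.)

  import re

  def problem_redirect(user_agent, http_version):
      """Return the URL of a static explanation page to redirect to
      with an HTTP 302, or None if the request looks fine."""
      # 'Too old' browsers, e.g. a Chrome major version far in the past.
      m = re.search(r"Chrome/(\d+)", user_agent)
      if m and int(m.group(1)) < 100:            # assumed cutoff
          return "/static/too-old-browser.html"
      # Too-generic User-Agents, such as bare HTTP library defaults.
      if user_agent in ("Go-http-client/1.1", "python-requests", ""):
          return "/static/too-generic-ua.html"
      # Real modern browsers no longer speak HTTP/1.0.
      if http_version == "HTTP/1.0":
          return "/static/http10.html"
      return None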

I hope that the better balance in HTTP status codes here is a sign that I have my blocks in a better state than I did a couple of months ago. It would be even better if the bad crawlers would go away, but there's little sign of that happening any time soon.


Comments on this page:

As a miniflux user, I observed an interesting situation that you might want to consider in your block rules. Upon adding a new feed to miniflux, two requests are made. One is a “discovery” request of sorts and the other is fetching the feed. These happen in quick succession, and the only way to avoid them, and subsequently your rate limiting, is through the miniflux API. There seem to have been past issues reported to miniflux about this that were resolved (https://github.com/miniflux/v2/issues/2128), but the current situation seems like a valid use which you are blocking.

By cks at 2025-05-03 12:54:44:

My view on any feed reader making multiple feed requests when setting up a feed is that this is a clear feed reader client bug. Feed readers should never, ever repeatedly request a feed in a short time, and if they do, an HTTP 429 is an entirely reasonable response. And while Miniflux isn't the worst offender, it's clear from my logs that it doesn't seem to notice (or respect) the feed fetching timing information exposed in Cache-Control headers.

Given all of this, I'm not going to (significantly) complicate my server side code to try to compensate for Miniflux's registration bug, especially since I clearly need to keep rate-limiting Miniflux in general.
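
(For what it's worth, honoring this doesn't take much. A feed reader could do something like the following Python sketch, which is simplified and obviously not Miniflux's actual code.)

  import re
  import time

  def next_poll_time(cache_control, floor=1800):
      """Return the earliest polite wall-clock time to re-fetch a feed,
      based on the Cache-Control header of the last response. 'floor'
      is an assumed minimum interval between polls."""
      max_age = 0
      if cache_control:
          m = re.search(r"max-age=(\d+)", cache_control)
          if m:
              max_age = int(m.group(1))
      return time.time() + max(max_age, floor)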

  It doesn't seem to notice (or respect) the feed fetching timing information exposed in Cache-Control headers.

Does anything?

By Simon at 2025-05-06 18:35:19:
  10592 403             [26.6%]

As a Tor user, for a while now I have often gotten a 403 when accessing your blog (based on gut feeling, maybe 50% of the time).
