The HTTP status codes of responses from about 22 hours of traffic to here (part 2)
A few months ago, I wrote an entry about this topic, because I'd started putting in some blocks against crawlers, including things that claimed to be old versions of browsers, and I'd also started rate-limiting syndication feed fetching. Unfortunately, my rules at the time were flawed, rejecting a lot of people that I actually wanted to accept. So here are some revised numbers from today, a day when my logs suggest that I've seen what I'd call broadly typical traffic and traffic levels.
I'll start with the overall numbers (for HTTP status codes) for all requests:
  10592  403  [26.6%]
   9872  304  [24.8%]
   9388  429  [23.6%]
   8037  200  [20.2%]
   1629  302  [ 4.1%]
    114  301
     47  404
      2  400
      2  206
This is a much more balanced picture of activity than the last time around, with much less of the overall traffic being HTTP 403s. The HTTP 403s are from aggressive blocks, the HTTP 304s and HTTP 429s are mostly from syndication feed fetchers, and the HTTP 302s are mostly from things with various flaws that I redirect to informative static pages instead of giving HTTP 403s. The two HTTP 206s were from Facebook's 'externalhit' agent on a recent entry. A disturbing number of the HTTP 403s were from Bing's crawler, and almost 500 of them were from something claiming to be an Akkoma Fediverse server. 8.5% of the HTTP 403s were from something using Go's default User-Agent string.
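For the curious, status-code breakdowns like the ones above can be tallied straight out of a web server access log. Here's a minimal sketch that assumes Apache's combined log format; the sample log lines and their contents are made up for illustration, not taken from my actual logs.

```python
from collections import Counter

def status_counts(log_lines):
    """Count HTTP status codes from combined-format access log lines."""
    counts = Counter()
    for line in log_lines:
        # In combined log format, the status code is the first field
        # after the closing quote of the request line, so splitting on
        # double quotes is more robust than a fixed field index.
        fields = line.split('"')
        if len(fields) < 3:
            continue
        after_request = fields[2].split()
        if after_request and after_request[0].isdigit():
            counts[after_request[0]] += 1
    return counts

# Illustrative sample lines (not real traffic).
sample = [
    '1.2.3.4 - - [12/Nov/2024:00:00:01 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '5.6.7.8 - - [12/Nov/2024:00:00:02 +0000] "GET /blog/feed HTTP/1.1" 304 0 "-" "FreshRSS/1.24"',
]

counts = status_counts(sample)
total = sum(counts.values())
for code, n in counts.most_common():
    print(f"{n:6d}  {code}  [{100 * n / total:4.1f}%]")
```

The same counting can of course be done with awk in one line; the Python version just makes it easier to bolt on the extra slicing (by User-Agent, by URL) that the rest of this entry relies on.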
The most popular User-Agent strings today for successful requests (of anything) were for versions of NetNewsWire, FreshRSS, and Miniflux, then Googlebot and Applebot, and then Chrome 130 on 'Windows NT 10'. Although I haven't checked, I assume that all of the first three were for syndication feeds specifically, with few or no fetches of other things. Meanwhile, Googlebot and Applebot can only fetch regular pages; they're blocked from syndication feeds.
The picture for syndication feeds looks like this:
   9923  304  [42%]
   9535  429  [40%]
   1984  403  [ 8.5%]
   1600  200  [ 6.8%]
    301  302
     34  301
      1  404
On the one hand, it's nice that 42% of syndication feed fetches successfully did a conditional GET. On the other hand, it's not nice that 40% of them got rate-limited, or that there were clearly more explicitly blocked requests than there were HTTP 200 responses. On the sort of good side, 37% of the blocked feed fetches were from one IP that's using "Go-http-client/1.1" as its User-Agent (and which accounts for 80% of the blocks of that User-Agent). This time around, about 58% of the requests were for my syndication feed, which is better than it was before but still not great.
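A conditional GET is what lets a feed reader earn those HTTP 304s: it sends back the validators (ETag and Last-Modified) from its previous fetch, and the server answers 304 Not Modified with no body if the feed hasn't changed. Here's a minimal sketch using only Python's standard library; the feed URL is a placeholder, not this blog's actual feed location.

```python
import urllib.request

def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET request that lets the server answer 304 Not Modified."""
    req = urllib.request.Request(url)
    if etag:
        # Echo back the ETag from the previous response.
        req.add_header("If-None-Match", etag)
    if last_modified:
        # Echo back the Last-Modified date from the previous response.
        req.add_header("If-Modified-Since", last_modified)
    return req

req = build_conditional_request(
    "https://example.org/blog/feed",  # placeholder URL, not my real feed
    etag='"abc123"',
    last_modified="Mon, 11 Nov 2024 08:00:00 GMT",
)
```

When this request is actually sent with `urllib.request.urlopen()`, a 304 surfaces as an `urllib.error.HTTPError` with `code == 304`, which the client should treat as "nothing new, keep your cached copy and your saved validators". Well-behaved feed readers also back off when they see an HTTP 429.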
These days, if certain problems are detected in a request I redirect the request to a static page about the problem. This gives me some indication of how often these issues are detected, although crawlers may be re-visiting the pages on their own (I can't tell). Today's breakdown of this is roughly:
   78%  too-old browser
   13%  too generic a User-Agent
    9%  unexpectedly using HTTP/1.0
There were slightly more HTTP 302 responses from requests to here than there were requests for these static pages, so I suspect that not everything that gets these redirects follows them (or at least doesn't bother re-fetching the static page).
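The triage described above amounts to classifying each request and redirecting the problem cases to a static explanation page instead of handing back a bare HTTP 403. Here's a rough sketch of the shape of that logic; the specific User-Agent checks, thresholds, and page paths are all illustrative guesses on my part, not the actual rules in use here.

```python
# Hypothetical static pages explaining each detected problem.
PROBLEM_PAGES = {
    "old-browser": "/notes/old-browser.html",
    "generic-ua": "/notes/generic-user-agent.html",
    "http10": "/notes/http10.html",
}

def classify(user_agent, http_version):
    """Return a problem label for a request, or None if it looks fine."""
    ua = user_agent or ""
    # A bare client-library User-Agent is "too generic" (illustrative list).
    if ua in ("Go-http-client/1.1", "python-requests", "curl"):
        return "generic-ua"
    # Crude "too old a browser" heuristic, purely for illustration.
    if "Chrome/4" in ua or "MSIE " in ua:
        return "old-browser"
    if http_version == "HTTP/1.0":
        return "http10"
    return None

def respond(user_agent, http_version):
    """Return (status, redirect_target) for a request."""
    problem = classify(user_agent, http_version)
    if problem:
        # Redirect to the static page instead of a flat 403, so the
        # operator of the client can find out what went wrong.
        return 302, PROBLEM_PAGES[problem]
    return 200, None
```

One consequence of this design is visible in the numbers: the HTTP 302 counts measure detections, while hits on the static pages measure clients that actually followed the redirect, and the gap between the two is exactly the discrepancy noted above.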
I hope that the better balance in HTTP status codes here is a sign that I have my blocks in a better state than I did a couple of months ago. It would be even better if the bad crawlers would go away, but there's little sign of that happening any time soon.