2013-01-06
24 hours of Atom feed requests here
Because I'm interested in this sort of thing, I decided to generate some statistics on 24 hours of Atom syndication feed requests for Wandering Thoughts. Mostly I'm going to report the relatively raw numbers, although later I'll probably do detailed analysis of one aspect.
The big numbers:
- 2,671 HTTP requests, or one every 32 seconds if distributed evenly.
(They seem to have been reasonably evenly distributed through the day, although there are some shifts between hours; the peak hour was between 6am and 7am Eastern.)
- Requests were made for 95 different feed URLs. Mostly this shows the
hazards of excessive generality; there are a lot of web crawlers that
request Atom feeds for useless virtual directories and so on.
- The most popular feed is the main blog feed (2,010 requests), very distantly followed by the python category (185 requests), the tech category (72 requests), and so on.
The rest of this analysis is going to focus on the 2,010 requests for the main blog feed so that I don't have to worry about the effects of random crawler requests for random feeds.
Almost all of those requests were GET requests; 1,986 GETs to
24 HEADs. Every HEAD request was from Google Producer. Given the
documentation at that link, I have no idea why it's issuing HEAD
requests for my actual Atom feed (but good luck getting anyone
in Google to explain).
HTTP/1.1 was by far the dominant HTTP protocol version; there were 1,528 HTTP/1.1 requests to 482 HTTP/1.0 requests.
Broken down by HTTP response codes, there were 1,312 304's (content not modified) to 646 200's and 52 403 permission denials. All of the 403's were for what my code identified as bad web robots (which are not supposed to crawl my Atom feeds), and most of them (48) were from a single IP address (178.63.170.37) with a bad user-agent.
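The breakdowns above (by request method, protocol version, and response code) are the sort of thing a few lines of Python can produce. This is only a sketch: it assumes an Apache "combined"-style log line, which the post does not actually specify, and the `/feed/` path, addresses, and user-agent strings in the sample are made up for illustration.

```python
import re
from collections import Counter

# Assumed log layout (Apache "combined" style); the real format is not given.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def tally(lines, path_filter=None):
    """Count requests by method, protocol version, and status code.

    If path_filter is given, only requests for exactly that path count.
    """
    methods, protos, statuses = Counter(), Counter(), Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m is None:
            continue  # skip lines that don't look like requests
        if path_filter is not None and m.group("path") != path_filter:
            continue
        methods[m.group("method")] += 1
        protos[m.group("proto")] += 1
        statuses[int(m.group("status"))] += 1
    return methods, protos, statuses

# Hypothetical sample lines, not real log data.
SAMPLE = [
    '192.0.2.1 - - [06/Jan/2013:06:00:01 -0500] "GET /feed/ HTTP/1.1" 304 - "-" "ExampleReader/1.0"',
    '192.0.2.2 - - [06/Jan/2013:06:00:33 -0500] "GET /feed/ HTTP/1.0" 200 51234 "-" "ExampleReader/2.0"',
    '192.0.2.3 - - [06/Jan/2013:06:01:05 -0500] "HEAD /feed/ HTTP/1.1" 200 - "-" "ExampleBot/0.1"',
    '192.0.2.4 - - [06/Jan/2013:06:01:37 -0500] "GET /python/feed/ HTTP/1.1" 403 199 "-" "BadBot"',
]

if __name__ == "__main__":
    for counter in tally(SAMPLE, path_filter="/feed/"):
        print(counter.most_common())
```

Restricting the tally to one path mirrors the decision to look only at the main blog feed, so that stray crawler requests for other feeds don't skew the counts.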
Those 2,010 requests came from only 200 different IP addresses,
although no single IP was a huge traffic source; the most active
single IP made 133 connections and many made fewer; 85 IP addresses
made only one request. 111 different IP addresses made requests
that got 304 Not Modified responses, of which 35 were IP addresses
that made only one request. This is not really great, since it
implies that about a third of the IPs that made multiple requests
probably don't implement proper conditional GET support. I'm sad
to see that the most prolific source of requests also didn't seem
to support conditional GET; all of its requests got status 200
responses.
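The "about a third" figure follows from the counts in this paragraph. As a quick worked check, using only the numbers given above (the variable names are just labels for this sketch):

```python
total_ips = 200
single_request_ips = 85
ips_with_304 = 111
single_request_ips_with_304 = 35

# IPs that made more than one request.
multi_request_ips = total_ips - single_request_ips

# Of those, the ones that got at least one 304 Not Modified.
multi_with_304 = ips_with_304 - single_request_ips_with_304

# Multi-request IPs that never got a 304 at all.
multi_without_304 = multi_request_ips - multi_with_304

fraction = multi_without_304 / multi_request_ips
print(multi_request_ips, multi_with_304, multi_without_304, round(fraction, 3))
```

That is 39 of the 115 multi-request IPs, or roughly 34%. Never getting a 304 doesn't prove a client lacks conditional GET support (the feed may simply have changed between its polls), which is why the claim is "probably".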
(Most of the other active sources seem to have gotten plenty of
status 304 responses, which is what should happen; if you're going
to poll an Atom feed frequently, you should implement support for
conditional GET.)
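For feed fetchers, conditional GET means sending back the validators (`ETag` and `Last-Modified`) from the previous successful response and treating a 304 as "nothing changed". Here is a minimal client-side sketch using Python's standard `urllib`. The function names and the way state is passed around are assumptions for illustration; this is not how any particular feed reader does it.

```python
import urllib.error
import urllib.request

def conditional_headers(etag=None, last_modified=None):
    """Build request headers from the validators of the last 200 response."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

def fetch_feed(url, etag=None, last_modified=None, timeout=30):
    """Fetch a feed, returning (body, etag, last_modified).

    body is None when the server answers 304 Not Modified, in which case
    the caller should keep using its cached copy and its old validators.
    """
    req = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return (
                resp.read(),
                resp.headers.get("ETag"),
                resp.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # urllib reports 304 as an HTTPError; for us it is the good case.
            return None, etag, last_modified
        raise
```

A poller would store the returned `etag` and `last_modified` alongside its cached copy and pass them back in on the next call; a client that does this gets small 304 responses instead of the full feed every time.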
Of the 646 requests that got status 200 responses with content, 558 clearly came from clients that support HTTP compression while 64 did not (I can tell by the response sizes). (That leaves 24 unaccounted for, which matches the number of HEAD requests; those have no response body to measure.) The no-compression requests came from 15 different IPs, several of which made multiple requests. A number of these sources appear to be disguised web spiders.
(I am pretty certain that anything that claims to be running 64-bit Ubuntu with Firefox 8.0 is kind of shading the truth. A lot.)
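Telling compressed from uncompressed responses by size works because a gzipped Atom feed is much smaller than the raw XML. The post doesn't say exactly how the sizes were compared, so the following is only a sketch of one plausible approach; the `ratio` threshold and the sample feed text are assumptions.

```python
import gzip

def looks_compressed(logged_size, full_size, ratio=0.6):
    """Guess whether a logged response body was sent compressed.

    logged_size: bytes sent, as recorded in the log for a 200 response.
    full_size:   size of the uncompressed feed at that time.
    ratio:       assumed cutoff; XML normally compresses well below this.
    """
    return logged_size < full_size * ratio

if __name__ == "__main__":
    # Hypothetical feed body, only to show the size difference.
    entry = b"<entry><title>Example</title><content>text</content></entry>\n"
    raw = b"<feed>\n" + entry * 200 + b"</feed>\n"
    squeezed = gzip.compress(raw)
    print(len(raw), len(squeezed))
    print(looks_compressed(len(squeezed), len(raw)))
    print(looks_compressed(len(raw), len(raw)))
```

In practice the feed's uncompressed size changes whenever a new entry is published, so a real check would need the feed size at the time of each request.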
In other trivia, it appears that the most popular User-Agent value
for people pulling the feed that day was 'user-agent' (sic). I think
the sources using that value are actually probably legitimate, at
least for that day. They didn't do conditional GET, though, which
makes them somewhat annoying. (Both of the IP addresses involved did
do HTTP compression.)