Wandering Thoughts archives

2013-01-06

24 hours of Atom feed requests here

Because I'm interested in this sort of thing, I decided to generate some statistics on 24 hours of Atom syndication feed requests for Wandering Thoughts. Mostly I'm going to report the relatively raw numbers, although later I'll probably do a detailed analysis of one aspect.

The big numbers:

  • 2,671 HTTP requests, or one every 32 seconds if distributed evenly.

    (They seem to have been reasonably evenly distributed through the day, although there are some shifts between hours; the peak hour was between 6am and 7am Eastern.)

  • Requests were made for 95 different feed URLs. Mostly this shows the hazards of excessive generality; there are a lot of web crawlers that request Atom feeds for useless virtual directories and so on.

  • The most popular feed is the main blog feed (2,010 requests), very distantly followed by the python category (185 requests), the tech category (72 requests), and so on.
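For anyone who wants to pull the same sort of big numbers out of their own logs, here is a minimal sketch. It assumes an Apache-style common or combined log format and a hypothetical convention that feed URLs end in `?atom`; neither detail is stated above, so adjust both to your own setup. The names `LOG_RE`, `is_feed_url` and `summarize_feed_requests` are made up for illustration.

```python
import re
from collections import Counter

# Matches the start of an Apache-style log line, for example:
#   1.2.3.4 - - [06/Jan/2013:06:15:02 -0500] "GET /blog/?atom HTTP/1.1" 200 5120
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def is_feed_url(url):
    # Assumption: feed requests are marked by a trailing "?atom".
    return url.endswith("?atom")


def summarize_feed_requests(lines, period_seconds=24 * 60 * 60):
    """Count feed requests, distinct feed URLs, and the most requested feeds."""
    per_url = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m and is_feed_url(m.group("url")):
            per_url[m.group("url")] += 1
    total = sum(per_url.values())
    return {
        "total": total,
        "distinct_urls": len(per_url),
        # Average spacing if the requests were spread evenly over the period.
        "seconds_per_request": period_seconds / total if total else None,
        "top": per_url.most_common(3),
    }
```

With the figures above, the even-spacing number is just 86,400 seconds divided by 2,671 requests, which is a bit over 32 seconds.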

The rest of this analysis is going to focus on the 2,010 requests for the main blog feed so that I don't have to worry about the effects of random crawler requests for random feeds.

Almost all of those requests were GET requests; 1,986 GETs to 24 HEADs. Every HEAD request was from Google Producer. Given the documentation for it, I have no idea why it's issuing HEAD requests for my actual Atom feed (but good luck getting anyone at Google to explain).

HTTP/1.1 was by far the dominant HTTP version; there were 1,528 HTTP/1.1 requests to 482 HTTP/1.0 requests.

Broken down by HTTP response codes, there were 1,312 304's (content not modified) to 646 200's and 52 403 permission denials. All of the 403's were for what my code identified as bad web robots (which are not supposed to crawl my Atom feeds), and most of them (48) were from a single IP address (178.63.170.37) with a bad user-agent.

Those 2,010 requests came from only 200 different IP addresses, although no single IP was a huge traffic source. The most active single IP made 133 requests, and 85 IP addresses made only one request. 111 different IP addresses made requests that got 304 Not Modified responses, of which 35 were IP addresses that made only one request. This is not really great: 115 IPs made multiple requests and only 76 of them ever got a 304, which implies that about a third of the IPs that made multiple requests probably don't implement proper conditional GET support. I'm sad to see that the most prolific source of requests also didn't seem to support conditional GET; all of its requests got status 200 responses.
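The per-IP reasoning here can be sketched in a few lines. This is only an illustration under assumptions: the input is taken to be `(ip, status)` pairs already extracted from the log for the main feed, and `conditional_get_report` is a made-up name. Never getting a 304 is suggestive rather than conclusive; for instance, an IP whose requests were all refused with 403s would also show up as a repeat requester without a 304.

```python
from collections import defaultdict


def conditional_get_report(records):
    """records: iterable of (ip, status) pairs for a single feed URL."""
    statuses_by_ip = defaultdict(list)
    for ip, status in records:
        statuses_by_ip[ip].append(status)

    single = {ip for ip, seen in statuses_by_ip.items() if len(seen) == 1}
    multi = set(statuses_by_ip) - single
    got_304 = {ip for ip, seen in statuses_by_ip.items() if 304 in seen}
    # Repeat requesters that never received a 304 are the suspects.
    suspects = multi - got_304
    return {
        "ips": len(statuses_by_ip),
        "single_request_ips": len(single),
        "multi_request_ips": len(multi),
        "multi_without_304": sorted(suspects),
    }


# The arithmetic behind "about a third", using the figures quoted above.
multi_request_ips = 200 - 85      # IPs that made more than one request
multi_with_304 = 111 - 35         # of those, IPs that saw at least one 304
multi_without_304 = multi_request_ips - multi_with_304
fraction = multi_without_304 / multi_request_ips
```

Here `multi_without_304` works out to 39 of 115 IPs, which is where "about a third" comes from.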

(Most of the other active sources seem to have gotten plenty of status 304 responses, which is what should happen; if you're going to poll an Atom feed frequently, you should implement support for conditional GET.)
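For feed fetcher authors, a minimal sketch of conditional GET using only Python's standard library follows. It assumes the server sends an `ETag` and/or `Last-Modified` header with its 200 responses; the helper names `conditional_headers` and `fetch_feed` are made up for illustration, and a real fetcher would also want to persist the validators between runs.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers from the validators of a previous 200 response."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_feed(url, etag=None, last_modified=None, timeout=30):
    """Return (body, etag, last_modified); body is None if the feed is unchanged."""
    req = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return (
                resp.read(),
                resp.headers.get("ETag"),
                resp.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # urllib reports 304 as an HTTPError; keep the cached copy
            # and the validators we already had.
            return None, etag, last_modified
        raise
```

On the first poll you call `fetch_feed(url)` and save the returned validators; on every later poll you pass them back in, and a `None` body means there is nothing new to process.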

Of the 646 requests that got status 200 responses with content, 558 of them were clearly from clients that support HTTP compression while 64 were not (I can tell by the response sizes). These no-compression requests came from 15 different IPs, several of which made multiple requests. A number of these sources appear to be disguised web spiders.

(I am pretty certain that anything that claims to be running 64-bit Ubuntu with Firefox 8.0 is kind of shading the truth. A lot.)
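Going back to telling compressed from uncompressed responses by their sizes, one rough way to do it is sketched below. This is an assumption-laden illustration, not the method used here: it presumes you have the uncompressed feed bytes as served at the time (a hypothetical input, since the feed changes as entries are posted), that the server compresses with gzip, and that a 10% tolerance is loose enough. `classify_by_size` is a made-up name.

```python
import gzip


def classify_by_size(logged_size, feed_bytes, tolerance=0.10):
    """Guess whether a 200 response of logged_size bytes was gzip-compressed."""
    plain = len(feed_bytes)
    packed = len(gzip.compress(feed_bytes))
    if abs(logged_size - packed) <= tolerance * packed:
        return "compressed"
    if abs(logged_size - plain) <= tolerance * plain:
        return "uncompressed"
    # Neither size fits, e.g. a bodyless response or a different feed version.
    return "unknown"
```

Since Atom feeds are repetitive XML, the two sizes are usually far enough apart that the guess is unambiguous.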

In other trivia, it appears that the most popular User-Agent value for people pulling the feed that day was 'user-agent' (sic). I think the sources using that value are actually probably legitimate, at least for that day. They didn't do conditional GET, though, which makes them somewhat annoying. (Both of the IP addresses using it did do HTTP compression.)

web/Atom24Hours written at 03:46:11

