Wandering Thoughts archives

2006-12-27

HTTP as it is seen in the wild

Out of a somewhat idle curiosity, I decided to do up some numbers for actual HTTP requests against one of the servers here. All of this is using the past 28 days of old logs (plus today's):

289160 total requests
277323 GET
  5722 PROPFIND
  3665 OPTIONS
  2215 POST
   178 HEAD
    39 CONNECT
    18 garbled

Most of the requests were successful; 90% got a 2xx or a 3xx response. Of the 256,392 successful GETs, 55,246 (21%) were successful conditional GETs; I'm not sure whether to consider this good or bad.
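
(For the record, a 'successful conditional GET' here means the client sent a cache validator and got back a bodyless 304. Here is a minimal sketch of that exchange using Python's http.client; the hostname, path, and date are made up:)

    import http.client

    # The client revalidates its cached copy; if the page hasn't changed,
    # the server answers 304 Not Modified with no body.
    conn = http.client.HTTPConnection("example.org")
    conn.request("GET", "/somepage",
                 headers={"If-Modified-Since": "Sun, 24 Dec 2006 00:00:00 GMT"})
    resp = conn.getresponse()
    print(resp.status)    # 304 when the cached copy is still good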

(Unfortunately I don't have enough information to find out how many requests were willing to accept gzip'd results.)
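
(For the curious, here is a minimal sketch of the sort of tallying involved, assuming an Apache combined-format access log; the file name and field layout are assumptions, not what actually produced the numbers above.)

    import re
    from collections import Counter

    # host ident user [date] "METHOD URL PROTOCOL" status bytes ...
    logline = re.compile(r'^(\S+) \S+ \S+ \[[^]]+\] "([^"]*)" (\d{3}) ')

    methods = Counter()
    versions = Counter()
    conditional_gets = 0        # successful conditional GETs come back as 304s

    with open("access_log") as f:
        for line in f:
            m = logline.match(line)
            if not m:
                methods["garbled"] += 1
                continue
            request, status = m.group(2), m.group(3)
            parts = request.split()
            if len(parts) != 3:
                methods["garbled"] += 1
                continue
            method, url, proto = parts
            methods[method] += 1
            versions[proto] += 1        # e.g. HTTP/1.0 versus HTTP/1.1
            if method == "GET" and status == "304":
                conditional_gets += 1

    for method, count in methods.most_common():
        print("%6d %s" % (count, method))
    print("%d successful conditional GETs" % conditional_gets)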

The popularity of PROPFIND and OPTIONS surprised me, but almost all of them turn out to be from just three external IPs, with the lion's share coming from just one. Most of the OPTIONS requests were to /, and most of the PROPFIND requests were to the (nonexistent) /LJF4100, so I suspect that someone's machine is badly misconfigured.

The majority of the HEAD requests were for /, with my Atom syndication feed being the somewhat distant runner-up. Requests came from all over with nothing clearly dominating the results.

(From this I conclude that optimizing HEAD is not really a high priority, which is good because DWiki doesn't optimize it.)

HTTP/1.0 dominated over HTTP/1.1, about 67% to 33%; no one is still making pre-HTTP/1.0 requests. (Apart from our very primitive monitoring system, which I am ignoring for this.)

A small number of apparently legitimate people made requests with full 'http://...' URLs (theoretically only usable against proxies; 396 requests in total). To my surprise, a full third of them used HTTP/1.0; the rest used HTTP/1.1.
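
(The difference is in the request line itself: an ordinary request asks for just a path, while a proxy-style request carries the full URL. A tiny illustration, with made-up request lines:)

    import re

    requests = [
        "GET /dwiki/ HTTP/1.0",                    # ordinary path-only request
        "GET http://example.org/dwiki/ HTTP/1.1",  # proxy-style full URL
    ]

    absolute = re.compile(r"^[A-Z]+ [a-z][a-z0-9+.-]*://", re.IGNORECASE)
    for req in requests:
        kind = "proxy-style" if absolute.match(req) else "ordinary"
        print(kind, "->", req)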

Requests came from 11,745 different IP addresses. The average number of requests per IP was 24.6, but the median was only 3 (and the mode was 1 request, which does not surprise me). A surprisingly large number of the IPs that made only one request asked for robots.txt (although it was not the most popular such request). As usual, the most active visitor was our internal search engine.
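
(Again, a minimal sketch of how you might get these per-IP figures, under the same assumptions about the log file as above.)

    from collections import Counter

    per_ip = Counter()
    with open("access_log") as f:
        for line in f:
            fields = line.split()
            if fields:
                per_ip[fields[0]] += 1    # client IP is the first field

    counts = sorted(per_ip.values())
    n = len(counts)
    mean = sum(counts) / float(n)
    median = counts[n // 2]               # upper middle for even n; close enough here
    mode = Counter(counts).most_common(1)[0][0]
    print("%d IPs, mean %.1f, median %d, mode %d" % (n, mean, median, mode))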

Sidebar: POST targets

This server (currently) hosts CSpace (and thus WanderingThoughts), which is what the majority of the POST requests were directed against (1,299 out of the 2,215; I get a fair number of comment spam attempts). A small number of the remainder (126) were legitimate; the rest were bad in various ways, ranging from repeatedly poking nonexistent URLs to various XML RPC exploit attempts (and one mysterious POST to /).

The most popular POST target was the nonexistent URL path /officescan/cgi/cgiRecvFile.exe, followed by my Recent Comments page.

Sidebar: the breakdown of responses

Distribution of HTTP response codes:

201807 2xx
       199273 200
         2534 206
 59814 3xx
        55246 304
         4106 301
          459 302
 27506 4xx
        13494 404
         8018 403
         5756 405
          234 400
            2 401
            1 416
            1 414
    30 5xx
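
(One way to produce a rollup like this, under the same log format assumptions as before, is to tally the status field and then group codes by their first digit:)

    from collections import Counter

    codes = Counter()
    with open("access_log") as f:
        for line in f:
            parts = line.split('"')    # the status field follows the quoted request
            if len(parts) < 3:
                continue
            rest = parts[2].split()
            if rest and rest[0].isdigit():
                codes[rest[0]] += 1

    classes = Counter()
    for code, count in codes.items():
        classes[code[0] + "xx"] += count

    for cls in sorted(classes):
        print("%6d %s" % (classes[cls], cls))
        for code, count in sorted(codes.items()):
            if code.startswith(cls[0]):
                print("        %6d %s" % (count, code))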

Some of the 404'd URLs are fairly popular, but I'm not going to try to read the tea leaves about that.

HTTPInTheWild written at 00:32:24

2006-12-24

What Google Sitemaps isn't

The Google Sitemaps XML format has a somewhat underdocumented <priority> field, which is described as:

The priority of this URL relative to other URLs on your site. [...]
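
(For concreteness, this is roughly what the field looks like in a sitemap; the URLs and values below are made up, and I'm generating the XML with Python just to keep the example short. Priorities range from 0.0 to 1.0, with 0.5 as the default.)

    import xml.etree.ElementTree as ET

    NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=NS)

    # Made-up example: an individual entry given a higher <priority> than
    # the index page that repeats it.
    for loc, prio in (("http://example.org/blog/SomeEntry", "1.0"),
                      ("http://example.org/blog/", "0.3")):
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "priority").text = prio

    print(ET.tostring(urlset).decode())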

The Google documentation is somewhat imprecise here, as there are at least two meanings for 'priority' and they don't really say which one they mean. The first sort of priority is 'which pages do I want crawled first'; the second sort of priority is 'which pages (within my site) do I want ranked first in search results'.

To cut to the chase: Google Sitemaps <priority> is not the relative priority in search results (the second sort of priority). It only seems to influence how Google crawls your site (the first sort of priority), which probably doesn't really matter unless you have a very large site.

This is disappointing, because when Google Sitemaps was first announced I was really hoping it would help me deal with a perpetual problem: I want my individual blog entries ranked higher than my index pages on search results.

The problem is that, unlike normal sites, blogs have a lot of duplicate content, since various sorts of index pages repeat individual entries wholesale. This means a Google search can match the same entry at multiple URLs, which Google has to rank somehow, and you would like the URL for the entry itself to rank highest: it's the most stable (there is no guarantee that an index page will still have the same entries as when Google crawled it), and it has the fewest distractions to obscure what the user is looking for (on an index page they have to find the right entry).

It would be nice if there were a way of telling Google about this, short of telling it not to index your index pages (which I am leery of). Maybe there is, but if so it is not the Sitemaps <priority> field.

(Interestingly, Google seems to relatively consistently get this right for some places, such as LiveJournal. I can't help suspecting that they have special tuning for well-known blog sites and blogging packages.)

Of course this hardly matters right now, as Google has been unhappy with CSpace's sitemap for some time now for some mysterious reason. (Yes, I've validated it.)

GoogleSitemapsIsnt written at 16:01:36

2006-12-18

A basic principle of website security

In theory I shouldn't have to say this, but in practice I probably do. One of the most basic principles of designing secure websites is simply this:

Never trust anything you get from the network.

Everything you get from the network is under the control of a sufficiently determined attacker, no matter how it is 'supposed' to be generated. Every request, every form POST, every cookie, and every AJAX callback. No amount of obfuscation can do more than slow an attacker down.
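
(A small, made-up illustration of what this means in practice for, say, a comment-posting handler; the field names and limits are invented, but the point is that everything gets checked server-side, because a forged POST is just as easy to send as a real one.)

    import re

    PAGE_RE = re.compile(r"^[A-Za-z0-9/_.-]{1,200}$")

    def validate_comment_post(form):
        """Return (page, comment) or raise ValueError; trust nothing."""
        page = form.get("page", "")
        if not PAGE_RE.match(page) or ".." in page:
            raise ValueError("bad page name")     # even 'hidden' fields can be forged
        comment = form.get("comment", "")
        if not (0 < len(comment) <= 4096):
            raise ValueError("bad comment body")
        return page, comment

    print(validate_comment_post({"page": "blog/SomeEntry", "comment": "hi"}))
    try:
        validate_comment_post({"page": "../../etc/passwd", "comment": "x"})
    except ValueError as err:
        print("rejected:", err)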

(In fact, obfuscation and attempts to hide things are a useful signpost to would-be attackers of where to look closely, a lesson I believe I learned from Harry Harrison's Stainless Steel Rat.)

As a bonus to not trusting network input, you'll gain resilience against the various badly coded crawlers and web browsers that send you crazy things from time to time.

(This grump was sparked by reading this (from Slashdot), which beats around the fundamental bush a bit too much for my taste. I suppose this is what I get for following a Slashdot link.)

BasicWebsiteSecurity written at 14:18:20

2006-12-17

How to get me to block your web ads in a flash

Tim Bray:

The animation in Web display ads is outta control, outta control, I tell ya!

What he said (except that it's been going on for years). The fastest and best way to get me to really kick your ads to the curb has always been to make them blink and get in my face, and it amazes me that anyone has ever thought such ads were a good idea.

(Huge, page-disrupting and modem-saturating ads don't help, but they are not as scream and leap as frenzied animation is.)

It also amazes me that people are willing to run ads on their pages that take deliberate steps to be more attention-grabbing than the content. Apart from people who are only in it for the drive-by ad revenue from suckers, it seems self-defeating for any effort to build a long-term audience: you usually have only a relatively brief chance to hook a first-time visitor and persuade them to come back later, and what is presumably going to hook them is your content, not the collection of blinking stuff, so you want your content to be what their attention naturally settles on first.

(It's easy to see why blinking things are attractive to ad people, since it is well established that people reflexively pay attention to apparent motion and change. But this doesn't mean that using this low-level hook is a good idea or is likely to accomplish your actual goals, as most people who've tried using the <blink> tag can probably testify.)

Sidebar: what I use to get rid of ads

I've used the junkbuster filtering HTTP proxy since at least 1997 or so. Although it has limitations (like no HTTP/1.1 support), I still prefer it to in-browser solutions like Firefox's AdBlock, partly because I find its simple plain text configuration easier to manipulate. (And of course when I started using it, a filtering proxy was the only real option.)

I've never bothered upgrading to privoxy, the current line of junkbuster development, partly because junkbuster works fine for me so I have very little motivation to change. I did wind up trying privoxy out recently (due to people starting to stuff ads into RSS feeds), and my general view is that boy does it seem to have a complicated configuration system.

AnnoyingWebAds written at 22:55:09

2006-12-10

An irony of web serving

One of the small paradoxes of the web is that it is often the connections with the least bandwidth that put the largest load on your web server.

This is because each connection consumes a certain amount of server resources, ranging from kernel data structures for socket buffers up to an entire thread or process on a dynamic website. The slower someone's connection, the longer they tie up this stuff on your end as you slowly feed them data. Conversely, people on fast connections get in, get their data, and get out fast, letting you release those resources.

Among other little effects, this means that it's not enough for load testing to just pick a connections-per-second rate that you should be able to handle. Ten new connections a second where each client takes a tenth of a second to download its content is rather easier to deal with than ten connections a second where each client takes ten seconds. (And if you test across a local LAN you are far more likely to get the former than the latter.)
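
(The back-of-the-envelope arithmetic here is just Little's law: the number of connections you have open at once is roughly the arrival rate times how long each connection is held.)

    # concurrent connections ~= new connections per second * seconds each is held
    def concurrent(rate_per_sec, seconds_held):
        return rate_per_sec * seconds_held

    print(concurrent(10, 0.1))   # fast clients: about 1 connection open at a time
    print(concurrent(10, 10))    # slow clients: about 100 connections open at once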

There are some ways around parts of this effect:

  • web servers based around asynchronous IO generally have far lower per-connection overhead, which is one reason they're so popular.

  • reverse proxy web servers (including, in a way, Apache running CGI programs) offer you a way of rapidly sucking the content out of your high-overhead dynamic website system and parking it in a low-overhead frontend web server while it trickles out to the slow clients, instead of having the slow clients hold down an expensive connection directly to the dynamic website bits (a rough sketch of this buffering idea follows the list).

    (This only works well if your generated content is small enough to get sucked completely into the frontend, but this is the usual case.)

  • some websites just disconnect clients after a certain amount of time, whether or not they are still transferring data. This is most popular for bulk downloads, where it's cheap for the server to start again if (or when) the client reconnects to resume the transfer.
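
(Here is a rough sketch of the buffering idea from the reverse proxy point above, written with Python's asyncio; the addresses and ports are made up, and a real frontend would also have to handle keep-alive, Content-Length, errors, and so on.)

    import asyncio

    BACKEND = ("127.0.0.1", 8000)     # assumed address of the expensive dynamic backend

    async def handle_client(reader, writer):
        # Read just the request headers; good enough for a sketch.
        request = await reader.readuntil(b"\r\n\r\n")

        # Talk to the backend at full speed and buffer its whole response.
        b_reader, b_writer = await asyncio.open_connection(*BACKEND)
        b_writer.write(request)
        await b_writer.drain()
        response = await b_reader.read()      # read to end of response
        b_writer.close()                      # the expensive backend is now free

        # Trickle the buffered response out to the (possibly slow) client;
        # only this cheap coroutine and an in-memory buffer stay tied up.
        writer.write(response)
        await writer.drain()
        writer.close()

    async def main():
        server = await asyncio.start_server(handle_client, "0.0.0.0", 8080)
        async with server:
            await server.serve_forever()

    if __name__ == "__main__":
        asyncio.run(main())
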
ConnectionSpeedLoad written at 22:20:00

