2006-12-27
HTTP as it is seen in the wild
Out of a somewhat idle curiosity, I decided to do up some numbers for actual HTTP requests against one of the servers here. All of this is using the past 28 days of old logs (plus today's):
289160 total requests
277323 GET
5722 PROPFIND
3665 OPTIONS
2215 POST
178 HEAD
39 CONNECT
18 garbled
Most of the requests were successful; 90% got a 2xx or a 3xx response. Of the 256,392 successful GETs, 55,246 (21%) were successful conditional GETs (answered with a 304); I'm not sure whether to consider this good or bad.
(Unfortunately I don't have enough information to find out how many requests were willing to accept gzip'd results.)
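For illustration, this is roughly what a well-behaved client does to make a conditional, compression-friendly GET. It's only a minimal sketch using Python's standard library; the URL and date are placeholders:

  import urllib.request
  import urllib.error

  url = "http://www.example.com/some/page"    # placeholder URL

  req = urllib.request.Request(url, headers={
      # Ask the server to skip the body if nothing has changed since this time.
      "If-Modified-Since": "Wed, 27 Dec 2006 00:00:00 GMT",
      # Say that we can cope with gzip'd responses.
      "Accept-Encoding": "gzip",
  })

  try:
      with urllib.request.urlopen(req) as resp:
          body = resp.read()    # 200: the full (possibly gzip'd) body
          print(resp.status, len(body), "bytes")
  except urllib.error.HTTPError as err:
      if err.code == 304:
          print("304 Not Modified: our cached copy is still good")
      else:
          raise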
The popularity of PROPFIND and OPTIONS surprised me, but almost all of
them turn out to be from just three external IPs, with the lion's share
coming from just one. Most of the OPTIONS requests were to /, and most
of the PROPFIND requests were to the (nonexistent) /LJF4100, so I
suspect that someone's machine is badly misconfigured.
The majority of the HEAD requests were for /, with my
Atom syndication feed being the somewhat distant runner-up. Requests
came from all over with nothing clearly dominating the results.
(From this I conclude that optimizing HEAD is not really a high priority, which is good because DWiki doesn't.)
HTTP/1.0 dominated HTTP/1.1, about 67% to 33%; no one is still making pre-HTTP/1.0 requests. (Apart from our very primitive monitoring system, which I am ignoring for this.)
A small number of apparently legitimate people made requests with full 'http://...' URLs (theoretically only usable against proxies; 396 requests in total). To my surprise, a full third of them used HTTP/1.0; the rest used HTTP/1.1.
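For reference, the difference between the two is just in the request line; the normal form versus the proxy-style form look like this (hostname and path made up):

  GET /some/page HTTP/1.1
  GET http://www.example.com/some/page HTTP/1.1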
Requests came from 11,745 different IP addresses. The average number
of requests per IP was 24.6, but the median was only 3 (and the mode was 1 request,
which does not surprise me). A surprisingly large number of the IPs that
made only one request asked for robots.txt (although it was not the
most popular such request). As usual, the most active visitor was our
internal search engine.
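If you want to produce this sort of per-IP breakdown yourself, something like the following rough sketch will do it for an Apache-style access log; it assumes the client IP address is the first whitespace-separated field on each line:

  import sys
  from collections import Counter
  import statistics

  # Count requests per client IP; the IP is assumed to be the first field,
  # as in Apache common/combined log format.
  counts = Counter()
  for line in sys.stdin:
      fields = line.split(None, 1)
      if fields:
          counts[fields[0]] += 1

  per_ip = sorted(counts.values())
  print("distinct IPs:", len(counts))
  print("mean requests/IP: %.1f" % statistics.mean(per_ip))
  print("median requests/IP:", statistics.median(per_ip))
  print("most common request count:", Counter(per_ip).most_common(1)[0][0])

Feed it the logs on standard input, eg 'python3 reqstats.py < access_log' (the script name is made up).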
Sidebar: POST targets
This server (currently) hosts CSpace (and thus WanderingThoughts), which
is what the majority of the POST requests were directed against (1,299
out of the 2,215; I get a fair number of comment spam attempts). A small
number of the remainder (126) were legitimate; the rest were bad in
various ways, ranging from repeatedly poking nonexistent URLs to various
XML RPC exploit attempts (and one mysterious POST to /).
The most popular POST target was the nonexistent URL path /officescan/cgi/cgiRecvFile.exe, followed by my Recent Comments page.
Sidebar: the breakdown of responses
Distribution of HTTP response codes:
201807 2xx
  199273 200
  2534 206
59814 3xx
  55246 304
  4106 301
  459 302
27506 4xx
  13494 404
  8018 403
  5756 405
  234 400
  2 401
  1 416
  1 414
30 5xx
Some of the 404'd URLs are fairly popular, but I'm not going to try to read the tea leaves about that.
2006-12-24
What Google Sitemaps isn't
The Google Sitemaps XML format has a somewhat underdocumented
<priority> field, which is described as:
The priority of this URL relative to other URLs on your site. [...]
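In the sitemap XML itself this shows up as a per-URL element, something like the following (the URL, date, and value are made up):

  <url>
    <loc>http://www.example.com/blog/SomeEntry</loc>
    <lastmod>2006-12-20</lastmod>
    <priority>0.9</priority>
  </url>

Priorities range from 0.0 to 1.0, with 0.5 as the default.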
The Google documentation is somewhat imprecise here, as there are at least two meanings for 'priority' and they don't really say which one they mean. The first sort of priority is 'which pages do I want crawled first'; the second sort of priority is 'which pages (within my site) do I want ranked first in search results'.
To cut to the chase: Google Sitemaps <priority> is not the relative
priority in search results (the second sort of priority). It only seems
to influence how Google crawls your site (the first sort of priority),
which probably doesn't really matter unless you have a very large site.
This is disappointing, because when Google Sitemaps was first announced I was really hoping it would help me deal with a perpetual problem: I want my individual blog entries ranked higher than my index pages on search results.
The problem is that, unlike normal sites, blogs have a lot of duplicate content, since various sorts of index pages repeat individual entries wholesale. This means a Google search can turn up several URLs for the same entry, which Google has to rank somehow, and you would like the URL for the entry itself to rank highest; it's the most stable (there is no guarantee that the index page will still have the same entries as when Google crawled it) and it has the fewest distractions to obscure what the user is looking for (on an index page they have to find the right entry).
It would be nice if there was a way of telling Google about this, short
of telling it not to index your index pages (which I am leery of). Maybe
there is, but if there is it is not the Sitemaps <priority> field.
(Interestingly, Google seems to relatively consistently get this right for some places, such as LiveJournal. I can't help suspecting that they have special tuning for well-known blog sites and blogging packages.)
Of course this hardly matters right now, as Google has been unhappy with CSpace's sitemap for some time, for reasons that remain mysterious to me. (Yes, I've validated it.)
2006-12-18
A basic principle of website security
In theory I shouldn't have to say this, but in practice I probably do. One of the most basic principles of designing secure websites is simply this:
Never trust anything you get from the network.
Everything you get from the network is under the control of a
sufficiently determined attacker, no matter how it is 'supposed' to
be generated. Every request, every form POST, every cookie, and
every AJAX callback. No amount of obfuscation can do more than slow an
attacker down.
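To make that concrete, here is a minimal sketch of what this attitude looks like in practice. It's a hypothetical 'post a comment' form handler, not code from any real site; the field names and limits are made up. The point is that the server re-checks everything instead of believing whatever the form claims:

  import re

  MAX_COMMENT_LEN = 10000
  ENTRY_RE = re.compile(r'^[A-Za-z0-9/_.-]+$')

  def handle_comment_post(form, known_entries, session):
      # Never trust the 'entry' field to be a sane path: whitelist its
      # characters and check it against what actually exists server side.
      entry = form.get('entry', '')
      if not ENTRY_RE.match(entry) or entry not in known_entries:
          raise ValueError("bad entry")

      # Never trust client-side length limits; enforce them again here.
      comment = form.get('comment', '')
      if not comment or len(comment) > MAX_COMMENT_LEN:
          raise ValueError("bad comment")

      # Never trust a hidden 'is_admin' or 'already-previewed' form field;
      # derive that sort of thing from server-side session state instead.
      if not session.get('previewed', False):
          raise ValueError("comment was never previewed")

      return entry, comment

The same applies to cookies and AJAX parameters: anything the client sends can be replayed, edited, or made up from scratch.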
(In fact, obfuscation and attempts to hide things are a useful signpost to would-be attackers of where to look closely, a lesson I believe I learned from Harry Harrison's Stainless Steel Rat.)
As a bonus to not trusting network input, you'll gain resilience against the various badly coded crawlers and web browsers that send you crazy things from time to time.
(This grump was sparked by reading this (from Slashdot), which beats around the fundamental bush a bit too much for my taste. I suppose this is what I get for following a Slashdot link.)
2006-12-17
How to get me to block your web ads in a flash
The animation in Web display ads is outta control, outta control, I tell ya!
What he said (except that it's been going on for years). The fastest and best way to get me to really kick your ads to the curb has always been to make them blink and get in my face, and it amazes me that anyone has ever thought such ads were a good idea.
(Huge, page-disrupting and modem-saturating ads don't help, but they are not as scream and leap as frenzied animation is.)
It also amazes me that people are willing to run ads on their pages that take deliberate steps to be more attention-grabbing than the content. Apart from people who are only in it for the drive-by ad revenue from suckers, this seems self-defeating for any effort to build a long-term audience. You usually have only a relatively brief chance to hook a first-time visitor and persuade them to come back later, and what is presumably going to hook them is your content, not a collection of blinking stuff; so you want your content to be what their attention naturally settles on first.
(It's easy to see why blinking things are attractive to ad people, since it is well established that people reflexively pay attention to apparent motion and change. But this doesn't mean that using this low-level hook is a good idea or is likely to accomplish your actual goals, as most people who've tried using the <blink> tag can probably testify.)
Sidebar: what I use to get rid of ads
I've used the junkbuster filtering HTTP proxy since at least 1997 or so. Although it has limitations (like no HTTP/1.1 support), I still prefer it to in-browser solutions like Firefox's AdBlock, partly because I find its simple plain text configuration easier to manipulate. (And of course when I started using it, a filtering proxy was the only real option.)
I've never bothered upgrading to privoxy, the current line of junkbuster development, partly because junkbuster works fine for me so I have very little motivation to change. I did wind up trying privoxy out recently (due to people starting to stuff ads into RSS feeds), and my general view is that boy does it seem to have a complicated configuration system.
2006-12-10
An irony of web serving
One of the small paradoxes of the web is that it is often the connections with the least bandwidth that put the largest load on your web server.
This is because each connection consumes a certain amount of server resources, ranging from kernel data structures for socket buffers up to an entire thread or process on a dynamic website. The slower someone's connection, the longer they tie up this stuff on your end as you slowly feed them data. Conversely, people on fast connections get in, get their data, and get out fast, letting you release those resources.
Among other little effects, this means that it's not enough for load testing to pick a connections per second rate that you should be able to deal with. Ten new connections a second where each client takes a tenth of a second to download its content is rather easier to deal with than ten connections a second where each client takes ten seconds. (And if you test across a local LAN you are far more likely to get the former than the latter.)
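A back-of-the-envelope way to see this is that the number of connections you have open at any moment is roughly the rate of new connections times how long each one lasts. A trivial sketch of the arithmetic:

  def concurrent_connections(new_conns_per_sec, seconds_per_client):
      # Little's law: connections in flight = arrival rate * time per client.
      return new_conns_per_sec * seconds_per_client

  print(concurrent_connections(10, 0.1))   # fast clients: about 1 connection open
  print(concurrent_connections(10, 10))    # slow clients: about 100 connections open

Ten connections a second is the same arrival rate in both cases, but the slow clients leave you holding a hundred times as much state.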
There are some ways around parts of this effect:
- web servers based around asynchronous IO generally have far lower
per-connection overhead, which is one reason they're so popular (there's
a small sketch of this style after the list).
- reverse proxy web servers (including in a way Apache running CGI
programs) offer you a way of rapidly sucking the content out of
your high-overhead dynamic website system and parking it in a
low-overhead frontend web server while it trickles out to the
slow clients, instead of having the slow clients hold down an
expensive connection directly with the dynamic website bits.
(This only works well if your generated content is small enough to get sucked completely into the frontend, but this is the usual case.)
- some websites just disconnect clients after a certain amount of time, whether or not they are still transferring data. This is most popular for bulk downloads, where it's cheap for the server to start again if (or when) the client reconnects to resume the transfer.
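To illustrate the first point, here is a minimal sketch of the asynchronous style (a toy, not any particular real server): each connection costs only a small coroutine and its buffers rather than a whole thread or process, so even a few thousand slow clients are comparatively cheap to keep around.

  import asyncio

  RESPONSE = (b"HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n"
              b"Content-Length: 13\r\n\r\nhello, world\n")

  async def handle(reader, writer):
      # Read (and ignore) the request; waiting here parks a coroutine,
      # not a thread or a process.
      await reader.read(4096)
      writer.write(RESPONSE)
      await writer.drain()     # a slow client just parks this coroutine too
      writer.close()

  async def main():
      server = await asyncio.start_server(handle, "0.0.0.0", 8080)
      async with server:
          await server.serve_forever()

  asyncio.run(main())

The reverse proxy approach in the second point is in a way a method of getting much of the same effect without rewriting your dynamic website in this style.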