2006-05-07
Link: Search engine page size limits for indexing
"Search Engine Indexing Limits: Where Do the Bots Stop?" takes an experimental approach to seeing how big a page various search engine bots will fetch, and how much of a large page they index. I find this an interesting question because it affects how you organize your content and generate indexes to it, especially for dynamic websites with auto-generated aggregate pages.
One area not investigated in the article is how far down the pages the search engine bots will go looking for links to follow. I smell a followup project for someone.
(From Ned Batchelder, who has interesting information on the size of his own blog pages as a result of this.)
SCGI versus FastCGI
SCGI and FastCGI are both CGI 'replacements', in that they are protocols for forwarding HTTP requests to persistent daemons instead of starting a possibly big, heavyweight program for each request. Ideally your web server will have a built-in gateway for them; less ideally you can run a tiny, fast-to-start CGI program to talk the protocol to your persistent daemon. There are discussions around the Internet about which one is better; you can find people on both sides, and to some extent it depends on what your web server supports best (lighttpd seems to prefer FastCGI, for example).
(This is a good overview of the whole subject and its history.)
From my perspective SCGI is the clear winner, for a basic reason: SCGI is simple enough to actually implement.
FastCGI is a very complicated protocol (see here), with all sorts of features and a bunch of worrying about efficiency. SCGI is dirt simple; the specification is only 100 lines long, and you can implement either end of it in the language of your choice in an hour or so. Since I'm not just plugging existing components together, this difference is important.
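To give a feel for how little there is to SCGI, here is a minimal sketch of both ends of the request framing in Python. It assumes the framing as I understand it from the SCGI specification: the headers are sent as NUL-terminated name and value pairs wrapped in a single netstring, CONTENT_LENGTH comes first, an SCGI header with the value 1 is required, and the request body follows the netstring. The function names are made up for illustration, and a real implementation would also need socket handling and more careful error checking.

```python
from __future__ import annotations


def encode_netstring(payload: bytes) -> bytes:
    """Wrap payload as a netstring: b"<length>:<payload>,"."""
    return str(len(payload)).encode("ascii") + b":" + payload + b","


def build_scgi_request(environ: dict[str, str], body: bytes = b"") -> bytes:
    """Client (web server) side: frame the headers and body as one SCGI request."""
    # CONTENT_LENGTH has to be the first header, and SCGI=1 must be present.
    headers = [("CONTENT_LENGTH", str(len(body))), ("SCGI", "1")]
    headers += [(name, value) for name, value in environ.items()
                if name not in ("CONTENT_LENGTH", "SCGI")]
    block = b"".join(
        name.encode("latin-1") + b"\x00" + value.encode("latin-1") + b"\x00"
        for name, value in headers
    )
    return encode_netstring(block) + body


def parse_scgi_request(data: bytes) -> tuple[dict[str, str], bytes]:
    """Daemon side: split a complete request into a header dict and the body."""
    length_text, colon, rest = data.partition(b":")
    if not colon:
        raise ValueError("missing netstring length")
    size = int(length_text)
    block, comma, after = rest[:size], rest[size:size + 1], rest[size + 1:]
    if len(block) != size or comma != b",":
        raise ValueError("malformed netstring")
    fields = block.split(b"\x00")
    # Every name and value ends in NUL, so the split leaves one empty tail item.
    if fields[-1] != b"" or len(fields) % 2 != 1:
        raise ValueError("malformed header block")
    fields = fields[:-1]
    headers = {
        fields[i].decode("latin-1"): fields[i + 1].decode("latin-1")
        for i in range(0, len(fields), 2)
    }
    return headers, after[:int(headers["CONTENT_LENGTH"])]


# Round trip of a small POST request.
request = build_scgi_request(
    {"REQUEST_METHOD": "POST", "REQUEST_URI": "/deepthought"},
    b"What is the answer to life?",
)
headers, body = parse_scgi_request(request)
```

The response direction is even simpler, since as far as I know the daemon just writes a CGI-style response (headers, a blank line, the body) and closes the connection.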
(Some people have even reported that SCGI's simplicity means that it runs faster than FastCGI in practice.)
For all the extra complexity of FastCGI, all I seem to get is the ability to send standard error back to the web server in addition to standard output. I think I can live without that. (Of course it's hard to tell if that's all, since the FastCGI specification is so large.)
I like to think that there's a general idea lurking here: simple protocols and things are often better because they are simpler to use and to integrate into environments. It's rare that existing pieces (that support some complicated protocol) are all perfect fits for what people want to do; when you have to adapt things, the simpler they are the easier it is.
Sidebar: what about plain old HTTP?
There are some opinions that the real answer is for the persistent daemon that does the real work to just speak HTTP directly and have requests proxied to it. I am personally dubious about this; I would much rather delegate the job of dealing with all of the tricky, complex bits of HTTP to a dedicated program, i.e. the web server. Ian Bicking has another set of reasons for coming to the same conclusion.
(Another person arguing the non-HTTP side, for reasons pretty similar to mine, is here.)
Weekly spam summary on May 6th, 2006
This week, we:
- got 11,443 messages from 213 different IP addresses.
- handled 15,802 sessions from 820 different IP addresses.
- received 219,841 connections from at least 43,156 different IP addresses.
- hit a highwater of 50 connections being checked at once, reaching it Monday.
Connection volume is up significantly from the extrapolated levels of last week. All of this is despite us being down for about half of Sunday, due to a drive failure and needing to fix it. The per day table is very interesting, though:
Day | Connections | different IPs |
Sunday | 6,518 | +2,602 |
Monday | 22,737 | +6,621 |
Tuesday | 19,300 | +6,684 |
Wednesday | 23,372 | +6,488 |
Thursday | 22,592 | +5,987 |
Friday | 22,169 | +8,218 |
Saturday | 103,153 | +6,556 |
You can see the Sunday effects, and I have nothing to say about this Saturday except AIEEE. I rather suspect that there is a major spam storm going on at the moment.
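As a sanity check on the per day table, here is a small Python sketch that adds up the two columns. It assumes (my reading, not something stated outright) that the "+" figures count IP addresses not already seen earlier in the week; the numbers are copied straight from the table.

```python
# Per day figures from the table: (connections, new IP addresses).
per_day = {
    "Sunday": (6518, 2602),
    "Monday": (22737, 6621),
    "Tuesday": (19300, 6684),
    "Wednesday": (23372, 6488),
    "Thursday": (22592, 5987),
    "Friday": (22169, 8218),
    "Saturday": (103153, 6556),
}

total_connections = sum(conns for conns, _ in per_day.values())
total_new_ips = sum(new_ips for _, new_ips in per_day.values())

# Fraction of the whole week's connections that arrived on Saturday alone.
saturday_share = per_day["Saturday"][0] / total_connections

print(total_connections, total_new_ips, f"{saturday_share:.1%}")
```

Both totals match the weekly summary figures (219,841 connections from at least 43,156 different IP addresses), and Saturday by itself accounts for a bit under half of the week's connections.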
Kernel level packet filtering top ten:
Host/Mask           Packets   Bytes
218.189.207.71         6045    290K
212.216.176.0/24       5433    268K
213.253.210.34         4274    205K
213.178.230.131        3284    158K
61.128.0.0/10          2748    136K
222.32.0.0/11          2392    124K
67.138.83.190          2241    108K
213.250.36.13          2193    105K
199.195.71.42          2166    110K
218.0.0.0/11           2045    104K
It's pretty much the week of DNS blocklists:
- 218.189.207.71 is a Hong Kong IP address with bad reverse DNS.
- 213.253.210.34 is in the DSBL.
- 213.178.230.131 and 213.250.36.13 are in the ORDB.
- 67.138.83.190 is in NJABL.
- 199.195.71.42 kept hammering on us after attempting delivery to a spamtrap; I suspect it's phish spam from the MAIL FROM address.
(The usual difference is that advance fee fraud spam exploits badly administered webmail systems and so has MAIL FROM addresses that look like individual user names, whereas phish spam exploits insecure web servers and thus has MAIL FROM addresses with usernames like httpd, apache, root, nobody, test, and so on.)
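The rule of thumb above is easy to express as code. This is only an illustrative Python sketch: the function name and the returned labels are made up, the account list is just the usernames mentioned above, and a real classifier would need far more evidence than the local part of a MAIL FROM address.

```python
# Usernames typical of web server and system accounts, taken from the text.
SYSTEM_ACCOUNTS = {"httpd", "apache", "root", "nobody", "test"}


def guess_spam_kind(mail_from: str) -> str:
    """Guess the kind of spam from the local part of a MAIL FROM address."""
    address = mail_from.strip().strip("<>")
    local_part = address.rsplit("@", 1)[0].lower()
    if local_part in SYSTEM_ACCOUNTS:
        # Looks like mail sent by a compromised web server.
        return "phish"
    # Looks like an individual user, e.g. an abused webmail account.
    return "advance-fee"


print(guess_spam_kind("<apache@example.com>"))
print(guess_spam_kind("<mr.john.smith@example.net>"))
```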
Connection time rejection stats:
41638 total
19232 dynamic IP
18044 bad or no reverse DNS
 2279 class bl-cbl
  481 class bl-njabl
  409 class bl-ordb
  255 class bl-spews
  167 class bl-dsbl
   48 class bl-sdul
   28 class bl-sbl
    3 class bl-opm
In completely unsurprising news (given the spam storm), 24 of the top 30 most rejected IP addresses were rejected more than 100 times; the champion was 218.254.83.47 with 259 rejected connections. 23 of the top 30 are currently in the CBL and 13 of them are currently in bl.spamcop.net.
The Hotmail numbers are at pretty much an all-time low, although they still collect one black eye:
- No messages accepted.
- No messages rejected because they came from non-Hotmail email addresses.
- 3 messages sent to our spamtraps.
- No messages refused because their sender addresses had already hit our spamtraps.
- 1 message refused due to its origin IP address being in SBL17935, listed since January 17th, 2006.
Of course Hotmail is still batting zero since no real Hotmail people actually sent us email this week, but at least they're not swinging very much.
And the final set of numbers:
what | # this week | (distinct IPs) | # last week | (distinct IPs) |
Bad HELOs | 405 | 46 | 346 | 40 |
Bad bounces | 8 | 7 | 29 | 23 |
On the bad HELOs front, the most active source was 205.150.71.250, with 100 tries; the next was 217.197.167.34 with only 57. The bad bounces number is completely surprising; at this level, I can actually look at each session. While some of the bounces are to completely bogus user names, some are to what are now spamtrap addresses here. I don't know what this means; have spammers started mining their target lists for MAIL FROMs?
The user name patterns for the bad bounces:
- last week saw 4 each to id and noreply, 11 more between four spamtraps, then one each to a mix of spamtraps, random sequences like c301ymxlp, and some entirely numeric user names like 72.
- this week saw 2 to costauvqaagmlp, 4 to spamtraps, one to entranceway, and one to the 38-character hex sequence 8B407639D45C5742ADD3987F7E013C410F82BC.
Conclusion: spammers are strange.