Wandering Thoughts archives


Link: Search engine page size limits for indexing

Search Engine Indexing Limits: Where Do the Bots Stop? takes an experimental approach to seeing how big a page various search engine bots will fetch, and how much of large pages they index. I find this an interesting question because it affects how you organize your content and generate indexes to it, especially for dynamic websites with auto-generated aggregate pages.

One area not investigated in the article is how far down the pages the search engine bots will go looking for links to follow. I smell a followup project for someone.

(From Ned Batchelder, who has interesting information on the size of his own blog pages as a result of this.)

links/SearchPageSizeLimits written at 23:18:26

SCGI versus FastCGI

SCGI and FastCGI are both CGI 'replacements', in that they are protocols for forwarding HTTP requests to persistent daemons instead of starting a possibly big, heavyweight program for each request. Ideally your web server will have a built-in gateway for them; less ideally, you can run a tiny, fast-to-start CGI program that talks the protocol to your persistent daemon. There are discussions around the Internet about which one is better; you can find people on both sides, and to some extent it depends on what your web server supports best (lighttpd seems to prefer FastCGI, for example).

(This is a good overview of the whole subject and its history.)

From my perspective SCGI is the clear winner, for a basic reason: SCGI is simple enough to actually implement.

FastCGI is a very complicated protocol (see here), with all sorts of features and a bunch of worrying about efficiency. SCGI is dirt simple; the specification is only 100 lines long, and you can implement either end of it in the language of your choice in an hour or so. Since I'm not just plugging existing components together, this difference is important.
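
To illustrate just how little there is to SCGI, here is a sketch of both ends of its request encoding in Python. The function names are mine, not from any library: an SCGI request is simply a netstring containing NUL-terminated CGI header pairs (with CONTENT_LENGTH required first and an SCGI: 1 marker), followed by the raw request body.

```python
def encode_scgi_request(environ, body=b""):
    """Gateway side: wrap CGI-style headers plus a request body into an
    SCGI request. Per the spec, CONTENT_LENGTH must come first and an
    'SCGI' header with value '1' must be present."""
    headers = [("CONTENT_LENGTH", str(len(body))), ("SCGI", "1")]
    headers += [(k, v) for k, v in environ.items()
                if k not in ("CONTENT_LENGTH", "SCGI")]
    hblob = b"".join(k.encode() + b"\x00" + v.encode() + b"\x00"
                     for k, v in headers)
    # Netstring framing: "<length>:<payload>," and then the body.
    return str(len(hblob)).encode() + b":" + hblob + b"," + body

def decode_scgi_request(data):
    """Daemon side: parse a complete SCGI request back into a header
    dict and the body (assumes all the bytes have been read already)."""
    length, _, rest = data.partition(b":")
    hlen = int(length)
    hblob, tail = rest[:hlen], rest[hlen:]
    assert tail[:1] == b","          # netstring terminator
    fields = hblob.split(b"\x00")[:-1]   # trailing NUL leaves an empty tail
    headers = {fields[i].decode(): fields[i + 1].decode()
               for i in range(0, len(fields), 2)}
    body = tail[1:1 + int(headers["CONTENT_LENGTH"])]
    return headers, body

req = encode_scgi_request({"REQUEST_METHOD": "GET", "PATH_INFO": "/"})
hdrs, body = decode_scgi_request(req)
```

That really is the whole protocol, modulo reading the response back (which is just a plain CGI-style response over the same socket); compare this with the connection multiplexing, record types, and padding rules that a conforming FastCGI implementation has to deal with.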

(Some people have even reported that SCGI's simplicity means that it runs faster than FastCGI in practice.)

For all the extra complexity of FastCGI, all I seem to get is the ability to send standard error back to the web server in addition to standard output. I think I can live without that. (Of course it's hard to tell if that's all, since the FastCGI specification is so large.)

I like to think that there's a general idea lurking here: simple protocols and formats are often better because they are simpler to use and to integrate into your environment. It's rare that existing pieces (that support some complicated protocol) are a perfect fit for what people want to do; when you have to adapt things, the simpler they are, the easier that is.

Sidebar: what about plain old HTTP?

There are opinions out there that the real answer is for the persistent daemon that does the real work to just speak HTTP directly and have requests proxied to it. I am personally dubious about this; I would much rather delegate the job of dealing with all of the tricky, complex bits of HTTP to a dedicated program, i.e. the web server. Ian Bicking has another set of reasons for coming to the same conclusion.

(Another person arguing the non-HTTP side, for reasons pretty similar to mine, is here.)

programming/SCGIvsFastCGI written at 19:42:59

Weekly spam summary on May 6th, 2006

This week, we:

  • got 11,443 messages from 213 different IP addresses.
  • handled 15,802 sessions from 820 different IP addresses.
  • received 219,841 connections from at least 43,156 different IP addresses.
  • hit a high-water mark of 50 connections being checked at once, reaching it on Monday.

Connection volume is up significantly from last week's extrapolated levels. All of this is despite us being down for about half of Sunday, due to a drive failure and needing to fix it. The per-day table is very interesting, though:

Day        Connections   Different IPs
Sunday         6,518        +2,602
Monday        22,737        +6,621
Tuesday       19,300        +6,684
Wednesday     23,372        +6,488
Thursday      22,592        +5,987
Friday        22,169        +8,218
Saturday     103,153        +6,556

You can see the Sunday effects, and I have nothing to say about this Saturday except AIEEE. I rather suspect that there is a major spam storm going on at the moment.

Kernel level packet filtering top ten:

Host/Mask       Packets   Bytes
(IP elided)        6045    290K
(IP elided)        5433    268K
(IP elided)        4274    205K
(IP elided)        3284    158K
(IP elided)        2748    136K
(IP elided)        2392    124K
(IP elided)        2241    108K
(IP elided)        2193    105K
(IP elided)        2166    110K
(IP elided)        2045    104K

It's pretty much the week of DNS blocklists:

  • one is a Hong Kong IP address with bad reverse DNS.
  • one is in the DSBL.
  • two are in the ORDB.
  • one is in NJABL.
  • one kept hammering on us after attempting delivery to a spamtrap; I suspect it's phish spam, from the MAIL FROM address.

(The usual difference is that advance fee fraud spam exploits badly administered webmail systems and so has MAIL FROM addresses that look like individual user names, whereas phish spam exploits insecure web servers and thus has MAIL FROM addresses with usernames like httpd, apache, root, nobody, test, and so on.)
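
That MAIL FROM heuristic can be sketched as a tiny classifier. The username list and function name here are mine, purely illustrative, and the list of service-account names is an assumption extrapolated from the examples above:

```python
# Local parts that suggest the mail came from a compromised or insecure
# web server's own service accounts rather than from a webmail account.
SERVICE_NAMES = {"httpd", "apache", "root", "nobody", "test",
                 "www", "www-data", "daemon"}

def guess_spam_kind(mail_from):
    """Guess a spam's likely origin from its MAIL FROM address: phish
    spam tends to come via insecure web servers (service-account local
    parts), while advance fee fraud tends to come via badly administered
    webmail systems (person-looking local parts)."""
    localpart = mail_from.split("@", 1)[0].lower()
    if localpart in SERVICE_NAMES:
        return "phish (web server origin)"
    return "advance fee fraud (webmail origin)"
```

This is only a rough signal, of course; it tells you nothing definitive about any single message, but it matches the patterns I keep seeing in the spamtraps.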

Connection time rejection stats:

  41638 total
  19232 dynamic IP
  18044 bad or no reverse DNS
   2279 class bl-cbl
    481 class bl-njabl
    409 class bl-ordb
    255 class bl-spews
    167 class bl-dsbl
     48 class bl-sdul
     28 class bl-sbl
      3 class bl-opm

In completely unsurprising news (given the spam storm), 24 of the top 30 most rejected IP addresses were rejected more than 100 times; the champion racked up 259 rejected connections. 23 of the top 30 are currently in the CBL and 13 of them are currently in bl.spamcop.net.

The Hotmail numbers are at pretty much an all-time low, although they still collect one black eye:

  • No messages accepted.
  • No messages rejected because they came from non-Hotmail email addresses.
  • 3 messages sent to our spamtraps.
  • No messages refused because their sender addresses had already hit our spamtraps.
  • 1 message refused due to its origin IP address being in SBL17935, listed since January 17th, 2006.

Of course Hotmail is still batting zero since no real Hotmail people actually sent us email this week, but at least they're not swinging very much.

And the final set of numbers:

What          This week (distinct IPs)   Last week (distinct IPs)
Bad HELOs     405 (46)                   346 (40)
Bad bounces   8 (7)                      29 (23)

On the bad HELOs front, the most active source made 100 tries; the next managed only 57. The bad bounces number is completely surprising; at this level, I can actually look at each session. While some of the bounces are to completely bogus user names, some are to what are now spamtrap addresses here. I don't know what this means; have spammers started mining their target lists for MAIL FROMs?

The user name patterns for the bad bounces:

  • last week saw 4 each to id and noreply, 11 more between four spamtraps, then one each to a mix of spamtraps, random sequences like c301ymxlp, and some entirely numeric user names like 72.
  • this week saw 2 to costauvqaagmlp, 4 to spamtraps, one to entranceway, and one to the 38-character hex sequence 8B407639D45C5742ADD3987F7E013C410F82BC.

Conclusion: spammers are strange.

spam/SpamSummary-2006-05-06 written at 02:42:49


This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.