Wandering Thoughts archives

2006-05-10

Another really stupid web spider

I have to take back what I said just the other day about the worst stealth spider I'd ever seen. The day after I wrote that entry, I saw a worse one.

Like the first one, this spider made two valid requests to start with and then followed them up with 65 bad ones, all in the span of 11 seconds. Also like the first one, all of its requests were bad because it couldn't deal with absolute paths in <a href="...">. But it topped the first one by also lowercasing all of the URLs.
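For contrast, here is a minimal sketch of what correct link resolution looks like, using Python's standard urllib.parse.urljoin. The example.org URLs and the resolve_link helper are hypothetical, made up for illustration; this shows the expected behaviour, not a reconstruction of what the spider actually did.

```python
from urllib.parse import urljoin


def resolve_link(page_url: str, href: str) -> str:
    """Resolve an href against the page it appeared on, preserving case."""
    return urljoin(page_url, href)


# Hypothetical page and links, for illustration only.
page = "http://example.org/blog/web/SomeEntry"

# An absolute path replaces the whole path of the page URL.
absolute = resolve_link(page, "/blog/python/PythonForkCost")

# A relative path is resolved against the page's directory.
relative = resolve_link(page, "OtherEntry")

# Paths are case-sensitive on many servers, so lowercasing a
# mixed-case URL yields a different (and probably nonexistent) page.
lowercased = absolute.lower()

if __name__ == "__main__":
    print(absolute)
    print(relative)
    print(lowercased)
```

A spider that gets either step wrong turns every mixed-case link with an absolute path into a bad request.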

This right here is me clutching my head like a stunned monkey.

All 67 requests came from 69.56.135.218; according to theplanet.com, this is part of 69.56.135.216/29, assigned to 'PQC Service, LLC' of Wilmington, Delaware, zip code 19801. All of the machines have generic reverse DNS. (Some Googling suggests that the company runs porn sites.)
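As a quick sanity check on the netblock, the standard ipaddress module can confirm that the source address falls inside the /29 and show how small the block is. This is just arithmetic on the addresses quoted above.

```python
import ipaddress

# The netblock and source address quoted in the entry.
net = ipaddress.ip_network("69.56.135.216/29")
addr = ipaddress.ip_address("69.56.135.218")

# A /29 leaves 3 host bits, so the block covers 8 addresses.
in_block = addr in net
first, last = net[0], net[-1]

if __name__ == "__main__":
    print(in_block, net.num_addresses, first, last)
```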

web/ReallyStupidSpiderII written at 16:17:40

The practical cost of forking in Python

I spent part of the other day working to speed up an SCGI-based program, and wound up hitting a vivid illustration of the practical cost of forking in Python. I'll start with the numbers:

  • 5.3 milliseconds per request when the program forked a child to handle each request.
  • 1.1 milliseconds per request when the forking was stubbed out so each request ran in the main process.

Benchmarking was done with Apache's ab, running on the same machine (and with only one request at a time, since the non-forking version obviously can't handle concurrent requests).

These numbers are pure SCGI overhead; the program had its usual response handler stubbed out to a special null handler that just returned a short hard-coded response, and it was directly connected to lighttpd. (Some investigation suggests that most of the remaining 1.1 milliseconds is spent decoding the request's initial headers; I'm not sure how to speed this up.)
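To make the header-decoding step concrete, here is a minimal sketch of parsing an SCGI request. It follows the SCGI wire format (a netstring of NUL-separated header name/value pairs, followed by the body); it is not the code of the program being benchmarked, and parse_scgi_request is a name made up for this example.

```python
from typing import Dict, Tuple


def parse_scgi_request(data: bytes) -> Tuple[Dict[str, str], bytes]:
    """Split a complete SCGI request into (headers, body)."""
    # The headers arrive as a netstring: b"<length>:<payload>,"
    length_str, sep, rest = data.partition(b":")
    if not sep:
        raise ValueError("missing netstring length")
    header_len = int(length_str)
    headers_blob = rest[:header_len]
    if rest[header_len:header_len + 1] != b",":
        raise ValueError("missing netstring terminator")

    # The payload is name NUL value NUL ..., so splitting on NUL
    # leaves one trailing empty element, which is dropped here.
    fields = headers_blob.split(b"\0")
    if fields.pop() != b"" or len(fields) % 2 != 0:
        raise ValueError("malformed header block")
    headers = {
        fields[i].decode("latin-1"): fields[i + 1].decode("latin-1")
        for i in range(0, len(fields), 2)
    }

    # CONTENT_LENGTH says how many body bytes follow the comma.
    body_len = int(headers.get("CONTENT_LENGTH", "0"))
    body = rest[header_len + 1:header_len + 1 + body_len]
    return headers, body


if __name__ == "__main__":
    blob = b"CONTENT_LENGTH\x005\x00SCGI\x001\x00REQUEST_METHOD\x00POST\x00"
    request = str(len(blob)).encode() + b":" + blob + b"," + b"hello"
    print(parse_scgi_request(request))
```

Even in this simple form the work is a handful of splits and decodes per request, so there is not much obvious fat to trim.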

Since I have a thread pool package lying around, I hacked the SCGI server up to use it instead of forking; the performance stayed around 1.1 milliseconds per request, somewhat to my surprise.
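The shape of the thread-pool variant can be sketched with the standard library's concurrent.futures.ThreadPoolExecutor. This is an assumption-laden stand-in: the original used its own thread pool package, and null_handler, serve, and RESPONSE are hypothetical names for this example.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

# A short hard-coded reply, standing in for the stubbed-out handler's output.
RESPONSE = b"Status: 200 OK\r\nContent-Type: text/plain\r\n\r\nhello\n"


def null_handler(conn: socket.socket) -> None:
    """Send the fixed reply and close; a real handler would read the request first."""
    with conn:
        conn.sendall(RESPONSE)


def serve(listener: socket.socket, pool: ThreadPoolExecutor, max_requests: int) -> None:
    """Accept connections and hand each one to a pool thread instead of forking."""
    for _ in range(max_requests):
        conn, _addr = listener.accept()
        pool.submit(null_handler, conn)


if __name__ == "__main__":
    # Exercise the handler over a socket pair rather than a real listener.
    server_side, client_side = socket.socketpair()
    with ThreadPoolExecutor(max_workers=4) as pool:
        pool.submit(null_handler, server_side).result()
    print(client_side.recv(4096))
    client_side.close()
```

The main process only accepts and dispatches, so there are no children to track or reap.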

I don't have any explanation for why Python takes 4.2 milliseconds more when I fork for each request. The direct cost of fork() with all of the program's modules imported is about 1.3 milliseconds (the fork tax varies with how many dynamic libraries the Python process has loaded, so it's important to measure with your program's actual set of imports). Forking does require extra management code to do things like track and reap dead children, but the remaining 2.9 milliseconds seems a bit high for that.
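One rough way to measure the fork tax for your own program is to time a loop of fork-and-reap after importing the same modules the program uses. This is a sketch, not the measurement method used for the numbers above; time_fork is a hypothetical helper, it needs a Unix system, and it includes the cost of the child exiting and being reaped as well as fork() itself.

```python
import os
import time

# Import your program's real modules here before measuring, since
# the cost depends on what the process has loaded.


def time_fork(count: int = 200) -> float:
    """Return the average seconds for one fork() plus reaping the child."""
    start = time.perf_counter()
    for _ in range(count):
        pid = os.fork()
        if pid == 0:
            # Child: exit immediately without running cleanup handlers.
            os._exit(0)
        # Parent: wait for the child so it does not linger as a zombie.
        os.waitpid(pid, 0)
    return (time.perf_counter() - start) / count


if __name__ == "__main__":
    print("%.3f ms per fork" % (time_fork() * 1000.0))
```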

python/PythonForkCost written at 02:27:52


