Another really stupid web spider
I have to take back what I said just the other day about having found the worst stealth spider I'd ever seen. The day after I wrote that entry, I saw a worse case.
Like the first one, they made two valid requests to start with and then followed them up with 65 bad ones, all in the span of 11 seconds. Also like the first one, all their requests were bad because they couldn't deal with absolute paths in <a href="...">. But they topped the first one because they lowercased all the URLs.
This right here is me clutching my head like a stunned monkey.
All 67 requests came from 22.214.171.124; according to theplanet.com, this is part of 126.96.36.199/29, assigned to 'PQC Service, LLC' of Wilmington, Delaware, zip code 19801. All of the machines have generic reverse DNS. (Some Googling suggests that the company runs porn sites.)
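For contrast with the absolute-path and lowercasing failures described above, here is a minimal sketch of resolving an `<a href="...">` value the way a browser would, using Python's standard `urllib.parse.urljoin`. The `example.org` URLs and the `resolve_link` helper name are made up for illustration; nothing here is taken from the spider itself.

```python
from urllib.parse import urljoin


def resolve_link(page_url: str, href: str) -> str:
    """Resolve an href found on page_url against that page's URL."""
    # urljoin copes with absolute paths ("/x/y"), relative paths ("y"),
    # and full URLs. The path is left exactly as written, because URL
    # paths are case-sensitive on many servers.
    return urljoin(page_url, href)


if __name__ == "__main__":
    page = "http://example.org/blog/python/SomePage"  # hypothetical page URL
    good = resolve_link(page, "/blog/web/OtherPage")  # absolute-path href
    print(good)
    # Lowercasing the result names a different resource on a
    # case-sensitive server, which is why such requests fail.
    print(good.lower() == good)
```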
The practical cost of forking in Python
I spent part of the other day working to speed up an SCGI-based program, and wound up hitting a vivid illustration of the practical cost of forking in Python. I'll start with the numbers:
- 5.3 milliseconds per request when the program forked a child to handle each request.
- 1.1 milliseconds per request when the forking was stubbed out so each request ran in the main process.
Benchmarking was done with Apache's ab, running on the same machine (and with only one request at a time, since the non-forking version obviously can't handle concurrent requests).
These numbers are pure SCGI overhead; the program had its usual response handler stubbed out to a special null handler that just returned a short hard-coded response, and it was directly connected to lighttpd. (Some work suggests that most of the remaining 1.1 milliseconds is in decoding the request's initial headers; I'm not sure how to speed this up.)
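To make "decoding the request's initial headers" concrete, here is a minimal sketch of that step, based on the SCGI wire format: a netstring (`<length>:<data>,`) whose data is a run of NUL-terminated name and value strings. The `parse_scgi_headers` function is a hypothetical illustration, not the code from the program being measured, and it assumes the whole header block is already in memory.

```python
from typing import Dict, Tuple


def parse_scgi_headers(data: bytes) -> Tuple[Dict[str, str], bytes]:
    """Split an SCGI request into (headers, remaining body bytes)."""
    length_str, sep, rest = data.partition(b":")
    if not sep:
        raise ValueError("missing netstring length prefix")
    length = int(length_str)
    block = rest[:length]
    # The header block must be complete and followed by a comma.
    if len(block) != length or rest[length:length + 1] != b",":
        raise ValueError("truncated or malformed netstring")
    items = block.split(b"\x00")
    # Every string is NUL-terminated, so the final split piece is empty.
    if items.pop() != b"" or len(items) % 2 != 0:
        raise ValueError("malformed header block")
    headers = {
        items[i].decode("latin-1"): items[i + 1].decode("latin-1")
        for i in range(0, len(items), 2)
    }
    return headers, rest[length + 1:]
```

Timing a function like this in isolation (for instance with `timeit`) is one way to check how much of the per-request cost it accounts for.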
Since I have a thread pool package lying around, I hacked the SCGI server up to use it instead of forking; the performance stayed around 1.1 milliseconds per request, somewhat to my surprise.
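The general shape of the thread pool approach can be sketched with the standard library's `concurrent.futures.ThreadPoolExecutor`. This stands in for the thread pool package mentioned above, which is not shown in this entry; the `handle` and `serve` names, the fixed response, and the `max_requests` parameter are all assumptions made for illustration.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

# A short hard-coded CGI-style reply, in the spirit of a null handler.
RESPONSE = b"Status: 200 OK\r\nContent-Type: text/plain\r\n\r\nhello\n"


def handle(conn: socket.socket) -> None:
    """Null handler: read some of the request, send a fixed reply, close."""
    with conn:
        conn.recv(65536)  # a real server would parse the SCGI request here
        conn.sendall(RESPONSE)


def serve(listener: socket.socket, workers: int = 8,
          max_requests: Optional[int] = None) -> None:
    """Accept connections and give each to a pool thread instead of fork()."""
    handled = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while max_requests is None or handled < max_requests:
            conn, _addr = listener.accept()
            pool.submit(handle, conn)
            handled += 1
```

Since the worker threads are created once and reused, nothing is forked per request and there are no dead children to track and reap.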
I don't have any explanation of why Python takes 4.2 milliseconds more when I fork for each request. The direct cost of fork() with all of the program's modules imported is about 1.3 milliseconds (the fork tax varies with how many dynamic libraries the Python process has loaded, so it's important to measure with your program's actual set of imports). Forking does require extra management code to do things like track and reap dead children, but 2.9 milliseconds seems a bit high for it.
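One rough way to measure the direct fork cost on your own system is sketched below. It is not the measurement code behind the 1.3 millisecond figure; `time_fork` is a hypothetical helper. It times `os.fork()` together with the child's immediate exit and the parent's `os.waitpid()`, so the result is an upper bound on the bare fork. It is Unix-only.

```python
import os
import time


def time_fork(iterations: int = 200) -> float:
    """Average wall-clock seconds for one fork() plus reaping the child."""
    start = time.perf_counter()
    for _ in range(iterations):
        pid = os.fork()
        if pid == 0:
            # Child: exit at once, skipping atexit handlers and buffer flushes.
            os._exit(0)
        # Parent: reap the child so zombies do not pile up.
        os.waitpid(pid, 0)
    return (time.perf_counter() - start) / iterations


if __name__ == "__main__":
    # Import your program's real modules before this point, since the
    # cost depends on what the process has loaded.
    print(f"{time_fork() * 1000:.3f} ms per fork+wait")
```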