2006-06-26
How not to report spam (part 1)
For my sins, I am on one of the aliases here that gets a certain number of reports of spamming theoretically committed by UofT IP addresses. (I am not one of the people who have to deal with them, fortunately; it is a thankless job.) This exposes me to a certain number of good examples of how not to report spam.
Today's example comes to us from an official government organization in a large South American country. All the information they gave us was:
- the date (with the format spelled out: +1)
- the time (with the time zone, as an offset from GMT: +1)
- the sending IP address.
- the 'SMTP ID', apparently something generated by their system.
- the virus type it was identified as.
- the Subject line of the mail.
Unfortunately, the IP address is the IP address of our main outgoing SMTP gateway. It sends a considerable amount of email, and little details like the MAIL FROM and the RCPT TO of the problematic message would have been useful.
(Disclaimer: despite my grumbles, Vernon Schryver's remarks about spam complaints definitely apply. Even people making imperfect spam reports are doing us a favour that they don't have to. It would just be faster to fix the issue if we got more information.)
WSGI versus asynchronous servers
Asynchronous servers and frameworks are a popular way to create highly scalable systems. Although WSGI isn't explicitly designed to support them, putting a WSGI application in an asynchronous server isn't totally foolish: many WSGI applications won't be doing anything that can block.
(Technically disk IO can block, but Python on Unix doesn't have any way to do asynchronous disk IO without using threads.)
However, there is one serious fly in the ointment: the WSGI spec requires a synchronous interface for reading the HTTP request body. You get it from wsgi.input, which is specified to be a file-like object.
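To make the synchronous interface concrete, here is a minimal sketch of a WSGI application that reads its request body. The names echo_app and call_app are made up for illustration, and call_app is only a toy stand-in for a server, not how any real WSGI server works.

```python
import io


def echo_app(environ, start_response):
    # CONTENT_LENGTH may be missing or empty; treat that as zero bytes.
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
    except ValueError:
        length = 0
    # This read() is an ordinary synchronous call. It returns only once
    # the data is available, so the application has no way to hand
    # control back to an event loop while it waits.
    body = environ["wsgi.input"].read(length)
    start_response("200 OK", [
        ("Content-Type", "application/octet-stream"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_app(app, body):
    # Toy driver (an assumption for this example): the whole body is
    # already in memory, so read() never actually has to wait.
    environ = {
        "REQUEST_METHOD": "POST",
        "CONTENT_LENGTH": str(len(body)),
        "wsgi.input": io.BytesIO(body),
    }
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    chunks = app(environ, start_response)
    return captured["status"], b"".join(chunks)
```

With a real network connection behind wsgi.input, the read() call is where the application would block.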
The spec suggests one way around this: the WSGI server can read the request body from the network (doing so asynchronously) and buffer it all up before invoking the WSGI application. I'm not very fond of this because it makes defending against certain sorts of denial of service attacks much more difficult, as the WSGI server has no idea what the size and time limits of the WSGI application are.
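A rough sketch of the buffering approach, using Python's asyncio purely for illustration, might look like the following. The names buffer_body, SERVER_LIMIT, and demo are hypothetical, and the one megabyte cap is an arbitrary assumption; the point is that the server has to pick some limit without knowing what the application would have chosen.

```python
import asyncio
import io

# Hypothetical server-wide cap. The server has to guess at this,
# because it cannot ask the WSGI application what its own limit is.
SERVER_LIMIT = 1024 * 1024


async def buffer_body(reader, content_length, limit=SERVER_LIMIT):
    # Refuse oversized bodies before reading anything.
    if content_length > limit:
        raise ValueError("request body exceeds server limit")
    # Read the whole body without blocking the event loop.
    data = await reader.readexactly(content_length)
    # The application later sees an in-memory file-like object, so its
    # synchronous read() calls return immediately.
    return io.BytesIO(data)


async def demo():
    # Stand-in for a network connection: a StreamReader fed by hand.
    reader = asyncio.StreamReader()
    reader.feed_data(b"name=value")
    reader.feed_eof()
    wsgi_input = await buffer_body(reader, 10)
    return wsgi_input.read()
```

Calling asyncio.run(demo()) runs the sketch end to end. A real server would also need a time limit on the read, which is again something only the application really knows.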
(For example, DWiki rejects all POSTs over 64K without even trying to read them.)
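As an illustration of that kind of check (this is not DWiki's actual code, just a sketch under my own assumptions), an application can look at CONTENT_LENGTH and refuse before it ever touches wsgi.input. The names limited_app and MAX_POST are made up for the example.

```python
MAX_POST = 64 * 1024  # the 64K limit mentioned above


def limited_app(environ, start_response):
    body = b""
    if environ.get("REQUEST_METHOD") == "POST":
        try:
            length = int(environ.get("CONTENT_LENGTH") or 0)
        except ValueError:
            length = -1
        if length < 0:
            # Unparseable or negative length: reject without reading.
            start_response("400 Bad Request",
                           [("Content-Type", "text/plain")])
            return [b"bad Content-Length\n"]
        if length > MAX_POST:
            # Too big: reject without reading a single byte.
            start_response("413 Request Entity Too Large",
                           [("Content-Type", "text/plain")])
            return [b"POST body too large\n"]
        body = environ["wsgi.input"].read(length)
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"got %d bytes\n" % len(body)]
```

If the server has already buffered the entire body before calling the application, this check comes too late to save any work.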
This may seem nit-picky, but building resilient servers is already hard enough that I'm nervous about adding more obstacles.
This is one of those situations when continuations or coroutines would be pretty handy; the wsgi.input object could use one or the other to put the entire WSGI application to sleep until more network input showed up. (Python's yield-based coroutines aren't good enough because they only work with direct function calls; the wsgi.input.read() method function can't use yield to pop all the way back to the WSGI server.)
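The limitation is easy to see in a small sketch. Here read() is a hypothetical stand-in for wsgi.input.read() that tries to pause with yield. An ordinary caller just gets a generator object back instead of being suspended. The second version uses yield from, which only exists in later Python versions (3.3 and up), and it works only because the caller itself was rewritten to cooperate.

```python
def read():
    # Hypothetical stand-in for wsgi.input.read(): it wants to pause
    # until data arrives, so it yields a "need input" marker.
    data = yield "need-input"
    return data


def app_plain():
    # An ordinary synchronous caller. Calling read() does not suspend
    # app_plain(); it merely creates a generator object.
    body = read()
    return body


def app_cooperating():
    # This works, but only because the caller is itself a generator
    # that explicitly delegates to read().
    body = yield from read()
    return body


def drive(gen, data):
    # Toy "server": run the generator to its first pause, then resume
    # it with the data it was waiting for and collect its result.
    request = next(gen)
    try:
        gen.send(data)
    except StopIteration as stop:
        return request, stop.value
    raise RuntimeError("generator paused a second time")
```

Every function between the server and read() would need the same treatment, which is exactly what an unmodified WSGI application cannot provide.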
(I don't fault WSGI for not working easily in asynchronous servers; it's hard to design general interfaces that do, and they're not very natural for synchronous servers. WSGI is sensibly designed for the relatively common case.)