2006-06-26
WSGI versus asynchronous servers
Asynchronous servers and frameworks are a popular way to create highly scalable systems. Although WSGI isn't explicitly designed to support them, putting a WSGI application in an asynchronous server isn't totally foolish: many WSGI applications won't be doing anything that can block.
(Technically disk IO can block, but Python on Unix doesn't have any way to do asynchronous disk IO without using threads.)
However, there is one serious fly in the ointment: the WSGI spec
requires a synchronous interface for reading the HTTP request body. You
get it from wsgi.input, which is specified to be a file-like object.
The spec suggests one way around this: the WSGI server can read the request body from the network (doing so asynchronously) and buffer it all up before invoking the WSGI application. I'm not very fond of this because it makes defending against certain sorts of denial of service attacks much more difficult, as the WSGI server has no idea what the size and time limits of the WSGI application are.
(For example, DWiki rejects all POSTs over 64K without even trying to read them.)
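When the application reads wsgi.input itself, that sort of defense is only a few lines. A hedged sketch (the limited_app name, the limit, and the responses are all made up here for illustration):

    MAX_BODY = 64 * 1024     # illustrative limit, echoing DWiki's 64K

    def limited_app(environ, start_response):
        # Refuse oversized POSTs before reading any of the body.
        try:
            clen = int(environ.get('CONTENT_LENGTH') or 0)
        except ValueError:
            clen = 0
        if clen > MAX_BODY:
            start_response('413 Request Entity Too Large',
                           [('Content-Type', 'text/plain')])
            return ['request body too large\n']
        body = environ['wsgi.input'].read(clen)
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return ['read %d bytes\n' % len(body)]

None of this helps if the WSGI server has already buffered the whole body before calling you.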
This may seem nit-picky, but building resilient servers is already hard enough that I'm nervous about adding more obstacles.
This is one of those situations when continuations or coroutines would
be pretty handy; the wsgi.input object could use one or the other to
put the entire WSGI application to sleep until more network input showed
up. (Python's yield-based coroutines aren't good enough because they
only work with direct function calls; the wsgi.input.read() method
function can't use yield to pop all the way back to the WSGI server.)
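A tiny demonstration of the limitation: yield suspends only the function it textually appears in, so a nested call can't put its caller (never mind the whole WSGI stack) to sleep.

    def read_some():
        # yield suspends read_some() itself and nothing else
        yield 'waiting for network input'

    def application():
        # We'd like this call to put application() and everything
        # above it to sleep; instead it just hands back a generator
        # object immediately.
        result = read_some()
        print(type(result))     # a generator, not the data we wanted

    application()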
(I don't fault WSGI for not working easily in asynchronous servers; it's hard to design general interfaces that do, and they're not very natural for synchronous servers. WSGI is sensibly designed for the relatively common case.)
2006-06-24
A problem with signals in Python
As a followup to what I wrote yesterday about prompt signal handling in programs, it's worthwhile to point out a little problem in Python with signal handling.
As I've noted before, Python normally turns many signals into exceptions. It turns out that this has an important consequence: it delays processing of signals, because Python only processes signal exceptions in the interpreter, ie when you're running Python bytecodes.
(Contrary to the documentation, sys.setcheckinterval does not appear
to control how often signal handlers are run; the CPython code that
handles signals overrides the check interval value to force an immediate
check.)
This matters because Python can't abort long operations being done in low-level modules, like looking up hostnames and making network connections. When you hit ^C during one of these, all that happens is that Python notes that it's got a pending signal; only when the operation finishes and control returns to the Python interpreter does the signal take effect.
(Python can't abort the operation because there's no way to clean up whatever internal state C code may have (especially in libraries). Just blowing away the operation without cleaning up this state may leave data structures corrupt and cause all sorts of problems later. This isn't specifically a Python issue; portable code can't really do very much more in signal handlers than set a flag.)
If you don't need to catch signals at all, the best way to fix this problem is to set them to SIG_DFL using the recipe in my earlier entry. If you do need to catch the signals to clean stuff up, unfortunately there's no good way out.
Embarrassingly, I've been caught out by this in several of the little
tools I use because it's always seemed like too much hassle to put in
the signal dance, and after all I didn't care if ^C got me an exception
puke instead of a quiet death. It's tempting to make a module that does
it all for me, so I can just put 'import unixy' or the like at the start
of my little programs and have it just work right.
(Well, it'd take a bit more than just an import unless I put the module somewhere on the standard module search path. Details, details.)
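The hypothetical unixy module barely needs any code; a sketch (which signals to reset is program-specific, so this list is a guess):

    # unixy.py: 'import unixy' early to get quiet, traditional Unix
    # behavior for signals this program doesn't need to catch.
    import signal

    # The signal list here is a guess; adjust per program.
    for _sig in (signal.SIGINT, signal.SIGQUIT, signal.SIGPIPE):
        signal.signal(_sig, signal.SIG_DFL)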
2006-06-19
WSGI: the good and the bad
WSGI is the Python 'Web Server Gateway Interface', which is a standard interface between web servers and Python web applications. The idea is that you write your app as a WSGI app, then glue it to the web interfaces of your choice: CGI-BIN, SCGI, a standalone testing web server, whatever. The WSGI stuff encourages a stacking model, where modular middleware can be transparently inserted above applications to handle various more or less generic things.
(Ian Bicking gives examples of a bunch of WSGI middleware here.)
A while back I converted DWiki into a WSGI application, and in the process I built up some opinions about the good and the bad of WSGI.
The good: the ideas. A generic execution environment for web apps is very nice, and the 'stack of simple pieces' approach to apps is a powerful and addictive idea. Plus, WSGI gateways are relatively easy to write, and for apps the interface is pretty simple.
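As an illustration of the application side, a complete (if useless) WSGI app is just:

    def simple_app(environ, start_response):
        # environ: a dict of CGI-like variables plus wsgi.* keys;
        # start_response: takes a status line and a header list.
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return ['Hello from WSGI\n']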
The bad: the complex implementation, which imposes what I call the 'WSGI tax'.
The WSGI tax exists because WSGI had to fit into several existing web server environments, all different, in order to get people to adopt it. To cope with all of them, the full WSGI protocol has a bunch of complex requirements, and general WSGI servers (including all middleware, since it acts as a server for the layer below it) have to support all of them. Not only does this require a pile of cookie-cutter code in each middleware component, but the requirements significantly complicate what you can do when.
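Even a do-nothing layer hints at the shape of the tax; this pass-through sketch quietly skips things a conforming layer also has to handle (the optional write() callable, exc_info in start_response, calling close() on the result iterable):

    class PassThrough:
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            # The honest version can't just delegate like this; it
            # has to be prepared to interpose on start_response and
            # on the result iterable to meet the full spec.
            return self.app(environ, start_response)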
I really like the idea of stackable little bits, and I've been very happy with using it heavily in the post-conversion DWiki. But the WSGI tax is too much for me, so DWiki uses a much simpler internal protocol for its stackable layers and appears as a monolithic WSGI app to the outside world.
(For scale, a typical DWiki stackable layer is ten to twenty lines of
Python. The smallest is three lines, not counting the def; another is
five.)
2006-06-05
A Python coding mistake
There is a world of difference between '"string" in txt' and just
'"string"'. It doesn't even really jump out as visually wrong to me
if it's in a large if condition, which probably means the condition
is big enough that it should be reorganized so that errors are more
obvious, or at least harder to make.
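A made-up example of how easily it hides in a compound condition:

    lines = ["all systems normal", "error: disk is on fire"]
    for line in lines:
        # Meant to be: '"error" in line and "fatal" in line'. The
        # second operand is a bare constant string, which is always
        # true, so the test silently degenerates to '"error" in line'.
        if "error" in line and "fatal":
            print("fatal: " + line)   # fires for any error at all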
(Repetitive code is especially prone to this sort of slip-up for me. Unfortunately, one of my bad coding habits is quietly growing into such code by accident.)
Interestingly enough, pychecker doesn't produce any warnings for this. This isn't its fault as such; although it tries to look for constant conditions like this, it works from the compiled bytecodes, and since Python 2.3 the bytecode compiler optimizes constant conditions away before pychecker ever sees them.
Pylint didn't detect it either, although it complained about a lot of other things by default (enough that I can't imagine using it on a real project, especially since I use tabs for indentation).
(This exact mistake accidentally shut down commenting here for about half a day. My apologies to anyone caught in the accident, which would have given peculiar error messages.)
2006-06-02
An object identity gotcha in Python
Consider the problem of implementing a modified LIFO stack that has an additional O(1) operation called min, which gives back the smallest item in the stack. For now, let's pretend that Python lists are O(1) (or that someone has implemented a native stack class); in many respects they are as far as Python code is concerned.
The simple implementation is to keep two stacks, the main stack and one that you only push new minimum items onto. Then push and pop are something like this (omitting error checking and so on):
    def push(self, itm):
        self.stk.append(itm)
        if not self.minstk or \
           itm < self.minstk[-1]:
            self.minstk.append(itm)

    def pop(self):
        top = self.stk.pop()
        if top is self.minstk[-1]:
            self.minstk.pop()
        return top
Unfortunately, this code has a small bug: it detonates if you push the current minimum object onto the stack a second time. This might be a pernicious bug that lingers unnoticed for some time, since normal Python usage for something like this will probably have entirely distinct objects.
(The quick fix is to use '<=' in push instead of just '<',
which does grow the minstk stack a bit more in some circumstances.)
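To make the detonation concrete, here's the push and pop above wrapped in a hypothetical MinStack class and fed an interned small integer twice:

    class MinStack:
        def __init__(self):
            self.stk = []
            self.minstk = []

        def push(self, itm):
            self.stk.append(itm)
            if not self.minstk or \
               itm < self.minstk[-1]:
                self.minstk.append(itm)

        def pop(self):
            top = self.stk.pop()
            if top is self.minstk[-1]:
                self.minstk.pop()
            return top

    s = MinStack()
    s.push(5)
    s.push(5)   # 5 < 5 is false, so minstk doesn't grow
    s.pop()     # but 'top is minstk[-1]' is true for the same interned
                # int, so minstk is wrongly popped here
    s.pop()     # IndexError: minstk is already empty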
The really tricky bit about this is that Python will sometimes give you duplicate items behind your back. For example, all of the small integers from -5 to 99 are currently interned by Python; any use of one (including one you get through calculations) gives you back the same object.
This obviously only happens for immutable objects, but when it happens is implementation defined (and thus can change from Python version to Python version). It's definitely something to bear in mind when writing generic code that uses object identity.
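A quick way to watch it happen on CPython (the exact cutoff is an implementation detail):

    a = int("7")         # computed at run time to sidestep any
    b = int("7")         # compile-time constant sharing
    print(a is b)        # True on CPython: small ints are interned

    c = int("100000")
    d = int("100000")
    print(c is d)        # False on CPython: larger ints are distinct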