2011-08-26
The problem for WSGI middleware
I've recently come to a (belated) realization about WSGI.
Looked at from the right perspective, WSGI is a simple protocol hiding inside a complex one (or you could say that it's designed to make simple things relatively simple and complex things possible). Almost all WSGI applications are written to the simple protocol and I think that many server implementations effectively are as well, especially ones written in specialized circumstances for private use.
This is great for application writers; most application writers only need to do simple things that fit nicely into the simple protocol. It's okay for server implementors who have relatively simple needs, since they can mostly copy from the sample PEP implementation and forget about it.
(It's not so great for server implementors who really do need the complex stuff; not only do they have to implement it, but they may find all sorts of other people's WSGI code that chokes on it.)
But it's terrible for middleware developers, because they never get to deal with only the simple subset of WSGI. At least in theory, middleware has to support the full complexity of the WSGI protocol including all of the odd and basically never used bits, because it might someday be used in a sufficiently complex and perverse environment. Then much of that code sits there unused almost all of the time, because after all most of the time WSGI is used in simple ways. The frequent result is that either people don't write middleware at all or they write incomplete middleware that only works in some WSGI environments. Sometimes they write incomplete middleware knowingly and deliberately, because it's all they need, and sometimes they write incomplete middleware without realizing it.
(PJ Eby has recently said that basically all WSGI middleware libraries he's looked at have bugs handling sufficiently perverse but legal WSGI environments.)
(I've written about this issue before from a somewhat different perspective in WSGIGoodBad. There I was focused more on the intrinsic complexity that middleware has even when used in the simple WSGI environment, partly because at the time I didn't understand how complex WSGI could get. WSGI middleware effectively has two sorts of complexity to deal with, but that's another entry.)
Sidebar: where I think middleware complexity lives
My intuition is that there are three general cases for middleware, in order of increasing complexity. The simplest case is middleware that either intercepts the request, acting as an application, or passes it down unmodified. The middle case is middleware that modifies the request before passing it down to the next layer. The most complex and problematic case is middleware that must modify the response as it comes back, especially if it has to do something before the request is passed down.
Oh, and only request headers are particularly easy to modify. If you
need to modify the request body for a POST request, well, I think
you have a bunch of pain coming.
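To make the simplest two cases concrete, here is a minimal sketch of request-modifying middleware. The names (RemoteUserMiddleware, demo_app) are illustrative inventions, not from any real library; the point is that when you only touch the request environ and delegate, the response path stays untouched and the middleware stays simple.

```python
def demo_app(environ, start_response):
    # A trivial WSGI application that reports who the request is from.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    user = environ.get('REMOTE_USER', 'anonymous')
    return [('hello ' + user).encode('ascii')]

class RemoteUserMiddleware(object):
    # The "middle case": modify the request, then delegate.  The
    # response passes through untouched, which is what keeps this
    # sort of middleware relatively easy to get right.
    def __init__(self, app, user):
        self.app = app
        self.user = user

    def __call__(self, environ, start_response):
        environ['REMOTE_USER'] = self.user
        return self.app(environ, start_response)
```

The hard case (rewriting the response body) gets none of this simplicity, because the application may stream its body, use write(), or send headers lazily.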
2011-08-18
You should always use super()
I've done it (and now I'm embarrassed by
that). I've seen other people do it by accident
(to pick something I was recently looking at). And yes, I know,
using super() is kind of annoying (Python 3
makes it much nicer). But still: you should always use super()
instead of directly calling up to your parent class.
(It's mostly a coincidence that both examples I gave here are metaclasses, except that a metaclass is one of the few cases in Python where you have to call up to a built-in type.)
It may not be obvious to people what problems not using super()
causes, so I'll give an example in the traditional illustrated
form:
class Meta1(type):
    def __new__(meta, name, bases, odict):
        print "meta1 new"
        return type.__new__(meta, name, bases, odict)

class Meta2(type):
    def __new__(meta, name, bases, odict):
        print "meta2 new"
        return type.__new__(meta, name, bases, odict)

class Joined(Meta1, Meta2):
    pass

class Testing(object):
    __metaclass__ = Joined
    def frotz(self):
        print "frotzing", self
Imagine that you have two metaclasses, each of which does something
useful to your class. You want to combine them, getting both effects,
so you create the Joined metaclass to multi-inherit from them. But
when you try it you discover that despite the multi-inheritance, only
one of them is actually taking effect; in this example, you'll only
see 'meta1 new' printed when you want to see both that and 'meta2
new'. Consistently using super() avoids this problem and makes sure
that all metaclasses get invoked (in what is considered the right order).
A similar situation happens any time more than one class is modifying
a single method.
(Metaclasses are especially prone to this because they're one of the cases where several classes commonly need to modify the exact same method. Many other cases of multiple inheritance and mixin classes have the additional classes overriding disjoint methods.)
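Recast in Python 3 syntax (where super() takes no arguments and metaclasses are declared with the metaclass= keyword), the fix is to have each metaclass call up via super() instead of going directly to type. Here the prints are replaced by appends to a list so the call order is easy to check:

```python
calls = []

class Meta1(type):
    def __new__(meta, name, bases, odict):
        calls.append("meta1 new")
        # super() walks the MRO of the actual metaclass (Joined),
        # so this reaches Meta2.__new__ instead of skipping straight
        # to type.__new__.
        return super().__new__(meta, name, bases, odict)

class Meta2(type):
    def __new__(meta, name, bases, odict):
        calls.append("meta2 new")
        return super().__new__(meta, name, bases, odict)

class Joined(Meta1, Meta2):
    pass

class Testing(metaclass=Joined):
    pass
```

With super() used consistently, creating Testing runs both metaclass __new__ methods in MRO order: first Meta1's, then Meta2's.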
The need for super() is interesting partly because it's one of the few
places where a Python class can easily close itself to later outside
modification. Python classes are in general famously 'open'; outside
people can reach into them to change things, forcefully subclass
them, and otherwise do things that the original
class might object to (and that other languages allow the original
class to prevent). But a class that fails to use super() (whether
through neglect or deliberate choice) can't be safely used in a multiple
inheritance situation; you can't use it as a mixin or mix other things
in with it. And it's not something that is at all easily fixed from
outside, since you can't just straightforwardly subclass the class and
replace the particular errant method.
(If you have a single non-super()-using class, you can
sometimes manage to manipulate the method resolution order so that the 'bad' class is the last class and so
things work. Among other things, this is quite fragile.)
2011-08-16
An interesting way to shoot yourself in the foot and a limitation of super()
Presented in the traditional illustrated form, for Python 2:
class Base(object):
    def foo(self):
        print "base foo"

class A(Base):
    def foo(self):
        print "A foo"
        return super(A, self).foo()

class B(object):
    def foo(self):
        print "B foo"

old_A = A
A = B
old_A().foo()
(If you think that this is a completely silly and crazy example, consider shimming modules for tests where you might need to transparently wrap another module's class.)
If you are a normal innocent Python programmer, this probably looks
like it ought to work; after all, you're using super() exactly as
the instructions tell you to. If you run it, though, you will get an
interesting error:
TypeError: super(type, obj): obj must be an instance or subtype of type
In many languages, the 'super(A, self)' would have been what I'll call
'early binding'; in the context of foo(), the name 'A' would have
been immediately bound to the class itself. Renaming the class later
(if it's even possible) would make no difference to the line, because
the binding had already been established. Python doesn't work that
way. The binding of 'A' to a class is only looked up as foo()'s code
is executing, and by that time 'A' points to the wrong thing; it points
to the completely unrelated class B, and so the call is actually doing
'super(B, self)' (hence the error message). You can see this directly
by using the dis module to
inspect the bytecode for foo; it contains a LOAD_GLOBAL to look up
the value of A, instead of any direct reference to the class.
In fact Python can't work that way. Because of how class definition
works combined with how functions are defined, the name A does not exist when the (method)
function foo() is being defined and its code is being compiled, so
even if Python wanted to do early binding there's nothing to bind
to. This late binding is the only choice Python has.
This has an unfortunate consequence; it makes using super() one of the
few places in Python 2 where you absolutely have to know what your class
is called (well, have some name for it), where by 'your class' I mean
the class that the code is in (as opposed to the class of self, which
is easy to determine). Of course, normally this is not an issue.
(super() needs to know this because it has to find where you are in
the method resolution order, which requires
knowing what class the code is in.)
PS: this issue with super() is fixed in Python 3 through methods
beyond the scope of this entry (see here
if you really want to see the sausage being made).
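A short demonstration of the Python 3 behavior: zero-argument super() compiles to a reference to a hidden __class__ cell that points at the class object itself, not at the global name, so rebinding the name no longer breaks anything. (The methods return strings instead of printing so the result is easy to check.)

```python
class Base:
    def foo(self):
        return "base foo"

class A(Base):
    def foo(self):
        # Zero-argument super() uses the compiler-provided __class__
        # cell, which refers to this class object directly rather
        # than looking up the global name 'A' at call time.
        return "A foo, then " + super().foo()

class B:
    def foo(self):
        return "B foo"

old_A = A
A = B                    # rebinding the name is now harmless
result = old_A().foo()   # "A foo, then base foo"
```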
2011-08-07
What I want out of a Symbol type in Python
Back in OptionalArgumentsIssue, I wrote that special unique sentinel values were one of the cases where it would be nice to have a real symbol type in Python (ideally including syntactic support). This raises the question of how such a type would behave and what sort of special support it needs.
Unfortunately, what I really want requires language support. Without it, the best we can do is something like this:
>>> no_arg = Symbol("no_arg")
>>> print no_arg
Symbol('no_arg')
>>> no_arg is Symbol("no_arg")
False
This gives us unique named objects and lets us dump the symbol's name (if we gave it one) to help with debugging, but that's it. What we really want is not to have to repeat the symbol's name when we create it and for symbols to be unique in the current module; ideally there would be a less verbose syntax for creating symbols, too.
(We don't want named symbols to be globally unique, because that risks
confusing my module's no_arg symbol with your module's no_arg
symbol; implicit global namespaces are a bad idea. At the same time
fully unique named symbols are somewhat absurd, as illustrated here.
Module unique symbols are a reasonable compromise since we can say they
correspond reasonably well to module level variables.)
I believe that in theory we can create module unique named symbols and mostly avoid repetition in pure Python code. However it would require relatively ugly hacks behind the back of CPython, hacks that would surprise people reading code that used symbols and not be fully reliable. To do this properly you really need interpreter support, and that should really come with special syntax so that people understand how special these symbols are.
(Unfortunately I'm not sure that Ruby's :name syntax for symbols would
work given how Python syntax already uses : in various places.)
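One way to sketch the module-unique approach in pure Python is to intern symbols in a registry keyed by the caller's module name, found by peeking at the calling frame with sys._getframe(). This is exactly the sort of fragile, behind-CPython's-back introspection hack alluded to above (it breaks under exec, some wrappers, and non-CPython implementations), so treat it as an illustration rather than a recommendation:

```python
import sys

_registry = {}

class Symbol(object):
    __slots__ = ('name', 'module')

    def __new__(cls, name):
        # Peek at the caller's frame to discover its module name;
        # this is the fragile introspection hack mentioned above.
        module = sys._getframe(1).f_globals.get('__name__', '?')
        key = (module, name)
        sym = _registry.get(key)
        if sym is None:
            sym = super(Symbol, cls).__new__(cls)
            sym.name = name
            sym.module = module
            _registry[key] = sym
        return sym

    def __repr__(self):
        return "<Symbol %s in %s>" % (self.name, self.module)
```

With interning, two creations of the same name in the same module hand back the identical object, so 'Symbol("no_arg") is Symbol("no_arg")' is True, while the same name in a different module would be a different symbol.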
Sidebar: a really basic Symbol class
We can construct a really basic Symbol type fairly simply:
class Symbol(object):
    __slots__ = ('name',)
    def __init__(self, name=None):
        self.name = name
    def __repr__(self):
        if not self.name:
            return super(Symbol, self).__repr__()
        return "<Symbol: %s>" % self.name
    def __str__(self):
        if not self.name:
            return super(Symbol, self).__str__()
        return "Symbol('%s')" % self.name
In this version, named symbols are fully unique.
2011-08-03
How I encode and decode the milter protocol (or, how to write a codec for a sane binary protocol)
After all of my worrying and investigation of modules to handle protocols for me, I wound up writing an encoder and a decoder for sendmail's milter protocol by hand because it was the simplest way. This is because the milter protocol is a sane binary protocol and it turns out that there's a straightforward way (at least in a dynamic language like Python) to write a codec for such a protocol.
As a sane binary protocol, the milter protocol starts with a packet format:
uint32  len
char    cmd
char    data[len-1]
The cmd byte is the message type, which determines the structure of
data. Each message has a fixed structure; there are some fixed number
of fields, each of which is one of a small number of primitive field
types. (All of this is what you'd expect for a sane binary protocol.)
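Reading one packet off the wire needs no knowledge of message structure at all; a sketch (assuming network byte order for len, and with the error handling kept minimal):

```python
import struct

def read_packet(stream):
    # Read one packet in the framing described above: a 4-byte
    # length (network byte order assumed), then a command byte and
    # len-1 bytes of data.  Returns (cmd, data) as bytes.
    raw = stream.read(4)
    if len(raw) < 4:
        raise EOFError("short read on packet length")
    (length,) = struct.unpack(">I", raw)
    body = stream.read(length)
    if len(body) < length:
        raise EOFError("short read on packet body")
    return body[:1], body[1:]
```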
First you need an encoder and a decoder for every primitive field type and some way that you can find them given a field type. I wound up with eight types for the milter protocol, although you could do it with seven, and I used a simple mapping dict:
codectypes = {
    'buf': (encode_buf, decode_buf),
    'u16': (encode_u16, decode_u16),
    ....
}
(Many of these routines were slight variants on each other; with the right support routines, actual encoders and decoders were mostly two lines per type. In the end I opted not to play fancy tricks with namespaces, partly because I like having simple two-line functions.)
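As a hedged illustration of what such primitives can look like (these are my sketches, not the actual milter code), the key design decision is that each decoder returns both the decoded value and the remaining bytes, so decoders chain naturally:

```python
import struct

def encode_u16(v):
    return struct.pack(">H", v)

def decode_u16(data):
    # Decoders return (value, remaining bytes) so they can be
    # chained one after another over a message body.
    return struct.unpack(">H", data[:2])[0], data[2:]

def encode_str(s):
    # A NUL-terminated string; the terminator is part of the
    # encoded form.
    return s + b"\0"

def decode_str(data):
    i = data.index(b"\0")
    return data[:i], data[i+1:]

codectypes = {
    'u16': (encode_u16, decode_u16),
    'str': (encode_str, decode_str),
}
```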
With the field types defined, we can now define each message as a sequence of named fields, each with a type. Again there are lots of ways to encode this data and I used brute force:
messages = {
    SMFIC_HEADER: (('name', 'str'),
                   ('value', 'str')),
    ....
}
To decode a message you first read the entire packet (which you can do
without knowing anything about the message's structure), then look up
the cmd in the messages table to determine the message structure.
For each field in the message, you decode an item of the given primitive
type and store it under the field name; at the end of decoding, you
should have nothing left un-decoded in data (and you should not have
run out). You return the cmd byte and a dictionary of all of the
fields.
Encoding is the inverse process. You are given the cmd byte and a
message dictionary. You look up the message structure in messages,
then walk the list; for each named field, you extract its value from the
dictionary, encode it as the given primitive type, and concatenate the
resulting raw bytes to your data. When the message is fully encoded,
you determine len and wrap the whole thing up as a packet.
(My implementation of encoding took this a step further in laziness by
using keyword arguments to the encoding function to create the message
dictionary; you invoke it as encode_msg(SMFIC_HEADER, name="foo",
value="bar").)
This requires minimal code and the code it does need is mostly generic.
The actual process of encoding and decoding is data-driven; the protocol
itself is basically specified in the messages dictionary, and adding
new messages is trivial as long as they use existing primitive field
types. Repeated boilerplate code is basically completely eliminated.
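Putting the pieces together, here is a self-contained sketch of the table-driven scheme; the codec details and the SMFIC_HEADER value are illustrative stand-ins rather than the real milter implementation:

```python
import struct

def encode_str(s):
    return s + b"\0"

def decode_str(data):
    i = data.index(b"\0")
    return data[:i], data[i+1:]

codectypes = {'str': (encode_str, decode_str)}

SMFIC_HEADER = b"L"
messages = {
    SMFIC_HEADER: (('name', 'str'), ('value', 'str')),
}

def encode_msg(cmd, **fields):
    # Walk the field list, encoding each named field, then wrap the
    # result as a packet (len covers the cmd byte plus the data).
    data = b"".join(codectypes[ftype][0](fields[name])
                    for name, ftype in messages[cmd])
    return struct.pack(">I", len(data) + 1) + cmd + data

def decode_msg(cmd, data):
    # Walk the same field list in the other direction; anything
    # left over afterwards means the message was malformed.
    fields = {}
    for name, ftype in messages[cmd]:
        fields[name], data = codectypes[ftype][1](data)
    if data:
        raise ValueError("trailing bytes left after decoding")
    return fields
```

Note how adding a new message is just one more entry in the messages dictionary; neither encode_msg nor decode_msg changes.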
(This requires a dynamic language partly because it heavily relies on polymorphic argument handling and the ability to ship values around without the intermediate generic encoding and decoding layers having to care what type they are. If you had to do the usual strict static typing, you'd probably need a separate encoding function for each message and I'm not sure how you'd handle decoding.)
On a side note, this means that I need to take back some of the nasty things I said about the milter protocol back a year ago. Particularly, it does not have messages with a variable number of message fields. (I misread that part of the specification earlier.)