Wandering Thoughts archives

2011-08-26

The problem for WSGI middleware

I've recently come to a (belated) realization about WSGI.

Looked at from the right perspective, WSGI is a simple protocol hiding inside a complex one (or you could say that it's designed to make simple things relatively simple and complex things possible). Almost all WSGI applications are written to the simple protocol and I think that many server implementations effectively are as well, especially ones written in specialized circumstances for private use.

This is great for application writers; most application writers only need to do simple things that fit nicely into the simple protocol. It's okay for server implementors who have relatively simple needs, since they can mostly copy from the sample PEP implementation and forget about it.

(It's not so great for server implementors who really do need the complex stuff; not only do they have to implement it, but they may find all sorts of other people's WSGI code that chokes on it.)

But it's terrible for middleware developers, because they never get to deal with only the simple subset of WSGI. At least in theory, middleware has to support the full complexity of the WSGI protocol including all of the odd and basically never used bits, because it might someday be used in a sufficiently complex and perverse environment. Then much of that code sits there unused almost all of the time, because after all most of the time WSGI is used in simple ways. The frequent result is that either people don't write middleware at all or they write incomplete middleware that only works in some WSGI environments. Sometimes they write incomplete middleware knowingly and deliberately, because it's all they need, and sometimes they write incomplete middleware without realizing it.

(PJ Eby has recently said that basically all WSGI middleware libraries he's looked at have bugs handling sufficiently perverse but legal WSGI environments.)

(I've written about this issue before from a somewhat different perspective in WSGIGoodBad. There I was focused more on the intrinsic complexity that middleware has even when used in the simple WSGI environment, partly because at the time I didn't understand how complex WSGI could get. WSGI middleware effectively has two sorts of complexity to deal with, but that's another entry.)

Sidebar: where I think middleware complexity lives

My intuition is that there are three general cases for middleware, in order of increasing complexity. The simplest case is middleware that either intercepts the request, acting as an application, or passes it down unmodified. The middle case is middleware that modifies the request before passing it down to the next layer. The most complex and problematic case is middleware that must modify the response as it comes back, especially if it has to do something before the request is passed down.

Oh, and only request headers are particularly easy to modify. If you need to modify the request body for a POST request, well, I think you have a bunch of pain coming.
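The simple and middle cases above can be sketched in a few lines. Everything here is illustrative (the header name and app are made up, not from any real codebase): the middleware copies the environ, adds a request header, and passes the request down otherwise untouched.

```python
def add_header_middleware(app):
    # the middle case: modify the request (environ) before passing it
    # down to the next layer, otherwise untouched
    def wrapper(environ, start_response):
        environ = dict(environ)               # don't mutate the caller's dict
        environ["HTTP_X_EXAMPLE"] = "1"       # hypothetical added header
        return app(environ, start_response)
    return wrapper

def demo_app(environ, start_response):
    # a trivial application that echoes the added header back
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [environ.get("HTTP_X_EXAMPLE", "missing").encode("ascii")]

wrapped = add_header_middleware(demo_app)
```

Note that even this simple case quietly assumes the application returns a plain iterable and never uses the write() callable or other exotic parts of WSGI, which is exactly the incompleteness discussed above.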

WSGIMiddlewareProblem written at 01:17:56

2011-08-18

You should always use super()

I've done it (and now I'm embarrassed by that). I've seen other people do it by accident (to pick something I was recently looking at). And yes, I know, using super() is kind of annoying (Python 3 makes it much nicer). But still: you should always use super() instead of directly calling up to your parent class.

(It's mostly a coincidence that both examples I gave here are metaclasses, except that a metaclass is one of the few cases in Python where you have to call up to a built-in type.)

It may not be obvious to people what problems not using super() causes, so I'll give an example in the traditional illustrated form:

class Meta1(type):
  def __new__(meta, name, bases, odict):
    print "meta1 new"
    # calls type() directly instead of using super()
    return type(name, bases, odict)

class Meta2(type):
  def __new__(meta, name, bases, odict):
    print "meta2 new"
    # ditto: this bypasses the rest of the MRO
    return type(name, bases, odict)

class Joined(Meta1, Meta2):
  pass

class Testing(object):
  __metaclass__ = Joined
  def frotz(self):
    print "frotzing", self

Imagine that you have two metaclasses, each of which does something useful to your class. You want to combine them, getting both effects, so you create the Joined metaclass to multi-inherit from them. But when you try it you discover that despite the multi-inheritance, only one of them is actually taking effect; in this example, you'll only see 'meta1 new' printed when you want to see both that and 'meta2 new'. Consistently using super() avoids this problem and makes sure that all metaclasses get invoked (in what is considered the right order). A similar situation happens any time more than one class is modifying a single method.

(Metaclasses are especially prone to this because they're one of the cases where several classes commonly need to modify the exact same method. Many other cases of multiple inheritance and mixin classes have the additional classes overriding disjoint methods.)
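Rewritten so that every metaclass cooperatively calls super(), both effects happen. This sketch uses Python 3 syntax (zero-argument super() and the metaclass= keyword); the order list just records what ran so you can see both metaclasses take effect, in MRO order.

```python
order = []

class Meta1(type):
    def __new__(meta, name, bases, odict):
        order.append("meta1 new")
        # cooperatively delegate to the next class in the MRO
        return super().__new__(meta, name, bases, odict)

class Meta2(type):
    def __new__(meta, name, bases, odict):
        order.append("meta2 new")
        return super().__new__(meta, name, bases, odict)

class Joined(Meta1, Meta2):
    pass

class Testing(metaclass=Joined):
    pass

# creating Testing runs Meta1.__new__, which delegates to Meta2.__new__,
# which delegates to type.__new__
```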

The need for super() is interesting partly because it's one of the few places where a Python class can easily close itself to later outside modification. Python classes are in general famously 'open'; outside people can reach into them to change things, forcefully subclass them, and otherwise do things that the original class might object to (and that other languages allow the original class to prevent). But a class that fails to use super() (whether through neglect or deliberate choice) can't be safely used in a multiple inheritance situation; you can't use it as a mixin or mix other things in with it. And it's not something that is at all easily fixed from outside, since you can't just straightforwardly subclass the class and replace the particular errant method.

(If you have a single non-super()-using class, you can sometimes manage to manipulate the method resolution order so that the 'bad' class is the last class and so things work. Among other things, this is quite fragile.)

AlwaysUseSuper written at 00:54:32

2011-08-16

An interesting way to shoot yourself in the foot and a limitation of super()

Presented in the traditional illustrated form, for Python 2:

class Base(object):
  def foo(self):
    print "base foo"

class A(Base):
  def foo(self):
    print "A foo"
    return super(A, self).foo()

class B(object):
  def foo(self):
    print "B foo"

old_A = A
A = B      # rebind the name A to an unrelated class

old_A().foo()

(If you think that this is a completely silly and crazy example, consider shimming modules for tests where you might need to transparently wrap another module's class.)

If you are a normal, innocent Python programmer, this probably looks like it ought to work; after all, you're using super() exactly as the instructions tell you to. If you run it, though, you will get an interesting error:

TypeError: super(type, obj): obj must be an instance or subtype of type

In many languages, the 'super(A, self)' would have been what I'll call 'early binding'; in the context of foo(), the name 'A' would have been immediately bound to the class itself. Renaming the class later (if it's even possible) would make no difference to the line, because the binding had already been established. Python doesn't work that way. The binding of 'A' to a class is only looked up as foo()'s code is executing, and by that time 'A' points to the wrong thing; it points to the completely unrelated class B, and so the call is actually doing 'super(B, self)' (hence the error message). You can see this directly by using the dis module to inspect the bytecode for foo; it contains a LOAD_GLOBAL to look up the value of A, instead of any direct reference to the class.
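This is easy to confirm with dis (shown here in Python 3 syntax; the classes are just a minimal reconstruction of the example, not new material): the compiled bytecode of foo() contains no direct reference to the class object, only a by-name global lookup of 'A'.

```python
import dis

class Base(object):
    def foo(self):
        return "base foo"

class A(Base):
    def foo(self):
        return super(A, self).foo()

# 'A' is fetched from the module globals each time foo() runs,
# which is what makes the rebinding trick above blow up
opnames = [i.opname for i in dis.get_instructions(A.foo)]
```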

In fact Python can't work that way. Because of how class definition works combined with how functions are defined, the name A does not exist when the (method) function foo() is being defined and its code is being compiled, so even if Python wanted to do early binding there's nothing to bind to. This late binding is the only choice Python has.

This has an unfortunate consequence; it makes using super() one of the few places in Python 2 where you absolutely have to know what your class is called (well, have some name for it), where by 'your class' I mean the class that the code is in (as opposed to the class of self, which is easy to determine). Of course, normally this is not an issue.

(super() needs to know this because it has to find where you are in the method resolution order, which requires knowing what class the code is in.)

PS: this issue with super() is fixed in Python 3 through methods beyond the scope of this entry (see here if you really want to see the sausage being made).
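For the record, here is the same shoot-yourself-in-the-foot sequence in Python 3, where zero-argument super() works even after the rebinding (the class names mirror the Python 2 example above; the string results are just for checkability):

```python
class Base:
    def foo(self):
        return "base foo"

class A(Base):
    def foo(self):
        # zero-argument super(): the compiler stores the defining class
        # in a hidden __class__ cell, so no global name lookup happens
        return "A foo, then " + super().foo()

class B:
    def foo(self):
        return "B foo"

old_A = A
A = B                    # rebind the name, as before

result = old_A().foo()   # no TypeError this time
```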

LateBindingSuper written at 00:12:15

2011-08-07

What I want out of a Symbol type in Python

Back in OptionalArgumentsIssue, I wrote that special unique sentinel values were one of the cases where it would be nice to have a real symbol type in Python (ideally including syntactic support). This raises the question of how such a type would behave and what sort of special support it needs.

Unfortunately, what I really want requires language support. Without it, the best we can do is something like this:

>>> no_arg = Symbol("no_arg")
>>> print no_arg
Symbol('no_arg')
>>> no_arg is Symbol("no_arg")
False

This gives us unique named objects and lets us dump the symbol's name (if we gave it one) to help with debugging, but that's it. What we really want is not to have to repeat the symbol's name when we create it, and for symbols to be unique within the current module; ideally there would be a less verbose syntax for creating symbols, too.

(We don't want named symbols to be globally unique, because that risks confusing my module's no_arg symbol with your module's no_arg symbol; implicit global namespaces are a bad idea. At the same time fully unique named symbols are somewhat absurd, as illustrated here. Module unique symbols are a reasonable compromise since we can say they correspond reasonably well to module level variables.)

I believe that in theory we can create module unique named symbols and mostly avoid repetition in pure Python code. However it would require relatively ugly hacks behind the back of CPython, hacks that would surprise people reading code that used symbols and not be fully reliable. To do this properly you really need interpreter support, and that should really come with special syntax so that people understand how special these symbols are.

(Unfortunately I'm not sure that Ruby's :name syntax for symbols would work given how Python syntax already uses : in various places.)

Sidebar: a really basic Symbol class

We can construct a really basic Symbol type fairly simply:

class Symbol(object):
  __slots__ = ('name',)
  def __init__(self, name = None):
    self.name = name

  def __repr__(self):
    if not self.name:
      return super(Symbol, self).__repr__()
    return "<Symbol: %s>" % self.name

  def __str__(self):
    if not self.name:
      return super(Symbol, self).__str__()
    return "Symbol('%s')" % self.name

In this version, named symbols are fully unique.
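As a usage sketch (the lookup function and its table are hypothetical), such a symbol makes a good 'no argument supplied' sentinel, because no caller can accidentally pass an equal value; even an explicit default of None stays distinguishable from 'not given'. The Symbol class here is just the data part of the one above, restated so the example is self-contained.

```python
class Symbol(object):
    __slots__ = ('name',)
    def __init__(self, name=None):
        self.name = name

_no_arg = Symbol("no_arg")

def lookup(key, default=_no_arg):
    # hypothetical lookup: raise only when no default was supplied
    table = {"a": 1}
    if key in table:
        return table[key]
    if default is _no_arg:
        raise KeyError(key)
    return default
```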

SymbolTypeDesire written at 01:18:16

2011-08-03

How I encode and decode the milter protocol (or, how to write a codec for a sane binary protocol)

After all of my worrying and investigation of modules to handle protocols for me, I wound up writing an encoder and a decoder for sendmail's milter protocol by hand because it was the simplest way. This is because the milter protocol is a sane binary protocol and it turns out that there's a straightforward way (at least in a dynamic language like Python) to write a codec for such a protocol.

As a sane binary protocol, the milter protocol starts with a packet format:

uint32  len
char    cmd
char    data[len-1]

The cmd byte is the message type, which determines the structure of data. Each message has a fixed structure; there are some fixed number of fields, each of which is one of a small number of primitive field types. (All of this is what you'd expect for a sane binary protocol.)

First you need an encoder and a decoder for every primitive field type, plus some way to find them given a field type. I wound up with eight types for the milter protocol (although you could do it with seven), and I used a simple mapping dict:

codectypes = {
  'buf': (encode_buf, decode_buf),
  'u16': (encode_u16, decode_u16),
  ....
}

(Many of these routines were slight variants on each other; with the right support routines, actual encoders and decoders were mostly two lines per type. In the end I opted not to play fancy tricks with namespaces, partly because I like having simple two-line functions.)
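For concreteness, here is what a couple of those primitive codecs might look like (the names and wire formats are my illustrative guesses, not the actual implementation). The useful convention is that each decoder returns the decoded value plus the unconsumed rest of the buffer, so decoders can be chained field by field:

```python
import struct

def encode_u16(v):
    return struct.pack(">H", v)       # network byte order

def decode_u16(data):
    return struct.unpack(">H", data[:2])[0], data[2:]

def encode_str(s):
    return s + b"\0"                  # NUL-terminated byte string

def decode_str(data):
    i = data.index(b"\0")
    return data[:i], data[i+1:]

codectypes = {
    'u16': (encode_u16, decode_u16),
    'str': (encode_str, decode_str),
}
```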

With the field types defined, we can now define each message as a sequence of named fields, each with a type. Again there are lots of ways to encode this data and I used brute force:

messages = {
  SMFIC_HEADER: (('name', 'str'),
                 ('value', 'str')),
  ....
}

To decode a message you first read the entire packet (which you can do without knowing anything about the message's structure), then look up the cmd in the messages table to determine the message structure. For each field in the message, you decode an item of the given primitive type and store it under the field name; at the end of decoding, you should have nothing left un-decoded in data (and you should not have run out). You return the cmd byte and a dictionary of all of the fields.

Encoding is the inverse process. You are given the cmd byte and a message dictionary. You look up the message structure in messages, then walk the list; for each named field, you extract its value from the dictionary, encode it as the given primitive type, and concatenate the resulting raw bytes to your data. When the message is fully encoded, you determine len and wrap the whole thing up as a packet.

(My implementation of encoding took this a step further in laziness by using keyword arguments to the encoding function to create the message dictionary; you invoke it as encode_msg(SMFIC_HEADER, name="foo", value="bar").)
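Put together, the whole data-driven codec is only a few lines. This sketch is a toy reconstruction, not the actual code: it supports a single primitive type and a single message (SMFIC_HEADER is 'L' in the real protocol, but check the specification before relying on any of this).

```python
import struct

# toy primitive codec: NUL-terminated byte strings (assumed format)
def encode_str(s):
    return s + b"\0"

def decode_str(data):
    i = data.index(b"\0")
    return data[:i], data[i+1:]

codectypes = {'str': (encode_str, decode_str)}

SMFIC_HEADER = b"L"
messages = {
    SMFIC_HEADER: (('name', 'str'),
                   ('value', 'str')),
}

def encode_msg(cmd, **fields):
    # walk the message structure, encoding each named field in order,
    # then wrap the result up as a length-prefixed packet
    data = b"".join(codectypes[ftype][0](fields[fname])
                    for fname, ftype in messages[cmd])
    payload = cmd + data
    return struct.pack(">I", len(payload)) + payload

def decode_msg(packet):
    # split the packet into len, cmd, and data, then decode field by field
    (plen,) = struct.unpack(">I", packet[:4])
    cmd, data = packet[4:5], packet[5:4 + plen]
    fields = {}
    for fname, ftype in messages[cmd]:
        value, data = codectypes[ftype][1](data)
        fields[fname] = value
    if data:
        raise ValueError("leftover data after decoding")
    return cmd, fields
```

A round trip looks like encode_msg(SMFIC_HEADER, name=b"foo", value=b"bar") followed by decode_msg() on the resulting packet.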

This requires minimal code and the code it does need is mostly generic. The actual process of encoding and decoding is data-driven; the protocol itself is basically specified in the messages dictionary, and adding new messages is trivial as long as they use existing primitive field types. Repeated boilerplate code is basically completely eliminated.

(This requires a dynamic language partly because it heavily relies on polymorphic argument handling and the ability to ship values around without the intermediate generic encoding and decoding layers having to care what type they are. If you had to do the usual strict static typing, you'd probably need a separate encoding function for each message and I'm not sure how you'd handle decoding.)

On a side note, this means that I need to take back some of the nasty things I said about the milter protocol back a year ago. Particularly, it does not have messages with a variable number of message fields. (I misread that part of the specification earlier.)

HowMilterCodec written at 01:14:26

