Wandering Thoughts archives

2011-07-31

Reference counting and multiple inheritance in (C)Python

I recently stumbled over this comment on an LWN article about object-oriented design patterns in the Linux kernel. Quoting a bit from the original article, Auders asked:

Though it seems obvious when put this way, it is useful to remember that a single object cannot have two reference counters - at least not two lifetime reference counters [...]. This means that multiple inheritance in the "data inheritance" style is not possible.

The standard CPython implementation of Python uses reference counting, and yet it supports multiple inheritance. How is this possible?

There are two answers. (I'm going to try to write this from the perspective of a C programmer.)

This inheritance problem arises in C because of how most C code handles data inheritance. As covered in the LWN series, you typically do inheritance by direct structure embedding; to inherit from struct A, you put a struct A inside your own struct. This means that your struct's lifetime is tied to the lifetime of the embedded struct A; when A's reference count goes to zero, you will be told to delete your entire structure. If you embed both struct A and struct B, each of them separately reference counted, then A can have its reference count go to zero before B's does, and you will be told to delete your entire structure even though the embedded B is still alive and cannot be deleted.

Python does not do data inheritance by directly embedding structures. Instead each object has a single storage for all fields, regardless of where they come from, and all classes that you inherit from write their fields into it. This is part of what enables Python objects to have a single reference count which is manipulated by everything that takes or releases a reference to the object, regardless of which class's code and data it is working through. This works at the C level because all CPython objects start with a common structure that can be manipulated generically without needing to know what sort of object you're dealing with.
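
A small illustration of this (the class names are made up): no matter which parent class a field comes from, it lands in the instance's single __dict__, and the instance has a single reference count.

import sys

class A(object):
    def __init__(self):
        self.a = 1

class B(object):
    def __init__(self):
        self.b = 2

class C(A, B):
    def __init__(self):
        A.__init__(self)
        B.__init__(self)

c = C()
print(c.__dict__)          # {'a': 1, 'b': 2}: one shared field storage
print(sys.getrefcount(c))  # one object-wide reference count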

(You could do a single object-wide reference count in C if you wanted to, but it requires extra overhead and only makes sense if your struct A and struct B are purely virtual and must always be subclassed. Instead of directly embedding a reference count in each structure, you'd embed a pointer to the overall object's reference count (and set this in each embedded struct when creating your overall object). You also need to think about whether it does damage to have a dangling struct A that is not referenced from anywhere, because this is what happens when you drop the last reference to A before the overall object can be deleted.)
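
To make that C scheme concrete, here is a toy Python simulation of it (all the names here are invented; real code would do this with C structs and pointers):

class SharedCount(object):
    """Stands in for the overall object's single reference count."""
    def __init__(self):
        self.count = 1

class PartA(object):
    def __init__(self, shared):
        self.shared = shared  # in C, a pointer set at object creation

class PartB(object):
    def __init__(self, shared):
        self.shared = shared

class Whole(object):
    def __init__(self):
        shared = SharedCount()
        self.a = PartA(shared)  # both embedded parts point at the
        self.b = PartB(shared)  # same object-wide count

def decref(part):
    part.shared.count -= 1
    if part.shared.count == 0:
        print("last reference gone; free the whole object")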

The other answer is that CPython also has this limit on multiple inheritance, but it's more carefully disguised. Because CPython cannot do (C) structure embedding, it simply refuses to let a Python class inherit from two different C-level classes, or in fact from two classes (Python or C-level) that have incompatible object layouts at the C level. Typical Python programs never notice because almost all (Python) classes only inherit from a single C-level class, namely object. You can safely do multiple inheritance through several paths to the same C-level root class because of how all those instance fields from your Python classes get stuck into the object's generic field storage.

(C-level classes effectively do not inherit from each other.)

Attempting to break this constraint gets you a series of odd error messages depending on what exactly you're trying to do. I've written about various specific manifestations of this before, such as here and here.
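
One concrete way to see this in CPython (dict and list are both C-level classes with incompatible instance layouts):

# This fails at class creation time, before any instance exists:
class Broken(dict, list):
    pass
# TypeError: multiple bases have instance lay-out conflict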

(Both answers are true, but the first answer is incomplete.)

RefcountAndMultiInheritance written at 21:47:42

One of my little dirty testing secrets

I recently read yet another article on TDD, and one of the things this article talked about was the benefit of reading tests in order to understand what the code was doing.

When I thought about someone doing this to my tests, I laughed hollowly.

One of the little dirty secrets about the tests I write is that they are, well, slapped together. I almost invariably write tests in the most expedient and brute-force way, and I don't particularly write comments about what the tests are testing and how. (Sometimes I will write a one-sentence summary of what bit of the API a test is testing, mostly because Python's unittest module encourages this.)

This goes well beyond how I'd rather have clean code and dirty tests. A good part of it is that I still have the mindset that tests are overhead, and the less time I spend on them, the more time I have left for writing the useful things. Writing clean, carefully commented test code would require a lot more work. Part of it is that most test code is about the most boring, straightforward code that you could imagine; 'repeatedly call this routine with certain inputs and verify that you get certain outputs' is essentially boilerplate, except that you can't automate it.

(I have a tendency to make my tests somewhat exhaustive. I'm not content to test that a routine works with one set of inputs, so I want to call anything important with all sorts of arguments to check basic functionality, boundary conditions, and so on.)
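
As a sketch of what this sort of expedient, barely commented test looks like (add() here is a stand-in function, not code from any real project of mine):

import unittest

def add(a, b):
    return a + b

class TestAdd(unittest.TestCase):
    def test_add(self):
        "add() handles basic inputs, boundaries, and mixed signs."
        cases = [((0, 0), 0), ((1, 2), 3), ((-1, 1), 0),
                 ((2**31, 1), 2**31 + 1)]
        for args, want in cases:
            self.assertEqual(add(*args), want)

if __name__ == "__main__":
    unittest.main()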

When I was essentially developing code for myself, this was sort of acceptable (although not great, since even I can forget what my tests were about if I was away from the code for a while). But since I've been developing code in a more shared environment I've become increasingly conscious of how hard it would be for my co-workers to understand my tests as part of developing a change to my code. Although I'm not certain what the right answer is, I suspect that it is adding more comments and more careful code structure to my tests, even though this is sure to make them slower and more annoying to write (and to revise).

(Probably I should look at how some real projects in the wild structure and document their tests, to see how people who really know TDD deal with this problem.)

MyTestingDarkSecret written at 01:37:00

2011-07-25

A little thing that irritates me about common WSGI implementations

One of the issues that WSGI has to deal with is the question of how to connect a WSGI application and a WSGI server together, or to put it the other way, how to tell a WSGI server what to do to actually invoke your application. The WSGI specification specifically does not cover this issue, considering it a server specific issue. However, something of a standard seems to have grown up in WSGI implementations; you supply a module name (or sometimes a file), and the WSGI server expects to import this and find a callable application object that is your application.
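
Concretely, the file you point such a server at looks something like this minimal sketch:

# app.wsgi -- the WSGI server imports this file and expects to find
# a callable named 'application' already defined when the import finishes.

def application(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello world\n"]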

This makes me twitch.

One of the basic rules of writing sensible Python code is that modules should do nothing when simply imported. To the extent that a module absolutely must run code (instead of simply defining things), that code should be minimal and focused on things like getting imports set up right. Among other things, the more active code you run at import time, the harder it is to diagnose failures and bugs by importing your module and selectively running code from it; the moment you look at your module, it explodes.

Creating a properly configured WSGI application generally requires running code, sometimes quite a lot of code. Yet you must have a callable application object ready at the end of the import. There is a conflict here, and none of the resolutions to it make me very happy.

(I can think of at least three: you can run all of that code at import time, you can write a more complex application front end that defers running it all until the first request, or you can pretend that you can configure your application through entirely passive Python declarations. The last is the route that Django takes, and I think that it has a number of bad effects.)

What would be a great deal better is if WSGI implementations had instead standardized a function that you called in order to get a callable application object. This would allow you to have a pure and easily importable set of modules while still doing all of your configuration work before the first request came in (instead of stalling the first request while you frantically do all of the work to configure things).
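
A minimal sketch of the convention I'd prefer (create_application is an invented name; no such standard actually exists):

# myapp/wsgi.py -- importing this module runs no configuration code.

def create_application(greeting=b"hello world\n"):
    """Do all of the (possibly slow) configuration work up front,
    then return the actual WSGI callable."""
    # ... read config files, set up databases, and so on, here ...
    def application(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [greeting]
    return application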

WSGIImportIssue written at 01:36:10

2011-07-14

Ramblings on handling optional arguments in Python

Pete Zaitcev:

Every time I think switching to Ruby, smth pops like [Detecting unspecified method arguments in Ruby]

This gives me a platform to spring off from. Python has the same 'problem' as Ruby here, ultimately because both languages' designers picked the same solutions to the problem of optional arguments for functions. Roughly speaking, I can think of three ways of handling optional arguments in a language: declare them explicitly, use default values, or explicitly handle at least some of the argument list yourself.

A language that lets you explicitly declare optional arguments can also support an explicit and direct check to see which optional arguments were supplied. I believe that Lisps have historically supported this approach. However, this requires extra syntax and an extra language feature; both Python and Ruby have opted not to have this and to only support optional function arguments indirectly, through either of the other two approaches.

Handling part of the argument list yourself leaves you to decode it into actual variables (present and absent). This is annoying, so most people wind up using arguments with default values; essentially this pre-decodes the optional portion of the argument list for you. But it does leave you with a problem, one that Lisp people note explicitly in their documentation: since you're using the argument's default value as a signal that the caller didn't supply it, you need a way of distinguishing between the argument not being supplied and your caller supplying a value that happens to also be your 'argument was not supplied' default value.

This is a specific instance of a general situation where you need a sentinel value that can be distinguished from all valid values. Since Python is dynamically typed, often the simplest way to get such a value is to create one yourself, and the simplest way to do that is just to create a new instance of object so that you get a unique value:

no_arg = object()

def optarg(a, b=no_arg):
    if b is no_arg:
        pass  # b was not supplied by the caller; handle that case here

(You might as well use is here, because this is one of the rare cases where you really do want object identity instead of object equality.)

You don't need to create your own new sentinel value if you can come up with a convenient existing value that no caller will ever supply (perhaps because it's invalid). None is a popular choice for this role, although I tend to think it's not an ideal option.

(The problem with None as a sentinel value is that it's easy for a None value to creep into your program through various bugs, oversights, or just other functions that return None under various unusual circumstances. The result is a peculiarly hard to spot situation where how the code reads is not how the code actually works; you think that you're calling optarg() with two arguments because that's what's in the source code, but in actuality you're calling it with one. If the effects of this are indirect and only become visible much later this could be quite a head scratcher bug.)
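
For example (a contrived sketch), a helper that quietly falls off its end on a miss returns None, and with None as the sentinel a two-argument call silently turns into a one-argument one:

table = {}

def lookup(key):
    # bug: implicitly returns None when the key is missing
    if key in table:
        return table[key]

# This reads as a two-argument call, but if optarg() used None as its
# sentinel, it would behave exactly like optarg(1):
optarg(1, lookup("missing"))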

If you use this idiom a lot, it might be worthwhile to create a function so that you can use a clear name:

def unique_value():
    return object()

no_arg = unique_value()

(Trivia: there is a tiny reason to make this a function instead of a subclass of object.)

These sentinel values are not completely ideal; for example, they are not self-documenting if you display them during debugging. But doing better requires more verbosity and repetition, at least in Python.

(This is one of the cases where it would be nice to have a real built in and fully supported symbol type, and yes syntactic sugar does matter.)

(This issue came up in passing before in DefaultArgumentsTrick.)

Sidebar: on unique values in Python

When I say 'unique value', I mean 'some object that we can reliably distinguish from all other objects in the Python universe'. Note that certain sorts of built in values (and thus objects, because in Python everything is an object) that you might think are unique are not in fact unique in CPython, because the CPython interpreter plays tricks behind your back. A full discussion of these tricks is an entry in and of itself, but the short version is that you're safe from them if you use a mutable type. Instances of object are not really mutable in a conventional sense (as I discovered and then much later figured out why), but they're close enough for this.
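
For example, CPython caches small integers (an implementation detail, not a language guarantee), while every object() instance really is distinct:

a = 256
b = int("256")  # computed at runtime, yet the very same object
print(a is b)   # True in CPython: small integers are cached

x = object()
y = object()
print(x is y)   # False: each object() is a fresh, unique object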

OptionalArgumentsIssue written at 02:11:47

2011-07-11

Some things to think about when doing polymorphic WSGI

Yesterday I wrote about a WSGI version of cat, but I left out some practical considerations (sometimes I write entries a little bit too fast). In fact these issues are common to all uses of what I've called 'polymorphic WSGI' (and I've alluded to one of them before in passing).

The first complication in a WSGI cat implementation is that you need to figure out how to create and load your WSGI application. As I've become aware, there's actually sort of a standard for this (courtesy of Apache's mod_wsgi if nothing else); you have a chunk of Python code in a file that, when loaded, defines an 'application' object in its namespace. Creating this file is your job as the application writer, but once you have it you can just feed it to wsgi-cat as an argument.

(DWiki vastly predates me becoming aware of this, so it has its own application specific system for configuring its WSGI application interface.)

The second complication is that a WSGI application may care about a lot more of the HTTP request than just the URL; for example, what the Host: header is may matter (and in sophisticated environments, you may need to set cookies and other things). In theory you can supply all of these with command-line arguments to your wsgi-cat program; in practice your program really wants to have sensible defaults for your own environment just so that you don't have to invoke it with a pile of arguments all of the time. Which additional bits of information you need will depend on your specific application or WSGI framework, but in general the more sophisticated the framework, the more random bits of the HTTP request it's probably going to care about. The overkill solution for this is to capture a full WSGI environment from a real browser request and then use it as the default environment (with suitable things modified).

(On the other hand, you actively want to turn off some things by omitting them from the claimed HTTP headers; for example, you probably don't want your wsgi-cat to give you gzip'd output by default.)

You may also want options to fake things like https-based requests in addition to plain HTTP ones. (Locally I'd also want to support HTTP Basic Authentication, or at least the Apache environment variables for it, but that's a peculiarity of our setup that's probably not applicable for most people.)
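
Putting these pieces together, the default environment such a tool builds might look like this sketch (the keys are the standard WSGI and CGI ones; the defaults and the helper name are mine):

import io
import sys

def make_environ(path, host="localhost", scheme="http", method="GET"):
    """Build a minimal WSGI environ for a fake request for 'path'."""
    return {
        "REQUEST_METHOD": method,
        "SCRIPT_NAME": "",
        "PATH_INFO": path,
        "QUERY_STRING": "",
        "SERVER_NAME": host,
        "SERVER_PORT": "443" if scheme == "https" else "80",
        "SERVER_PROTOCOL": "HTTP/1.0",
        "HTTP_HOST": host,
        "wsgi.version": (1, 0),
        "wsgi.url_scheme": scheme,  # set to "https" to fake a TLS request
        "wsgi.input": io.BytesIO(b""),
        "wsgi.errors": sys.stderr,
        "wsgi.multithread": False,
        "wsgi.multiprocess": False,
        "wsgi.run_once": True,
    }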

PolymorphicWSGIIssues written at 00:15:40

2011-07-10

Exploiting polymorphic WSGI again to create cat

I've written before about how I (ab)use what I call the polymorphic nature of WSGI, where a WSGI application doesn't actually care what environment you run it in; anything that will provide a WSGI environment will do. I've used this before for various things and was recently reminded of yet another trick I've played with WSGI.

Suppose that you have a web application and you want to inspect or capture a rendered web page (especially if you want to read the raw HTML). Obviously you can do this from a browser or you can use something simpler like wcat, but one day it struck me: why was I going through the bother of making a web request (and possibly firing up a test web server) just to get some output from my WSGI application?

The result is something that you could call 'wsgi-cat' (although for me it is specific to DWiki). Given a URL on the command line, it connects up all of the WSGI infrastructure necessary, passes the URL to the WSGI application (as a GET request), and then dumps out whatever 'web page' the application returned (which isn't necessarily HTML; it might be a HTTP redirection, for example). I haven't yet given it any support for handling POSTs, but it wouldn't really be hard.
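
The heart of such a tool is only a few lines. A sketch (this is not DWiki's actual code; the environ would come from something like the make_environ helper sketched in the entry above):

import sys

def run_app(application, environ):
    """Call a WSGI application once and print the status line,
    headers, and body to standard output."""
    state = {}

    def start_response(status, headers, exc_info=None):
        state["status"] = status
        state["headers"] = headers
        return lambda data: None  # the legacy write() callable

    body = application(environ, start_response)
    try:
        # Iterate first; applications may call start_response lazily.
        chunks = list(body)
    finally:
        if hasattr(body, "close"):
            body.close()
    print(state["status"])
    for name, value in state["headers"]:
        print("%s: %s" % (name, value))
    print("")
    for chunk in chunks:
        sys.stdout.write(chunk.decode("utf-8", "replace"))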

I originally wrote this because I was tired of firing up various things just to look at how I was rendering HTML, but it's turned out to be quietly convenient for any number of things. Looking at HTTP redirects that my application is supposed to generate is one example; many web programs will transparently handle them for you, which is a great convenience right up to the point where you want to inspect them.

(Another potential use is artificially generating error pages or POST response pages and capturing their HTML in order to validate it. Online HTML validators are generally GET-only, which can let validation errors lurk in the POST side of things. (Assuming that you care about validating your HTML at all.))

WSGICatTrick written at 00:53:50

