2011-10-03
On CPython, cell objects, and closures
A commentator on "Understanding a tricky bit of Python generators" noticed an odd-looking thing with generator expressions:
adders = list(lambda x: x+i for i in range(0, 100))
print adders[17](42)
This prints '141'. On the one hand this is quite ordinary (it's just
like other cases of lambdas), but on the other hand things start looking
quite peculiar when you disassemble the generated bytecode for the
lambda expressions and look at the nominal local variables. If we look
at adders[0].__closure__[0] we see what is printed as something like
<cell at 0xb720e1dc: int object at 0x97ced74>. The commentator noted:
So when the generator yields its second element, the second lambda function that is constructed, though it is a distinct object itself, uses the same closure cell as the first one does. [...]
I confess I don't understand why the CPython interpreter is doing this. [...] But instead, the same closure cell object is being used each time, though I haven't been able to find where, if anywhere, this closure cell object is stored inside the generator.
To understand what was going on I wound up reading the CPython source code and then thinking about the problem a bit. So let's backtrack to talk about a simpler example:
def mkadder(i):
    def adder(x):
        return x+i
    print i
    return adder
In a normal function, one that does not create a closure, i would
be a local variable and so would be stored in the local variable
array in the function's stack frame. Such local
variables do not have an independent existence from the stack frame
itself; when you set or retrieve their value, you are actually doing
'set local variable slot N' or 'get local variable slot N'. The local
variable names are retained only for debugging purposes.
However this presents a problem when we create a closure; because
the closure needs to access i, we need to keep i alive after the
function has exited. If i was still stored in the stack frame we would
have to keep the entire stack frame alive. Even if we dropped all
local variables that were not referred to by the closure we'd still
be hauling around an extra stack frame that we don't really need (or
possibly several stack frames, because you can nest closure-creating
functions).
So what CPython does is that it silently converts i from a local
variable into an independent cell object. A cell object is nothing
more than a holder for a reference to another object; it is essentially
a variable without a name. Because the cell object has an independent
existence from mkadder's stack frame it can be incorporated into the
adder closure by itself (more specifically, it can be referred to by
the closure).
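The cell can be seen from ordinary Python code. What follows is a minimal sketch, written so that it runs on Python 3, which drops the print from mkadder and introduces the name add5 purely for illustration. It relies on the co_cellvars and co_freevars attributes of code objects and on the cell_contents attribute of cells.

```python
def mkadder(i):
    def adder(x):
        return x + i
    return adder

add5 = mkadder(5)   # add5 is just an illustrative name

# The compiler records i as a "cell variable" of mkadder and as a
# "free variable" of adder.
cellvars = mkadder.__code__.co_cellvars
freevars = add5.__code__.co_freevars

# The closure carries a single cell, which currently refers to the int 5.
cell = add5.__closure__[0]
print(cellvars, freevars, cell.cell_contents)
```

The cell outlives the call to mkadder, which is the whole point: add5 keeps working even though mkadder's stack frame is long gone.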
You might ask why CPython goes through all of this extra work to add a
layer of indirection to the closure's reference to i. After all, why
not simply capture i's current binding? The answer is that this would
be inconsistent with the treatment of global variables and it would
introduce oddities even for local variables. For example, consider the
following code:
def mkpadder(i, j):
    def adder(x):
        return x+i
    a = adder
    i = i + j
    return a

t = mkpadder(1, 10)
print t(1)
If closures immediately captured the binding of their closed-over
variables this would print 2, while a slight variant that had
'return adder' instead of 'return a' would print 12. Or maybe both
versions should print 2 because the closure would be created the
moment that adder is defined in mkpadder instead of when it must be
materialized (in either 'a = adder' or 'return adder').
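For comparison, here is what CPython actually does with both versions. This is a small sketch meant for Python 3; mkpadder_variant is a name made up here for the 'return adder' version, and t1 and t2 are likewise just illustrative.

```python
def mkpadder(i, j):
    def adder(x):
        return x + i
    a = adder
    i = i + j          # rebinding i changes what the shared cell refers to
    return a

def mkpadder_variant(i, j):
    # Hypothetical name for the 'return adder' version of the example.
    def adder(x):
        return x + i
    i = i + j
    return adder

t1 = mkpadder(1, 10)
t2 = mkpadder_variant(1, 10)
print(t1(1), t2(1))
```

Both calls give 12, because in each case adder looks i up through the cell when it runs, and by then the cell refers to 11.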
So what is happening in the generator is that the loop variable i is a
cell object instead of a local variable, because it is incorporated into
the lambda closure. Every lambda closure gets a reference to the same
cell object, and as the for loop runs the value of the cell object
(ie, the object it points to) keeps changing until at the end it has
the value 99. The repr() of cell objects contains both their actual
address (which is constant) and some information about the object that
they currently point to (which may change).
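All of this can be checked from Python itself. The following is a sketch for Python 3 that uses the cell_contents attribute of cells; every variable name apart from adders is invented for the illustration.

```python
adders = list(lambda x: x + i for i in range(0, 100))

# Every lambda is a distinct function object ...
distinct_functions = len(set(id(f) for f in adders))

# ... but they all hold the very same cell object.
cells = [f.__closure__[0] for f in adders]
shared = all(c is cells[0] for c in cells)

# Once the loop has finished, the cell refers to the last value of i.
final_value = cells[0].cell_contents

# Stepping a generator by hand shows the cell's contents changing
# underneath a lambda that was created earlier.
gen = (lambda x: x + i for i in range(3))
first = next(gen)
before = first.__closure__[0].cell_contents
second = next(gen)
after = first.__closure__[0].cell_contents

print(distinct_functions, shared, final_value, adders[17](42), before, after)
```

Here before is 0 and after is 1, even though both are read through the first lambda; nothing about that lambda changed except what its cell points to.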
You can see the difference between cell objects and local variables in
the bytecode. Cell variables use LOAD_DEREF and STORE_DEREF
bytecodes; local variables use LOAD_FAST and STORE_FAST.
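One way to look at this without reading a full disassembly is to collect the opcode names that a function uses. This is a sketch that assumes Python 3.4 or later (for dis.get_instructions); plain and opnames are helper names invented here, and the exact set of opcodes differs between CPython versions (newer releases have several LOAD_FAST variants, for example), so it only looks for the DEREF opcodes.

```python
import dis

def plain(i):
    return i + 1          # i is an ordinary local variable here

def mkadder(i):
    def adder(x):
        return x + i      # i is a free variable of adder
    return adder

def mkpadder(i, j):
    def adder(x):
        return x + i
    i = i + j             # assignment to a variable that lives in a cell
    return adder

def opnames(func):
    """Return the set of opcode names used in func's bytecode."""
    return {ins.opname for ins in dis.get_instructions(func)}

plain_ops = opnames(plain)
adder_ops = opnames(mkadder(1))
mkpadder_ops = opnames(mkpadder)

print(sorted(plain_ops))
print(sorted(adder_ops))
print(sorted(mkpadder_ops))
```

The inner adder reads i with LOAD_DEREF, mkpadder writes it with STORE_DEREF, and plain never touches a DEREF opcode at all.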
(Trivia: a long time ago I confidently asserted that
the reason you couldn't write to closed-over variables in a closure was
that the CPython bytecode had no operation to store things into such
variables, although it could read from them. As we can see from the
existence of STORE_DEREF, I was completely wrong. CPython simply
doesn't do it for its own reasons.)
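For what it's worth, Python 3 (unlike the Python 2 that the examples in this entry are written in) has a nonlocal statement that lets a closure rebind a closed-over variable, and the store is compiled to STORE_DEREF. A minimal sketch, with mkcounter and bump as made-up names:

```python
import dis

def mkcounter():
    count = 0
    def bump():
        nonlocal count     # Python 3 only: allow rebinding the closed-over name
        count += 1
        return count
    return bump

bump = mkcounter()
results = [bump(), bump(), bump()]

# Collect the opcode names used by the inner function.
bump_ops = {ins.opname for ins in dis.get_instructions(bump)}
print(results, "STORE_DEREF" in bump_ops)
```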
Sidebar: why CPython doesn't use a real namespace for this
Given that cell objects are a lot of work essentially to fake a namespace, one plausible alternate approach is for CPython to create a real namespace for closed over variables instead. There are a couple of reasons to avoid this.
First, it wouldn't necessarily be one namespace. A single function can create multiple closures, each of which can use a different subset of the function's local variables. Plus you have the issue of closures that themselves create more closures that use a subset of the closed over variables.
Second, it would be less efficient than the current approach. Right now, pointers to the cell objects for any particular chunk of code are stored in an array (basically just like local variables) and are set and retrieved by index into the array, instead of by name lookups in a namespace dictionary. This makes referring to closed over variables only slightly slower than referring to local variables (it adds an extra dereference through the cell object). Changing to name lookups in a dictionary (as if they were global variables) would be considerably slower.
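Both points can be illustrated from Python, at least roughly. This sketch (for Python 3; mkpair and the inner function names are invented) creates two closures over different subsets of one function's variables and then pairs each closure's co_freevars names with the cells in its __closure__ tuple, which are kept in the same order.

```python
def mkpair(a, b, c):
    def uses_a_and_b(x):
        return x + a + b
    def uses_c(x):
        return x * c
    return uses_a_and_b, uses_c

f, g = mkpair(1, 2, 3)

# Each closure names only the variables it actually uses ...
f_names = f.__code__.co_freevars
g_names = g.__code__.co_freevars

# ... and __closure__ is a tuple of cells in the same order as
# co_freevars, so a cell is found by position rather than by name.
f_values = {name: cell.cell_contents for name, cell in zip(f_names, f.__closure__)}
g_values = {name: cell.cell_contents for name, cell in zip(g_names, g.__closure__)}

print(f_names, f_values)
print(g_names, g_values)
```

The names in co_freevars are only there for introspection like this; the running bytecode refers to the cells by index.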
My idea of how a modern mailing service should work
From one perspective, I can totally understand why small companies want to outsource handling outgoing mail to a dedicated mail provider. The days when you could just install an MTA, plug in some settings, and be done are long over; these days doing a decent job of sending mail and getting it delivered to as many places as possible requires a significant amount of specialized expertise, and the expertise goes up if you want to use HTML mail. You could learn all of this, but why? It's better to outsource and let full-time specialists handle it for you.
On the other hand, as a sysadmin on the receiving end of these mail services I have some issues. Specifically, they get abused by spammers and they have a strong incentive to spend as little money as they can get away with on preventing this (money spent preventing spam is pure expense). On average, the only contact I have with a mailing service is being sent some form of spam (there are many mailing services and I don't sign up with very many places that use them).
Thus I have formed a theory about how such a modern mailing service should work: normally and by default it should proxy outgoing email through your server, using a dedicated proxy agent (not an MTA that you set up). All of the hard work would still be done by the mailing service on their machines and you would continue interacting with them as normal; it's just that the final delivery would emerge from your machine, on your IP address, instead of directly from one of their IP addresses.
The advantage for everyone is that this would make your mail unambiguously your mail, and avoid any contamination with other people who are also using the mailing service provider. The mailing service provider would effectively become less of a provider of mail and somewhat more a provider of mail handling software (and expertise), software that just happened to run on their servers as a service.
This clearly doesn't work for everyone in all situations, so the mailing service would still have an option to send out the mail for you. But I think that 'the mail comes out your IP address' should be the default starting case.
(Since this is the era of running companies out of AWS, it's possible that I'm drastically underestimating how many people would need the mailing service to send out email for them; maybe you simply can no longer assume that people have dedicated IP addresses in address space that hasn't been badly abused and contaminated.)