== On CPython, cell objects, and closures

A commentator on [[Understanding a tricky bit of Python generators TrickyPythonBinding]] noticed an odd looking thing with generator expressions:

> _adders = list(lambda x: x+i for i in range(0, 100))_ \\
> _print adders[17](42)_

This prints '141'. On the one hand this is quite ordinary (it's just like other cases of lambdas), but on the other hand things start looking quite peculiar when you disassemble the generated bytecode for the lambda expressions and look at the nominal local variables. If we look at ((adders[0].__closure__[0])) we see what is printed as something like ((<cell at 0x...: int object at 0x...>)). The commentator noted:

> So when the generator yields its second element, the second lambda
> function that is constructed, though it is a distinct object itself,
> uses the same closure cell as the first one does. [...]
>
> I confess I don't understand why the CPython interpreter is doing
> this. [...] But instead, the same closure cell object is being used
> each time, though I haven't been able to find where, if anywhere, this
> closure cell object is stored inside the generator.

To understand what was going on I wound up reading the CPython source code and then thinking about the problem a bit. So let's backtrack to talk about a simpler example:

> def mkadder(i):
>   def adder(x):
>     return x+i
>   print i
>   return adder

In a normal function, one that does not create a closure, _i_ would be a local variable and so would be stored in [[the local variable array WhyLocalVarsAreFast]] in the function's stack frame. Such local variables do not have an independent existence from the stack frame itself; when you set or retrieve their value, you are actually doing 'set local variable slot N' or 'get local variable slot N'. The local variable names are retained only for debugging purposes.

However, this presents a problem when we create a closure: because the closure needs to access _i_, we need to keep _i_ alive after the function has exited.
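As a quick illustration (in Python 3 syntax; the examples here are Python 2), _adder_ keeps working even after _mkadder_ has long since returned, so _i_ must survive somewhere outside _mkadder_'s stack frame:

```python
def mkadder(i):
    def adder(x):
        return x + i
    return adder

add5 = mkadder(5)
# mkadder's stack frame is gone by now, but i is still reachable:
print(add5(3))          # prints 8

# The function object carries i around in its closure, a tuple of cells:
print(add5.__closure__)
```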
If _i_ was still stored in the stack frame we would have to keep the entire stack frame alive. Even if we dropped all local variables that were not referred to by the closure we'd still be hauling around an extra stack frame that we don't really need (or possibly several stack frames, because you can nest closure-creating functions).

So what CPython does is that it silently converts _i_ from a local variable into an independent *cell object*. A cell object is nothing more than a holder for a reference to another object; it is essentially a variable without a name. Because the cell object has an independent existence from _mkadder_'s stack frame it can be incorporated into the _adder_ closure by itself (more specifically, it can be referred to by the closure).

You might ask why CPython goes through all of this extra work to add a layer of indirection to the closure's reference to _i_. After all, why not simply capture _i_'s current binding? The answer is that this would be inconsistent with the treatment of global variables and it would introduce oddities even for local variables. For example, consider the following code:

> def mkpadder(i, j):
>   def adder(x):
>     return x+i
>   a = adder
>   i = i + j
>   return a
>
> t = mkpadder(1, 10)
> print t(1)

If closures immediately captured the binding of their closed over variables this would print _2_, while a slight variant version that had '_return adder_' instead would print _12_. Or maybe both versions should print _2_ because the closure would be created the moment that _adder_ is defined in _mkpadder_, instead of when it must be materialized (in either '_a = adder_' or '_return adder_'). With CPython's actual cell-based approach, both versions print _12_, because _adder_ only reads _i_'s current value when it is called.

So what is happening in the generator is that the loop variable _i_ is a cell object instead of a local variable, because it is incorporated into the _lambda_ closure.
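The original example, rewritten in Python 3 syntax, lets us check this directly:

```python
adders = list(lambda x: x + i for i in range(0, 100))
print(adders[17](42))   # prints 141

# The lambdas are distinct function objects...
print(adders[0] is adders[1])                                 # False
# ...but they all refer to one and the same cell object:
print(adders[0].__closure__[0] is adders[99].__closure__[0])  # True
# By the time we look at it, the cell holds the loop's final value:
print(adders[0].__closure__[0].cell_contents)                 # 99
```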
Every lambda closure gets a reference to the same cell object, and as the _for_ loop runs the value of the cell object (ie, the object it points to) keeps changing until at the end it has the value _99_. The _repr()_ of cell objects contains both their actual address (which is constant) and some information about the object that they currently point to (which may change).

You can see the difference between cell objects and local variables in the bytecode. Cell variables use the ((LOAD_DEREF)) and ((STORE_DEREF)) bytecodes; local variables use ((LOAD_FAST)) and ((STORE_FAST)).

(Trivia: [[a long time ago ClosureProblem]] I confidently asserted that the reason you couldn't write to closed-over variables in a closure was that the CPython bytecode had no operation to store things into such variables, although it could read from them. As we can see from the existence of ((STORE_DEREF)), I was completely wrong. CPython simply doesn't do it, for its own reasons.)

=== Sidebar: why CPython doesn't use a real namespace for this

Given that cell objects are a lot of work essentially to fake a namespace, one plausible alternate approach would be for CPython to create a real namespace for closed over variables instead. There are a couple of reasons to avoid this.

First, it wouldn't necessarily be one namespace. A single function can create multiple closures, each of which can use a different subset of the function's local variables. Plus you have the issue of closures that themselves create more closures that use a subset of the closed over variables.

Second, it would be less efficient than the current approach. Right now, pointers to the cell objects for any particular chunk of code are stored in an array (basically just like local variables) and are set and retrieved by index into the array, instead of by name lookups in a namespace dictionary.
This makes referring to closed over variables only slightly slower than referring to local variables (it adds an extra dereference through the cell object). Changing to name lookups in a dictionary (as if they were global variables) would be considerably slower.
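The bytecode difference between plain locals and cell variables is easy to see with the standard _dis_ module (Python 3 here; the exact bytecode details vary between CPython versions):

```python
import dis

def plain(x):
    i = 1
    return x + i        # i is an ordinary local: STORE_FAST / LOAD_FAST

def mkadder(i):
    def adder(x):
        return x + i    # i is closed over: accessed with LOAD_DEREF
    return adder

dis.dis(plain)          # shows STORE_FAST / LOAD_FAST for i
dis.dis(mkadder(0))     # disassembles the inner closure; shows LOAD_DEREF
```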