Understanding a tricky bit of Python generators

October 1, 2011

From this Python quiz's third question, consider the following code:

units = [1, 2]
tens = [10, 20]
nums = (a + b for a in units for b in tens)
units = [3, 4]
tens = [30, 40]
print nums.next()

This is a far more interesting quiz question than you might think, because what's going on is actually quite deep.

One of the things that gets people in Python is that it is in general what I've called a 'late binding' language; when you write an expression whose execution is deferred, the values of the variables it uses are not immediately captured. Instead they will be looked up when the code is actually executed. This shows up in the quiz's first question, for example. A straightforward interpretation of late binding might expect the code here to print 33. Instead it prints 31; tens is late binding, referring to the current value, but units has been bound immediately.

You may now be going 'say what?' So let me make your day a little bit more surreal:

units = [1, 2]
tens = [10, 20]
nums = (a + b for a in units for b in tens)
units.pop(0)
tens = [30, 40]
print nums.next()

This prints 32.

To explain this, let me quote from the language specification:

Variables used in the generator expression are evaluated lazily when the __next__() method is called for generator object (in the same fashion as normal generators). However, the leftmost for clause is immediately evaluated, [...]

However, 'evaluates' here does not mean what you might think. When Python 'evaluates' units in the 'for a in units' clause, it doesn't make a private copy of the list's value; instead, it creates an iterator object from the list. This iterator object is what the for loop actually loops over, and it internally has a reference to the original list that units was bound to. The first version of this question rebinds units but leaves the original list (now accessible only through the iterator) unaltered. nums.next() thus uses the first element of the original list as a. The second version mutates the original list by deleting the first element, so nums.next() winds up using '2' as a.

(In both cases the second for loop is un-evaluated until the generator begins operation, so it picks up the new binding for tens.)

I am in admiration of how deep this rabbit hole turned out to be once I actually started looking down it.

Sidebar: seeing this in the bytecode

To confirm what's happening, let's disassemble nums.gi_frame.f_code, the actual bytecode of the generator (I've somewhat simplified the bytecode disassembly syntax):

 0 SETUP_LOOP           (to 42)
 3 LOAD_FAST   '.0'
 6 FOR_ITER             (to 41)
 9 STORE_FAST  'a'
12 SETUP_LOOP           (to 38)
15 LOAD_GLOBAL 'tens'
18 GET_ITER
19 FOR_ITER             (to 37)
[...]

As we can see here, there is no reference to units; instead we do a load of a local variable called .0. If we check nums.gi_frame.f_locals['.0'], we'll see that this local variable is a 'listiterator' object. By contrast, tens is loaded explicitly as a global and then immediately turned into an iterator so that it can be looped over.

Note that you can get truly odd behavior by rebinding tens after you've called nums.next() once (or otherwise after you've invoked the generator once). This is because every time we go through the outer loop, the current binding of tens is re-captured into an iterator for the inner loop. Further rebinding of tens has a delayed effect; it takes effect on the next pass around the outer loop.

(Mutation of the current binding has an immediate effect, but is then lost if you've rebound tens as well.)

PS: you can make similar crazy things happen in conventional for loops, because the same object to iterator transformation is happening.


Comments on this page:

From 24.125.177.176 at 2011-10-02 14:56:12:

This appears to be even weirder than I thought on first reading your post. Take quiz question #1, the second version:

adders = [lambda x: x+i for i in range(0, 100)]
print adders[17](42)

I see why this prints 141; the "i" variable leaks into the global scope, so each lambda is seeing the final global value of i, which is 99.

However, try this with a generator instead of a list comprehension:

adders = list(lambda x: x+i for i in range(0, 100))
print adders[17](42)

Remarkably, this also prints 141! Here the i doesn't leak into global scope; instead, each lambda forms a closure (as can be seen by inspecting the bytecode), but somehow, the last i value, 99, is stored in each closure cell! This has nothing to do with lazy evaluation of iterators: the contents of each closure cell is just an integer object, not a reference to anything.

The only way to get the desired behavior is to force the closure into a separate lexical scope with a helper function:

def lmaker(i):
    return lambda x: x + i

adders = [lmaker(i) for i in range(0, 100)]
print adders[17](42)

This prints out 59.

By cks at 2011-10-02 23:56:00:

The same thing is happening with your generator version: all versions of the lambda refer to the same variable, it's just that the variable doesn't have a name. Since this is using a generator, we can actually see this happen 'live':

lnums = (lambda x: x+i for i in range(0, 100))
n0 = lnums.next()
print n0(10)
n1 = lnums.next()
print n0(10)

This will print 10 and then 11.

(I'm not sure how closure cells are implemented and just what is going on in the depths of CPython, but it's clear that this is what's happening on a Python-level basis.)

From 24.125.177.176 at 2011-10-03 01:36:22:

Following on from what you just tried, if you look at the actual closure cells, it's obvious how the variable reference is being implemented. I'll post the console session in doctest format so you can see the actual output:

>>> lnums = (lambda x: x+i for i in range(0, 100))
>>> n0 = lnums.next()
>>> print n0
<function <lambda> at 0xb720f304>
>>> print n0.__closure__[0]
<cell at 0xb720e1dc: int object at 0x97ced74>
>>> n0(10)
10
>>> n1 = lnums.next()
>>> print n1
<function <lambda> at 0xb720f33c>
>>> print n1.__closure__[0]
<cell at 0xb720e1dc: int object at 0x97ced68>
>>> n1(10)
11
>>> print n0
<function <lambda> at 0xb720f304>
>>> print n0.__closure__[0]
<cell at 0xb720e1dc: int object at 0x97ced68>
>>> n0(10)
11

So when the generator yields its second element, the second lambda function that is constructed, though it is a distinct object itself, uses the same closure cell as the first one does. And, of course, that closure cell now gets clobbered with the second value of i, so both lambdas now have a 1 stored there.

I confess I don't understand why the CPython interpreter is doing this. The lambda stored inside the generator itself (in lnums.gi_code.co_consts) is just a code object; it has the free variable 'i' but no closure cell bound to it in the stored constant. So it certainly looks like a closure cell ought to be constructed for it each time the generator constructs a new function object to yield as a value, especially since it's obvious from the above that a new integer value is being produced each time. But instead, the same closure cell object is being used each time, though I haven't been able to find where, if anywhere, this closure cell object is stored inside the generator. This seems like a very odd corner in the CPython closure implementation.

From 24.125.177.176 at 2011-10-03 19:53:03:

Sorry to keep cluttering up your comments, but a post on StackOverflow gave me another way to capture the desired binding in the lambda:

>>> lnums = (lambda x, j=i: x+j for i in range(0, 100))
>>> n0 = lnums.next()
>>> n0(10)
10
>>> n1 = lnums.next()
>>> n1(10)
11
>>> n0(10)
10

Here there is no closure, just an extra default argument, and that is evaluated when the lambda is created. This helps me understand a little better why the previous examples work the way they do; I was thinking of the lambda expression as implicitly defining a lexical scope the way the helper function did, but it doesn't.

StackOverflow post here:

http://stackoverflow.com/questions/139819/why-results-of-map-and-list-comprehension-are-different

I also came across another blog post that goes into this issue in more detail:

http://lackingrhoticity.blogspot.com/2009/04/python-variable-binding-semantics-part.html

Ok, I'll stop now. :-)

By cks at 2011-10-03 22:16:47:

My reply about what's going on here got sufficiently long that I turned it into an entry, CPythonCellsClosures.

Written on 01 October 2011.
« Some thoughts from a close view of catastrophe
Another comment spam precaution that no longer works out »

Page tools: View Source, View Normal, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Sat Oct 1 01:30:19 2011
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.