Wandering Thoughts archives

2012-08-12

A nice illustration of the cost of creating and destroying objects in Python

Here are two functions (they may look familiar):

def xiter(max):
  for _ in xrange(max):
    pass

from itertools import repeat
def riter(max):
  for _ in repeat(None, max):
    pass

The first thing that makes these functions interesting is that xiter() runs about 1.25 times as slowly as riter(), once you test with a large enough max to drown out startup effects. This is interesting because these functions are almost identical, both in generated bytecode and in what happens at runtime. Both xrange() and repeat() create and return iterator objects, which the for loop then traverses. Both iterator objects are implemented in C, as are xrange() and repeat() themselves, and the other operations are the same between both functions.

In short, the difference between these two functions is how the two iterator objects behave. And the big difference here is object allocation and deallocation; repeat()'s iterator returns the same pre-existing object max times while xrange()'s iterator returns max different integer objects, most of which are created and then destroyed. While it's possible that the C code for one is much worse than the C code for the other, it's reasonable to assume that this 1.25 times performance difference between the two functions is entirely due to the extra overhead of allocating and deallocating all of those integer objects. This is about as pure an illustration of the cost of object creation and deletion as you could ask for.

(Note that the performance difference is not due to the overhead of having a lot of integer objects around. Only one integer object is alive at any given time in the xiter() version.)

python/ObjectCostIllustrated written at 23:47:22; Add Comment

The CPython bytecode difference between iteration and looping

Because I feel like it, today I'm going to show you the difference in CPython bytecodes between the two variants of 'repeat N times' that I talked about in my entry on the periodic strangeness of idiomatic Python. Both are going to be in functions (because that's the common case of where Python code is).

First, the iteration version:

def iter(max):
  for _ in range(max):
    pass

The core bytecode of the function looks like the following (note that I have simplified the actual bytecode disassembly to make it easier to read):

 0   SETUP_LOOP
 1   LOAD_GLOBAL   range
 2   LOAD_FAST     max
 3   CALL_FUNCTION
 4   GET_ITER
 5   FOR_ITER      (to 8)
 6   STORE_FAST    _
 7   JUMP_ABSOLUTE (to 5)
 8   POP_BLOCK

Bytecodes 1 through 3 are the initial setup, ie the call to range(max) and turning the result into an iterator. The actual for loop over the iterator is three bytecode instructions, 5, 6, and 7; if the loop ran actual code it would appear between 6 and 7.

(When reading the bytecode, it helps to know that CPython bytecode is stack based; arguments to operations are put on the stack (eg, by various LOAD instructions) and then popped off as part of other operations (eg by STORE instructions). The FOR_ITER bytecode includes information about where to resume execution when the iterator is exhausted, in this case at bytecode 8.)

The explicit loop version is:

def loop(max):
  i = 0
  while i < max:
    i += 1

There is a bunch more bytecode here:

 0   LOAD_CONST    0
 1   STORE_FAST    i
 2   SETUP_LOOP
 3   LOAD_FAST     i
 4   LOAD_FAST     max
 5   COMPARE_OP    <
 6   POP_JUMP_IF_FALSE (to 12)
 7   LOAD_FAST     i
 8   LOAD_CONST    1
 9   INPLACE_ADD
10   STORE_FAST    i
11   JUMP_ABSOLUTE (to 3)
12   POP_BLOCK

Setup takes only two bytecodes, to initialize i to 0. But the actual loop involves much more bytecodes; four bytecodes (3-6) are necessary to handle the loop condition and it takes another four (7-10) to increment i (and then a final bytecode, 11, to close the loop). So this code is processing 9 bytecodes every time through the loop, plus it's doing a bunch more stack manipulation than the iteration loop.

Now, you might sensibly ask if this difference in how many bytecodes there are makes any real performance difference; after all, an efficient interpreter can run bytecodes pretty fast. The short answer is that it does; on a 64-bit Linux machine with Python 2.7.3, the loop based version is around 3.4 times slower than the iteration based version. I can make the difference even bigger by changing range() to xrange() in the iterator version, which avoids a chunk of overhead.

(I timed with a max of 10000, which artificially lowers the setup overhead and increases the actual loop overhead.)

Does this performance difference matter? Probably not. In real code you're likely to be looping only for a relatively short number of iterations and the real work you're doing on each iteration will probably dwarf the loop overhead.

PS: due to a comment by David B on the original entry, the most efficient way to do this is:

from itertools import repeat
def itools(max):
  for _ in repeat(None, max):
    pass

This beats even the xrange() based version, and not by a little bit. Why it is better is an interesting thing to think about.

python/IdiomStrangenessII written at 00:54:41; Add Comment


Page tools: See As Normal.
Search:
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.