== The CPython bytecode difference between iteration and looping Because I feel like it, today I'm going to show you the difference in CPython bytecodes between the two variants of 'repeat N times' that I talked about in [[my entry on the periodic strangeness of idiomatic Python IdiomStrangeness]]. Both are going to be in functions (because that's the common case of where Python code is). .pn prewrap on First, the iteration version: > def iter(max): > for _ in range(max): > pass The core bytecode of the function looks like the following (note that I have simplified the actual bytecode disassembly to make it easier to read): > 0 SETUP_LOOP > 1 LOAD_GLOBAL range > 2 LOAD_FAST max > 3 CALL_FUNCTION > 4 GET_ITER > 5 FOR_ITER (to 8) > 6 STORE_FAST _ > 7 JUMP_ABSOLUTE (to 5) > 8 POP_BLOCK Bytecodes 1 through 3 are the initial setup, ie the call to _range(max)_ and turning the result into an iterator. The actual _for_ loop over the iterator is three bytecode instructions, 5, 6, and 7; if the loop ran actual code it would appear between 6 and 7. (When reading the bytecode, it helps to know that CPython bytecode is stack based; arguments to operations are put on the stack (eg, by various LOAD instructions) and then popped off as part of other operations (eg by STORE instructions). The ((FOR_ITER)) bytecode includes information about where to resume execution when the iterator is exhausted, in this case at bytecode 8.) The explicit loop version is: > def loop(max): > i = 0 > while i < max: > i += 1 There is a bunch more bytecode here: > 0 LOAD_CONST 0 > 1 STORE_FAST i > 2 SETUP_LOOP > 3 LOAD_FAST i > 4 LOAD_FAST max > 5 COMPARE_OP < > 6 POP_JUMP_IF_FALSE (to 12) > 7 LOAD_FAST i > 8 LOAD_CONST 1 > 9 INPLACE_ADD > 10 STORE_FAST i > 11 JUMP_ABSOLUTE (to 3) > 12 POP_BLOCK Setup takes only two bytecodes, to initialize _i_ to 0. But the actual loop involves much more bytecodes; four bytecodes (3-6) are necessary to handle the loop condition and it takes another four (7-10) to increment _i_ (and then a final bytecode, 11, to close the loop). So this code is processing 9 bytecodes every time through the loop, plus it's doing a bunch more stack manipulation than the iteration loop. Now, you might sensibly ask if this difference in how many bytecodes there are makes any real performance difference; after all, an efficient interpreter can run bytecodes pretty fast. The short answer is that it does; on a 64-bit Linux machine with Python 2.7.3, the loop based version is around 3.4 times slower than the iteration based version. I can make the difference even bigger by changing _range()_ to _xrange()_ in the iterator version, which avoids a chunk of overhead. (I timed with a _max_ of 10000, which artificially lowers the setup overhead and increases the actual loop overhead.) Does this performance difference matter? Probably not. In real code you're likely to be looping only for a relatively short number of iterations and the real work you're doing on each iteration will probably dwarf the loop overhead. PS: due to a comment by David B on [[the original entry IdiomStrangeness]], the most efficient way to do this is: > from itertools import repeat > def itools(max): > for _ in repeat(None, max): > pass This beats even the _xrange()_ based version, and not by a little bit. Why it is better is an interesting thing to think about.