Iterator & Generator Gotchas
Python iterators are objects (or functions, using some magic) that repeatedly produce values, one at a time, until they get exhausted. Python introduced this general feature to efficiently support things like:
    for line in fp:
        ...  # do something with each line

Without iterators, you would have to call fp.readlines(), which reads the entire file into memory, splits it up into lines, and returns a huge list; with iteration, this code only has one line in memory at any given time, even if the file is tens or hundreds of megabytes.
Generators are functions that magically create iterators instead of just returning values (ignoring some technicalities). Generators are the most common gateway to iterators, and are thus the more commonly used term for the whole area.
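As a minimal sketch of the distinction (the name `countdown` is made up for illustration), a generator function uses `yield`, and calling it hands back an iterator that can be walked exactly once:

```python
def countdown(n):
    """Generator function: calling it returns an iterator, not a list."""
    while n > 0:
        yield n
        n -= 1

it = countdown(3)
first_pass = list(it)    # consumes the iterator
second_pass = list(it)   # the iterator is already exhausted, so this is empty
```

The second `list()` call comes back empty, which is the root of most of the gotchas here.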
When iterators were introduced, a number of standard things that had previously returned lists started returning iterators, and using a generator instead of just returning a list became part of the common Python programming idioms.
In many cases it can be tempting, and temptingly easy, to replace things that return lists with generators; it looks like it should just work, and it mostly does. It can be similarly tempting to just ignore the difference in the standard Python modules.
But there are some gotchas when you write code like this, and I have the stubbed toes to prove it. At one point or another, I've made all of these iterator-confusion mistakes in my code.
Iterators are always true
    t = generate_list(some, inputs)
    if not t:
        return
    print("Header Line:")
    for item in t:
        ...
If generate_list returns an iterator instead of a list, this code doesn't work right. Unless someone got quite fancy, iterator objects are always true, unlike lists, which are only true if they contain something.
There's really no way to see if an iterator contains anything except to try to get a value from it. And there's no 'push value back onto iterator' operation.
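One possible workaround, sketched here under assumptions: pull the first value out, and if there was one, build a new iterator with that value glued back on the front. The helper `peek` and the stand-in `generate_items` are hypothetical names, not anything from the standard library; only `next()` with a default and `itertools.chain` are real.

```python
import itertools

def generate_items(values):
    """Hypothetical stand-in for generate_list that yields lazily."""
    for v in values:
        yield v

def peek(iterator):
    """Return (is_empty, iterator); the returned iterator still yields every item."""
    sentinel = object()
    first = next(iterator, sentinel)
    if first is sentinel:
        return True, iter(())
    # Not a real push-back: this builds a new iterator with the value in front.
    return False, itertools.chain([first], iterator)

always_true = bool(generate_items([]))   # a generator is truthy even when empty

is_empty, t = peek(generate_items([]))
is_empty2, t2 = peek(generate_items(["a", "b"]))
items = list(t2)
```

The caller has to use the iterator that `peek` returns rather than the original one, which is easy to forget.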
Iterators can't be saved
    def cached_lookup(what):
        if what not in cache:
            cache[what] = real_lookup(what)
        return cache[what]
If real_lookup returns iterators, this code doesn't work. When an iterator's exhausted, it's exhausted; if you try to use it again (such as if cached_lookup found it as a cached result), it generates nothing.
(Technically I believe there are semi-magical ways to copy iterators. I suspect one is best off avoiding them unless you really have to save an iterator copy.)
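The low-magic fix is to expand the iterator before caching it. Here is a sketch, assuming the results are small enough to hold in memory; this `real_lookup` is a made-up stand-in that just yields a few strings:

```python
cache = {}

def real_lookup(what):
    """Hypothetical lookup that returns a generator (a one-shot iterator)."""
    for i in range(3):
        yield "%s-%d" % (what, i)

def cached_lookup(what):
    if what not in cache:
        # Materialize the iterator so the cached value can be reused.
        cache[what] = tuple(real_lookup(what))
    return cache[what]

first = list(cached_lookup("x"))
second = list(cached_lookup("x"))   # served from the cache, same contents
```

A tuple is used here so that callers can't accidentally modify the cached value. The standard library's itertools.tee can split one iterator into several, but each of the results is itself one-shot, so it doesn't help with a cache.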
I can't use list methods on iterators
    t = generate_list(some, inputs)
    t.sort()
    t = t[:firstN]
    # ... admire the pretty explosions
Of course, iterators don't have list methods like .sort(), and they don't support len() or slicing either. If you want to use those, you have to write:

    t = list(generate_list(some, inputs))
    t.sort()
    t = t[:firstN]
Fortunately, list() will expand the iterator for you and is harmless to apply to real lists, so you can use it without having to care if the generate_list routine changes what it returns.
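There are also a couple of standard tools that take any iterable directly, which can save the explicit list() step. This is a sketch with a hypothetical `generate_numbers` standing in for generate_list:

```python
import itertools

def generate_numbers():
    """Hypothetical generator standing in for generate_list."""
    for n in [5, 3, 8, 1]:
        yield n

firstN = 2

# sorted() accepts any iterable and always returns a new list.
smallest = sorted(generate_numbers())[:firstN]

# islice() takes the first N items in arrival order, without sorting
# and without expanding the whole iterator first.
head = list(itertools.islice(generate_numbers(), firstN))
```

Note that the two do different things: the first sorts and then truncates, as in the code above, while the second just stops early.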
Writing recursive generators
Sometimes the most natural structure for a generator is a recursive one. This works, but you have to bear in mind a twist: you cannot simply yield the result of each recursive call. The recursive results are themselves iterators, so if you yield them directly your callers get iterators that produce a stream of iterators that produce a stream of iterators that someday, at some level, produce actual results. (But by that time the caller has given up in despair.)
Instead, each time you recurse you have to expand the resulting iterator and yield each of its results, like so:
    def treewalk(node):
        if not node:
            return
        yield node.value
        for val in treewalk(node.left):
            yield val
        for val in treewalk(node.right):
            yield val
This implies that deeply recursive generators can be quite inefficient, as every result has to trickle up through all the levels involved.
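In Python 3.3 and later, `yield from` does the expand-and-re-yield step in one line. Here is a sketch of the same walk written that way; the `Node` class is invented for the example, since the tree structure itself isn't shown here:

```python
class Node:
    """Minimal binary tree node, defined here only for the example."""
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def treewalk(node):
    if node is None:
        return
    yield node.value
    # yield from re-yields every value produced by the recursive call.
    yield from treewalk(node.left)
    yield from treewalk(node.right)

tree = Node(1, Node(2, Node(3)), Node(4))
values = list(treewalk(tree))   # node first, then left subtree, then right
```

This is tidier to write, but as far as I know it doesn't by itself make the recursion free; results still come up through the nested generators.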