Two problems with Python's file iterators
Modern versions of Python let you process each line in a file in a simple way, with just 'for line in fp: ...', replacing either manual while loops with .readline() or the memory inefficiency of letting .readlines() pull the entire thing into memory. But there are two bugs, both of which can be illustrated by running a 'pycat' program:
    import sys
    for line in sys.stdin:
        sys.stdout.write(line)
If you run this without standard input redirected, you will immediately notice the problems:
- the program only gets lines from standard input in big blocks, instead of one line at a time.
- you (almost always) have to give two ^D's to the program before it sees end of file and exits.
Both problems are caused by the same underlying decision: despite using Unix's traditional stdio functions, which do their own buffering, Python adds its own layer of forced buffering for file iteration. This forced buffering even has the perverse effect that you can't mix file iteration and explicit .readline() et al, even if you break out of the iteration loop.
(Since this is a deliberate and long standing design decision, I suspect that the Python people are not interested in bug reports.)
These bugs might seem relatively minor, except that reading from terminals isn't the only case where you really need to handle input a line at a time, without insisting on buffering up a bunch of it; another is dealing with line oriented network protocols.
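To make the network case concrete, here is a minimal sketch of reading a line oriented protocol one line at a time with an explicit .readline() call. It is written for Python 3 and makes several assumptions that are not from the text above: socket.socketpair() stands in for a real connection, the 'HELO'/'QUIT' lines are made up, and read_one_line() is a hypothetical helper name.

```python
import socket

def read_one_line(rfile):
    """Return the next line without its line ending, or None at EOF."""
    line = rfile.readline()
    if not line:              # '' means the peer closed the connection
        return None
    return line.rstrip("\r\n")

# socketpair() gives two connected sockets in one process; it is only a
# stand-in for a real client/server connection (an assumption of this sketch).
server_side, client_side = socket.socketpair()
rfile = server_side.makefile("r", encoding="ascii", newline="\n")

client_side.sendall(b"HELO example\r\n")   # made-up protocol line
first = read_one_line(rfile)               # handled before anything else is sent

client_side.sendall(b"QUIT\r\n")
second = read_one_line(rfile)

client_side.close()
third = read_one_line(rfile)               # peer has closed: EOF

rfile.close()
server_side.close()
```

The point of the sketch is only that each command gets handled as soon as its line has arrived, which is what a request/response protocol needs.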
As a result of running into these issues I reflexively avoid file iteration in my own code, which makes me grumpy when I write yet another 'read the lines' loop. (By now, I have the necessary while pattern memorized.)
Sidebar: coding around the problem
The necessary while pattern for reading from files is:
    while 1:
        line = fp.readline()
        if not line:
            break
        ... process line ...
    # have reached EOF
Note that even if you want the newline stripped off the end of the line, you do not want to strip it before you do the 'if not line' check; otherwise you will think that blank lines are the end of the file.
(Speaking from personal experience, this is an embarrassing mistake to make, although you usually catch it fast.)
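The difference between the two orderings is easy to see side by side. This is a small sketch for Python 3 that uses io.StringIO in place of a real file; the function names and the sample text are made up for illustration.

```python
import io

def read_lines_stripped(fp):
    """Check for EOF first, then strip the newline."""
    lines = []
    while True:
        line = fp.readline()
        if not line:                  # '' only at EOF; a blank line is '\n'
            break
        lines.append(line.rstrip("\n"))
    return lines

def read_lines_buggy(fp):
    """Strip first, then check: a blank line now looks like EOF."""
    lines = []
    while True:
        line = fp.readline().rstrip("\n")   # stripped too early
        if not line:
            break
        lines.append(line)
    return lines

sample = "first\n\nthird\n"           # the second line is blank
good = read_lines_stripped(io.StringIO(sample))
bad = read_lines_buggy(io.StringIO(sample))
```

Here good is ['first', '', 'third'], while bad stops at the blank line and is just ['first'].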
It's also possible to fix things up with an 'iterfile' routine, like this:
    def iterfile(fp):
        while 1:
            line = fp.readline()
            if not line:
                return
            yield line
Then instead of 'for line in fp:', just use 'for line in iterfile(fp):'. And of course you can mix this with regular reads from the file without anything getting too confused. You may still have the double EOF problem, depending on how you structure your program; unfortunately, file objects don't remember if they've seen an EOF, so iterfile() itself can't avoid the problem.
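As an illustration of mixing the two, here is a sketch (for Python 3, with io.StringIO standing in for a real file) that reads header lines through iterfile(), breaks out of the loop at the first blank line, and then picks up the rest with an ordinary .read(). The split_message() helper and the sample message are hypothetical, invented for this example.

```python
import io

def iterfile(fp):
    """Yield lines from fp using only .readline(), with no extra buffering."""
    while True:
        line = fp.readline()
        if not line:
            return
        yield line

def split_message(fp):
    """Read header lines up to a blank line, then read the rest directly."""
    headers = []
    for line in iterfile(fp):
        if line == "\n":              # blank line ends the headers
            break
        headers.append(line.rstrip("\n"))
    body = fp.read()                  # a regular read, after leaving the loop
    return headers, body

sample = "Subject: hi\nFrom: me\n\nline one\nline two\n"
headers, body = split_message(io.StringIO(sample))
```

Because iterfile() never reads past the line it hands back, the .read() afterwards starts exactly where the loop left off. For text-mode files the built-in two-argument form iter(fp.readline, '') should behave the same way as iterfile(fp).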