Two problems with Python's file iterators

December 26, 2007

Modern versions of Python let you process each line in a file in a simple way, with just 'for line in fp: ...', replacing either manual while loops with .readline() or the memory inefficiency of letting .readlines() pull the entire thing into memory. But there are two bugs, both of which can be illustrated by running a 'pycat' program:

import sys
for line in sys.stdin:
    sys.stdout.write(line)

If you run this without standard input redirected, you will immediately notice the problems:

  • the program only gets lines from standard input in big blocks, instead of one line at a time.
  • you (almost always) have to give two ^D's to the program before it sees end of file and exits.

Both problems are caused by the same underlying decision: despite using Unix's traditional stdio functions, which do their own buffering, Python adds its own layer of forced buffering for file iteration. This forced buffering even has the perverse effect that you can't mix file iteration and explicit .readline() et al, even if you break out of the iteration loop.

(Since this is a deliberate and long standing design decision, I suspect that the Python people are not interested in bug reports.)

These bugs might seem relatively minor, except that reading from terminals isn't the only case where you really need to handle input a line at a time, without insisting on buffering up a bunch of it; another is dealing with line-oriented network protocols.
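For illustration, here is a minimal sketch of line at a time reading over a socket; read_lines() is a hypothetical helper (not anything standard), and a socketpair stands in for a real network connection:

```python
import socket

def read_lines(sock):
    # Hypothetical helper: wrap the socket in a file-like object and
    # pull lines out with .readline(), so each line is handed to the
    # caller as soon as its newline arrives instead of waiting for a
    # big block of input to accumulate.
    fp = sock.makefile('r')
    while 1:
        line = fp.readline()
        if not line:
            break
        yield line

# Simulate a peer with a connected socket pair.
a, b = socket.socketpair()
b.sendall(b"HELO client\nQUIT\n")
b.close()
print(list(read_lines(a)))  # ['HELO client\n', 'QUIT\n']
a.close()
```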

As a result of running into these issues I reflexively avoid file iteration in my own code, which makes me grumpy when I write yet another 'read the lines' loop. (By now, I have the necessary while pattern memorized.)

Sidebar: coding around the problem

The necessary while pattern for reading from files is:

while 1:
    line = fp.readline()
    if not line:
        break
    ... process line ...
# have reached EOF

Note that even if you want the newline stripped off the end of the line, you do not want to strip it before you do the 'if not line' check; otherwise you will think that blank lines are the end of the file.

(Speaking from personal experience, this is an embarrassing mistake to make, although you usually catch it fast.)
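To make the ordering issue concrete, here is a small sketch using io.StringIO as a stand-in for a real file; count_lines_wrong() and count_lines_right() are just illustrative names:

```python
import io

def count_lines_wrong(fp):
    # BUG: stripping the newline before the EOF check makes a blank
    # line look exactly like end of file.
    n = 0
    while 1:
        line = fp.readline().rstrip('\n')
        if not line:
            break
        n += 1
    return n

def count_lines_right(fp):
    # Check for EOF first, then strip the newline.
    n = 0
    while 1:
        line = fp.readline()
        if not line:
            break
        line = line.rstrip('\n')
        n += 1
    return n

data = "first\n\nthird\n"
print(count_lines_wrong(io.StringIO(data)))   # 1 -- stops at the blank line
print(count_lines_right(io.StringIO(data)))   # 3 -- sees every line
```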

It's also possible to fix things up with an 'iterfile' routine, like this:

def iterfile(fp):
    while 1:
        line = fp.readline()
        if not line:
            return
        yield line

Then instead of 'for line in fp:', just use 'for line in iterfile(fp):'. And of course you can mix this with regular reads from the file without anything getting too confused. You may still have the double EOF problem, depending on how you structure your program; unfortunately, file objects don't remember if they've seen an EOF, so iterfile() itself can't avoid the problem.
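As a quick sketch of the mixing (again using io.StringIO as a stand-in for a file object): because iterfile() never reads ahead, you can break out of the loop and pick up with .readline() at exactly the next line:

```python
import io

def iterfile(fp):
    # Line-at-a-time iteration with no extra buffering layer.
    while 1:
        line = fp.readline()
        if not line:
            return
        yield line

fp = io.StringIO("one\ntwo\nthree\nfour\n")
for line in iterfile(fp):
    if line == "two\n":
        break
# iterfile() has consumed only what it yielded, so the file position
# is exactly where the loop left off.
print(fp.readline())  # three
```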
