Wandering Thoughts archives

2007-12-26

Two problems with Python's file iterators

Modern versions of Python let you process each line in a file in a simple way, with just 'for line in fp: ...', replacing either manual while loops with .readline() or the memory inefficiency of letting .readlines() pull the entire thing into memory. But there are two bugs, both of which can be illustrated by running a 'pycat' program:

import sys
for line in sys.stdin:
    sys.stdout.write(line)

If you run this without standard input redirected, you will immediately notice the problems:

  • the program only gets lines from standard input in big blocks, instead of one line at a time.
  • you (almost always) have to give two ^D's to the program before it sees end of file and exits.

Both problems are caused by the same underlying decision: despite using Unix's traditional stdio functions, which do their own buffering, Python adds its own layer of forced buffering for file iteration. This forced buffering even has the perverse effect that you can't mix file iteration and explicit .readline() et al, even if you break out of the iteration loop.
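
To illustrate the mixing problem, here is a minimal sketch (this assumes Python 2's file objects and a file with more than one line; the readline() call fails with a ValueError complaining about mixing iteration and read methods):

fp = open("somefile")
for line in fp:
    break              # stop after what looks like one line
fp.readline()          # fails: iteration has already buffered ahead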

(Since this is a deliberate and long-standing design decision, I suspect that the Python people are not interested in bug reports.)

These bugs might seem relatively minor, except that reading from terminals isn't the only case where you really need to handle input a line at a time, without insisting on buffering up a bunch of it; another is dealing with line oriented network protocols.

As a result of running into these issues I reflexively avoid file iteration in my own code, which makes me grumpy when I write yet another 'read the lines' loop. (By now, I have the necessary while pattern memorized.)

Sidebar: coding around the problem

The necessary while pattern for reading from files is:

while 1:
    line = fp.readline()
    if not line:
        break
    # ... process line ...
# have reached EOF

Note that even if you want the newline stripped off the end of the line, you do not want to strip it before you do the 'if not line' check; otherwise you will think that blank lines are the end of the file.

(Speaking from personal experience, this is an embarrassing mistake to make, although you usually catch it fast.)
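
To make the failure mode concrete, here is a sketch of the wrong ordering; a blank line reads as "\n", strips to "", and falsely trips the end of file test:

while 1:
    line = fp.readline().rstrip("\n")   # wrong: stripped before the check
    if not line:
        break                           # stops at the first blank line too
    # ... process line ...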

It's also possible to fix things up with an 'iterfile' routine, like this:

def iterfile(fp):
    while 1:
        line = fp.readline()
        if not line:
            return
        yield line

Then instead of 'for line in fp:', just use 'for line in iterfile(fp):'. And of course you can mix this with regular reads from the file without anything getting too confused. You may still have the double EOF problem, depending on how you structure your program; unfortunately, file objects don't remember if they've seen an EOF, so iterfile() itself can't avoid the problem.
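
For instance, pycat from the start of this entry becomes just:

import sys
for line in iterfile(sys.stdin):
    sys.stdout.write(line)

Run against a terminal, this writes each line back as you enter it instead of sitting on a buffer's worth of input.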

FileIteratorProblems written at 23:33:12

2007-12-16

A thought on reading multiline records

I've recently been writing some small programs to digest multiline records which don't have an end of record marker, just a start of record one (in my case, the output of Solaris iostat). People who've written awk probably know the natural structure that results from dealing with this purely line at a time; you wind up with a situation where you accumulate information over the course of the record, and then use the 'start of new record' line to print out everything, reset counters, and so on.

In awk this structure works decently, although it can get unclear and your 'start of record' code can get quite big. It's more problematic in something like Python, because it cuts against the natural subroutine structure of the problem. The obvious structure is a subroutine that processes a record, but if you do this you wind up passing it the first line of its record and having it return the first line of the next one.
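
Rendered in Python, the accumulate and flush structure looks something like this (a sketch; is_start_of_record() and report() are hypothetical stand-ins, and lines is whatever you're reading from). You can see how the record handling gets smeared across the loop instead of living in one subroutine:

acc = []
for line in lines:
    if is_start_of_record(line):
        if acc:
            report(acc)    # the new record's first line flushes the old one
        acc = []
    acc.append(line)
if acc:
    report(acc)            # the final record has no next record to flush it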

When I was writing code to do this, it struck me that the way out is to have a specialized file reader that returns a special 'end of record' marker as well as an 'end of file' one. This lets your 'process a record' subroutine just read and process lines until it gets an end of record result. (Internally, the specialized reader has to store the first line of the new record and return it the next time it's called.)

There's more overall code in the version of my program that uses the specialized reader approach, but it's clearer code so I like it better.

Sidebar: simple record reader code

Here is the code I wound up using for this:

import re
EOR = object()      # unique 'end of record' marker
class RecordReader(object):
    def __init__(self, fo, sre):
        self.pending = None         # saved first line of the next record
        self.mre = re.compile(sre)
        self.fo = fo
        self.eof = False
        self.first = True           # haven't seen the first record yet
    def readline(self):
        if self.pending:
            pl = self.pending
            self.pending = None
            return pl
        line = self.fo.readline()
        if not line:
            self.eof = True
            return line
        if self.mre.match(line) and not self.first:
            # start of a new record: save its first line for next time
            # and report end of record to the caller
            self.pending = line
            return EOR
        else:
            self.first = False
            return line

It takes a file object and an (uncompiled) regular expression that matches the start of record lines. As I found out the hard way, you need the .first flag so that you do not return a spurious 'end of record' when you read the start of record marker at the start of your file.

Given this, we can write the obvious function to read an entire record:

def readrecord(reader):
    lns = []
    while 1:
        line = reader.readline()
        if not line or line is EOR:
            break
        lns.append(line)
    return lns

The .eof flag is useful to avoid having to propagate end of file status up from the 'process a record' function to your main loop; you can write the main loop as just:

while not reader.eof:
    process_record(reader)
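
Putting the pieces together, a minimal sketch of how this gets wired up (the start of record regular expression and the contents of process_record() here are only illustrative placeholders, not the real iostat-digesting code):

import sys

def process_record(reader):
    lns = readrecord(reader)
    # ... digest one record's worth of lines here ...

# an illustrative start of record pattern; use whatever matches your input
reader = RecordReader(sys.stdin, r'^extended device statistics')
while not reader.eof:
    process_record(reader)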
ReadingRecordsThought written at 22:57:13

2007-12-02

How my CGI to CGI/SCGI frontend works

In ExploitingPolymorphicWSGI, I talked about how I use the flexibility of WSGI to run DWiki as either a CGI or an SCGI server by using a small frontend CGI program. Because there are some subtle bits to this, I thought I would write down how the CGI frontend works.

The overall logic is:

  • try to talk to an existing SCGI daemon.
  • otherwise, check the load and try to start a daemon if the load is between a minimum and a maximum, and then talk to it.
  • otherwise, if the load is too high, send out an error message about the system being overloaded.
  • otherwise, exec() the CGI version of DWiki.
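
In rough Python pseudocode (purely illustrative; every name here is hypothetical and this is not the frontend's actual code), the flow is something like:

import os, sys

conn = try_connect()                  # is a daemon already listening?
if conn is None:
    load = os.getloadavg()[0]
    if MINLOAD <= load <= MAXLOAD:
        conn = connect_or_start()     # see the locking sketch further down
    elif load > MAXLOAD:
        overload_error()
        sys.exit(0)
if conn is not None:
    do_scgi_request(conn)
else:
    os.execv(CGI_DWIKI, [CGI_DWIKI])  # fall back to plain CGI DWiki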

There are two tricky bits: starting the daemon and handling errors during conversations with the daemon.

If the CGI gets no communication back from the daemon during the SCGI conversations, it decides that something bad has gone wrong and it sends out the overload error message. It can do this because the CGI and the daemon communicate over Unix domain sockets, which lets the daemon get around the socket listen problem; the daemon doesn't abruptly drop connections just because it's shutting down, so any communication issues are serious problems.

(There is no general way to recover from a communication failure with the SCGI daemon, because the CGI may have already consumed part of a POST body and sent it to the daemon. I ran into this exact issue in an earlier version of the CGI and SCGI daemon, where I did not have a clean daemon shutdown and the CGI frontend reacted to communication failures by going on to exec() the CGI version of DWiki.)

The complicated part of starting the daemon is that under load, several CGI processes may all decide that they should start a daemon. This would be bad. To avoid it, CGI processes must obtain a lock (a flock() on a synchronization file) before they try to start the daemon, so that only one can be doing it at once. The full logic is:

  • try to get the lock, which may time out.
  • try to get a connection, because another process might have just finished starting the daemon and released its lock. If you get a connection, you're done.
  • if you have the lock but not a connection, this process won the race to be the daemon starter; it forks and execs the SCGI daemon.
  • whether or not you have the lock, loop (sleeping between attempts) waiting for the SCGI daemon to actually start accepting connections; this too may time out.

After all of this, you release the lock if you have it (whether or not you successfully got a connection).
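
Here is a sketch of that locking dance in Python (again purely illustrative: the paths, the timeouts, and start_scgi_daemon() are all hypothetical names, and this is not the frontend's real code):

import fcntl, os, socket, time

LOCKFILE = "/some/where/scgi.lock"    # hypothetical paths
SOCKPATH = "/some/where/scgi.sock"

def try_connect():
    # one attempt to reach the daemon over its Unix domain socket
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.connect(SOCKPATH)
        return s
    except socket.error:
        s.close()
        return None

def connect_or_start(timeout=15):
    lockfd = os.open(LOCKFILE, os.O_RDWR | os.O_CREAT, 0o666)
    # try to get the lock, polling instead of blocking forever
    locked = False
    deadline = time.time() + timeout
    while not locked and time.time() < deadline:
        try:
            fcntl.flock(lockfd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            locked = True
        except IOError:
            time.sleep(0.2)
    # someone else may have just started the daemon and dropped their lock
    conn = try_connect()
    if conn is None and locked:
        # we won the race to be the daemon starter
        start_scgi_daemon()           # hypothetical: fork() and exec() the daemon
    # whether or not we hold the lock, wait for the daemon to answer
    deadline = time.time() + timeout
    while conn is None and time.time() < deadline:
        time.sleep(0.5)
        conn = try_connect()
    if locked:
        fcntl.flock(lockfd, fcntl.LOCK_UN)
    os.close(lockfd)
    return conn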

Since starting a daemon on a heavily loaded system may take some time, the CGI has to do at least some waiting. It has timeouts just in case, because at some point it is better for things to go down in flames than keep hammering the system.

(I should really track the child PID and kill it if we started the SCGI daemon but failed to get a connection within the timeout interval, since the attempted invariant is that when you release the lock, either the daemon has been started successfully or it is safe for another process to try.)

Although there are locking methods other than flock(), flock() has the useful property that the lock is guaranteed to evaporate if the process goes away. While I could put the locking into the SCGI daemon itself, it's better to put it into the lightweight CGI that is already running than into a relatively heavyweight Python program that would have to be started.

(Looking at the code, I see that the SCGI daemon is inheriting the flock() file descriptor. I should probably fix that.)
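
The usual fix is to mark the lock file descriptor close-on-exec before forking; a minimal sketch, assuming the descriptor is called lockfd as in the sketch above:

import fcntl

# set FD_CLOEXEC so the fork()ed and exec()ed daemon does not inherit
# the flock()ed descriptor
flags = fcntl.fcntl(lockfd, fcntl.F_GETFD)
fcntl.fcntl(lockfd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)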

HowCGIFrontendWorks written at 23:43:47

