A thought on reading multiline records

December 16, 2007

I've recently been writing some small programs to digest multiline records which don't have an end of record marker, just a start of record one (in my case, the output of Solaris iostat). People who've written awk probably know the natural structure that results from dealing with this purely line at a time; you wind up with a situation where you accumulate information over the course of the record, and then use the 'start of new record' line to print out everything, reset counters, and so on.
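As a rough illustration of that line-at-a-time shape, here is a minimal Python sketch. The 'START' marker, the sample data and the summarize_lines name are all invented for the example; real iostat output looks nothing like this.

```python
def summarize_lines(lines, marker="START"):
    """Sum the numeric lines of each record, line-at-a-time style."""
    totals = []
    total = 0
    seen_record = False
    for line in lines:
        if line.startswith(marker):
            # The 'start of new record' line doubles as the end of the
            # previous record: emit what we accumulated, then reset.
            if seen_record:
                totals.append(total)
            total = 0
            seen_record = True
        else:
            total += int(line)
    # End of file also ends the last record, so the output code
    # has to be repeated here.
    if seen_record:
        totals.append(total)
    return totals

sample = ["START\n", "1\n", "2\n", "START\n", "10\n"]
print(summarize_lines(sample))
```

Note how the 'emit a record' logic lives in two places, and how the start of record branch is where most of the work piles up.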

In awk this structure works decently, although it can get unclear and your 'start of record' code can get quite big. It's more problematic in something like Python, because it cuts against the natural subroutine structure of the problem. The obvious structure is a subroutine that processes a record, but if you do this you wind up passing it the first line of its record and having it return the first line of the next one.
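To make the awkwardness concrete, here is a small sketch of what that subroutine interface tends to look like. The process_record name, the 'START' marker and the sample input are assumptions made up for illustration, not code from my actual program.

```python
import io

def process_record(fo, first_line, marker="START"):
    """Collect one record; return (record_lines, first_line_of_next_record)."""
    lines = [first_line]
    while True:
        line = fo.readline()
        if not line or line.startswith(marker):
            # '' means end of file; otherwise this line belongs to the
            # next record and has to be handed back to the caller.
            return lines, line
        lines.append(line)

fo = io.StringIO("START a\n1\nSTART b\n2\n")
records = []
line = fo.readline()
while line:
    # The caller has to thread the 'next first line' through every call.
    record, line = process_record(fo, line)
    records.append(record)
```

The record-processing function and its caller both have to know about the lookahead line, which is the part that feels wrong.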

When I was writing code to do this, it struck me that the way out is to have a specialized file reader that returns a special 'end of record' marker as well as an 'end of file' one. This lets your 'process a record' subroutine just read and process lines until it gets an end of record result. (Internally, the specialized reader has to store the first line of the new record and return it the next time it's called.)

There's more overall code in the version of my program that uses the specialized reader approach, but it's clearer code so I like it better.

Sidebar: simple record reader code

Here is the code I wound up using for this:

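import re

# Unique sentinel returned by readline() at the end of each record.
EOR = object()
class RecordReader(object):
    def __init__(self, fo, sre):
        self.pending = None
        self.mre = re.compile(sre)
        self.fo = fo
        self.eof = False
        self.first = True
    def readline(self):
        # Hand back the saved start of record line, if we have one.
        if self.pending:
            pl = self.pending
            self.pending = None
            return pl
        line = self.fo.readline()
        if not line:
            self.eof = True
            return line
        # A start of record line ends the previous record, except for
        # the very first line of the file.
        if self.mre.match(line) and not self.first:
            self.pending = line
            return EOR
        self.first = False
        return line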

It takes a file object and an (uncompiled) regular expression that matches the start of record lines. As I found out the hard way, you need the .first flag so that you do not return a spurious 'end of record' when you read the start of record marker at the start of your file.

Given this, we can write the obvious function to read an entire record:

def readrecord(reader):
    lns = []
    while 1:
        line = reader.readline()
        # Stop at end of file ('') or at the end of this record.
        if not line or line is EOR:
            break
        lns.append(line)
    return lns

The .eof flag is useful to avoid having to propagate end of file status up from the 'process a record' function to your main loop; you can write the main loop as just:

while not reader.eof:
    process(readrecord(reader))  # process() stands in for your own record handling