A thought on reading multiline records
I've recently been writing some small programs to digest multiline
records which don't have an end of record marker, just a start of record
one (in my case, the output of Solaris
iostat). People who've written
awk probably know the natural structure that results from dealing with
this purely line at a time; you wind up with a situation where you
accumulate information over the course of the record, and then use the
'start of new record' line to print out everything, reset counters, and
so on.
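For concreteness, here is a minimal sketch of that line-at-a-time shape, written in Python rather than awk. The record format (a line starting with `device` begins a record and every other line is a single integer) and the names `summarize` and `flush` are made up for illustration; this is not the code from my real programs.

```python
def summarize(lines):
    """Sum the numbers in each record, one line at a time, awk style."""
    results = []
    name, total = None, 0

    def flush():
        # Emit the record accumulated so far, if there is one.
        if name is not None:
            results.append((name, total))

    for line in lines:
        if line.startswith("device"):
            # A start of record line: finish the previous record,
            # then reset the accumulated state for the new one.
            flush()
            name, total = line.split()[1], 0
        else:
            total += int(line)
    # There is no end of record marker, so the last record has to be
    # flushed explicitly at end of input.
    flush()
    return results
```

Notice that the 'start of record' branch does double duty (finish the old record, begin the new one) and that the end of input needs its own flush.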
In awk this structure works decently, although it can get unclear and your 'start of new record' code can get quite big. It's more problematic in something like Python, because it cuts against the natural subroutine structure of the problem. The obvious structure is a subroutine that processes a record, but if you do this you wind up passing it the first line of its record and having it return the first line of the next one.
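A sketch of that awkward shape, assuming a hypothetical format where start of record lines begin with `device`. The names `is_start`, `process_record`, and `read_all` are purely illustrative.

```python
import io

def is_start(line):
    # Assumed start of record test; a real one depends on the format.
    return line.startswith("device")

def process_record(first, fo):
    """Process the record that starts with 'first'.

    Returns the record's lines plus the first line of the next record
    ('' at end of file), which the caller must pass back in next time.
    """
    lines = [first]
    while True:
        line = fo.readline()
        if not line or is_start(line):
            return lines, line
        lines.append(line)

def read_all(fo):
    records = []
    line = fo.readline()
    while line:
        # The caller has to shuttle the lookahead line around.
        record, line = process_record(line, fo)
        records.append(record)
    return records
```

It works, but every caller has to know about the lookahead line, which is exactly the leak in the abstraction that bothered me.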
When I was writing code to do this, it struck me that the way out is to have a specialized file reader that returns a special 'end of record' marker as well as an 'end of file' one. This lets your 'process a record' subroutine just read and process lines until it gets an end of record result. (Internally, the specialized reader has to store the first line of the new record and return it the next time it's called.)
There's more overall code in the version of my program that uses the specialized reader approach, but it's clearer code so I like it better.
Sidebar: simple record reader code
Here is the code I wound up using for this:
    import re

    # Unique sentinel returned by readline() at the end of each record.
    EOR = object()

    class RecordReader(object):
        def __init__(self, fo, sre):
            self.pending = None
            self.mre = re.compile(sre)
            self.fo = fo
            self.eof = False
            self.first = True

        def readline(self):
            # Hand back the start of record line we held over last time.
            if self.pending:
                pl = self.pending
                self.pending = None
                return pl
            line = self.fo.readline()
            if not line:
                self.eof = True
                return line
            if self.mre.match(line) and not self.first:
                # Start of the next record: hold the line, report EOR.
                self.pending = line
                return EOR
            else:
                self.first = False
                return line
It takes a file object and an (uncompiled) regular expression that
matches the start of record lines. As I found out the hard way, you need
the .first flag so that you do not return a spurious 'end of record'
when you read the start of record marker at the start of your file.
Given this, we can write the obvious function to read an entire record:
    def readrecord(reader):
        lns = []
        while 1:
            line = reader.readline()
            # Stop at end of file ('') or at the end of this record.
            if not line or line is EOR:
                break
            lns.append(line)
        return lns
The .eof flag is useful to avoid having to propagate end of file
status up from the 'process a record' function to your main loop;
you can write the main loop as just:
    while not reader.eof:
        process_record(reader)
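To see the pieces working together, here is a small self-contained run against an in-memory file. It repeats a condensed copy of the reader so that the block runs on its own. The sample data, the `^device` pattern, and the `collect_records` driver are invented for the demonstration.

```python
import io
import re

EOR = object()

class RecordReader(object):
    # Condensed copy of the reader from this sidebar.
    def __init__(self, fo, sre):
        self.pending = None
        self.mre = re.compile(sre)
        self.fo = fo
        self.eof = False
        self.first = True

    def readline(self):
        if self.pending:
            pl, self.pending = self.pending, None
            return pl
        line = self.fo.readline()
        if not line:
            self.eof = True
            return line
        if self.mre.match(line) and not self.first:
            self.pending = line
            return EOR
        self.first = False
        return line

def readrecord(reader):
    lns = []
    while 1:
        line = reader.readline()
        if not line or line is EOR:
            break
        lns.append(line)
    return lns

def collect_records(text, sre):
    # Hypothetical driver: gather every record in 'text' into a list.
    reader = RecordReader(io.StringIO(text), sre)
    records = []
    while not reader.eof:
        records.append(readrecord(reader))
    return records

sample = "device sd0\n1 2\n3 4\ndevice sd1\n5 6\n"
records = collect_records(sample, r"^device")
```

One thing this makes visible: with this main loop, an empty input still produces a single empty record, so a real process_record probably wants to cope with getting no lines at all.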