A thought on reading multiline records
I've recently been writing some small programs to digest multiline
records which don't have an end of record marker, just a start of record
one (in my case, the output of Solaris iostat). People who've written
awk probably know the natural structure that results from dealing with
this purely line at a time; you wind up with a situation where you
accumulate information over the course of the record, and then use the
'start of new record' line to print out everything, reset counters, and
so on.
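As a rough sketch of that line-at-a-time shape translated into Python: the `== ` start-of-record marker, the `summarize` name, and the "count the lines in each record" task are all made up for illustration, and real iostat output looks different.

```python
import re

# Hypothetical start-of-record pattern, for illustration only.
START = re.compile(r"^== ")

def summarize(lines):
    """Count the body lines of each record, one input line at a time."""
    results = []
    count = None                        # None means no record seen yet
    for line in lines:
        if START.match(line):
            if count is not None:
                results.append(count)   # 'print out everything'
            count = 0                   # 'reset counters'
        elif count is not None:
            count += 1                  # accumulate over the record
    if count is not None:
        results.append(count)           # the last record has no successor
    return results

sample = ["== a\n", "x\n", "y\n", "== b\n", "z\n"]
print(summarize(sample))   # prints [2, 1]
```

Note the extra flush after the loop: since there is no end of record marker, the final record only ends when the input does.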
In awk this structure works decently, although it can get unclear and your 'start of record' code can get quite big. It's more problematic in something like Python, because it cuts against the natural subroutine structure of the problem. The obvious structure is a subroutine that processes a record, but if you do this you wind up passing it the first line of its record and having it return the first line of the next one.
When I was writing code to do this, it struck me that the way out is to have a specialized file reader that returns a special 'end of record' marker as well as an 'end of file' one. This lets your 'process a record' subroutine just read and process lines until it gets an end of record result. (Internally, the specialized reader has to store the first line of the new record and return it the next time it's called.)
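To make the awkwardness concrete, here is a minimal sketch of that subroutine structure. The `== ` marker and the names `process_record` and `read_all` are hypothetical, chosen only for this illustration.

```python
import io
import re

START = re.compile(r"^== ")   # hypothetical start-of-record marker

def process_record(fo, first_line):
    """Handle one record whose first line was already read by the caller.

    Returns (record_lines, first_line_of_next_record); the second item
    is '' at end of file.
    """
    lines = [first_line]
    while True:
        line = fo.readline()
        if not line or START.match(line):
            return lines, line
        lines.append(line)

def read_all(fo):
    records = []
    line = fo.readline()          # the caller has to prime the first line
    while line:
        # Each call hands back the line it read past its own record.
        record, line = process_record(fo, line)
        records.append(record)
    return records

records = read_all(io.StringIO("== a\nx\n== b\ny\nz\n"))
print(len(records))   # prints 2
```

Every caller has to thread that extra line in and out, which is the part that feels unnatural.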
There's more overall code in the version of my program that uses the specialized reader approach, but it's clearer code so I like it better.
Sidebar: simple record reader code
Here is the code I wound up using for this:
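    import re

    # Unique sentinel returned by readline() at the end of a record.
    EOR = object()

    class RecordReader(object):
        def __init__(self, fo, sre):
            self.pending = None         # saved first line of the next record
            self.mre = re.compile(sre)  # matches start of record lines
            self.fo = fo
            self.eof = False
            self.first = True           # no line returned yet

        def readline(self):
            # Hand back the start of record line saved on the previous call.
            if self.pending:
                pl = self.pending
                self.pending = None
                return pl
            line = self.fo.readline()
            if not line:
                self.eof = True
                return line
            if self.mre.match(line) and not self.first:
                # A new record starts here: save its first line and
                # report the end of the current record.
                self.pending = line
                return EOR
            else:
                self.first = False
                return line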
It takes a file object and an (uncompiled) regular expression that
matches the start of record lines. As I found out the hard way, you need
the .first flag so that you do not return a spurious 'end of record'
when you read the start of record marker at the start of your file.
Given this, we can write the obvious function to read an entire record:
    def readrecord(reader):
        lns = []
        while True:
            line = reader.readline()
            if not line or line is EOR:
                break
            lns.append(line)
        return lns
The .eof flag is useful to avoid having to propagate end of file
status up from the 'process a record' function to your main loop;
you can write the main loop as just:
    while not reader.eof:
        process_record(reader)
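For a quick end-to-end check, here is a small self-contained run against an in-memory file. It repeats the reader and `readrecord` from above so the snippet runs on its own; the `== disk0` style sample data is invented for the example and is not real iostat output.

```python
import io
import re

EOR = object()

class RecordReader(object):
    def __init__(self, fo, sre):
        self.pending = None
        self.mre = re.compile(sre)
        self.fo = fo
        self.eof = False
        self.first = True

    def readline(self):
        if self.pending:
            pl = self.pending
            self.pending = None
            return pl
        line = self.fo.readline()
        if not line:
            self.eof = True
            return line
        if self.mre.match(line) and not self.first:
            self.pending = line
            return EOR
        self.first = False
        return line

def readrecord(reader):
    lns = []
    while True:
        line = reader.readline()
        if not line or line is EOR:
            break
        lns.append(line)
    return lns

# Made-up sample input: two records, each starting with a '== ' line.
sample = io.StringIO("== disk0\nr/s 1.0\nw/s 2.0\n== disk1\nr/s 3.0\n")
reader = RecordReader(sample, r"^== ")

records = []
while not reader.eof:
    records.append(readrecord(reader))

print([len(r) for r in records])   # prints [3, 2]
```

One thing this loop shape implies: on a completely empty file the body still runs once and gets an empty record, so a 'process a record' function should tolerate an empty list.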