My favorite way of marking continued lines
January 16, 2013
One of the things you often want when designing configuration files and little domain-specific languages is some way of splitting a single long logical line into several physical ones. In other words, you want some way of marking line continuations. Over the years people have come up with a huge assortment of ways to do this: you can have a language with explicit terminators and just ignore newlines, you can put backslashes at the end of incomplete lines, and so on.
(Some languages have several different ways of continuing lines, depending on the specific context. Try not to do this in yours if you have a choice.)
As it happens I have a favorite way of doing this and I think it's the best way. It is the 'RFC 822' method (so named because it's how mail headers are handled), where a logical line is continued by indented physical lines. Here is an example:
This is a single
    logical line once
    everything is reassembled
This is a new logical line
The drawback of this approach is that it becomes harder to make indentation significant in your language. I'd argue that this is not an important drawback for configuration files or small DSLs, since you should generally avoid significant indentation anyway because it makes your language's parser (much) harder to write.
The advantage of this approach to me is that it results in continued lines looking right, or at least looking obvious. It's a very common formatting convention to indent continued lines anyways (even or especially when not required by the language), and making the indentation significant for this means that you can't wind up with indented lines that aren't actually continued (because, for example, you accidentally left out a backslash at the end of the previous line).
Sidebar: parsing lines in this approach
I believe that the simplest way to parse the resulting language is in a two-level process. At the first level you read physical lines, strip blank lines and comments, fold multiple physical lines into a single logical line, and deliver that line to the second level. The second level then parses your actual language. This requires a little bit of care in your first level, and you'll need a little pushback stack for lines (since you're going to over-read by one physical line when reading a logical line, and that physical line won't always be something you can just discard).
This is not quite a traditional lexer/parser split because your first level doesn't attempt to break up the logical lines into their components, but I try to avoid writing any sort of actual lexer for configuration files and small DSLs. If your situation is complex enough for a real lexer you probably want to handle the entire process in the lexer.
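To make the two-level process in this sidebar concrete, here is a rough Python sketch of how it might look. Everything specific in it is my assumption for illustration rather than part of the approach itself: the `LineReader` and `parse_config` names, `#` as the comment character, joining folded lines with a single space, and the toy "key value" language at the second level.

```python
import io


class LineReader:
    """First level: reads physical lines, returns folded logical lines."""

    def __init__(self, stream):
        self._stream = stream
        self._pushback = []  # small stack of over-read physical lines

    def _next_physical(self):
        """Return the next significant physical line, or None at EOF."""
        while True:
            if self._pushback:
                line = self._pushback.pop()
            else:
                line = self._stream.readline()
                if not line:
                    return None
            line = line.rstrip()
            # Assumption: blank lines and '#' comment lines are dropped,
            # even when they appear in the middle of a continued line.
            if not line or line.lstrip().startswith("#"):
                continue
            return line

    def next_logical(self):
        """Return the next logical line, or None at EOF."""
        first = self._next_physical()
        if first is None:
            return None
        # Whatever comes first starts a logical line, indented or not.
        parts = [first.strip()]
        while True:
            line = self._next_physical()
            if line is None:
                break
            if not line[0].isspace():
                # We over-read by one physical line: it starts the next
                # logical line, so push it back instead of discarding it.
                self._pushback.append(line)
                break
            parts.append(line.strip())
        return " ".join(parts)


def parse_config(stream):
    """Second level: parse logical lines (a hypothetical 'key value' form)."""
    reader = LineReader(stream)
    settings = {}
    while True:
        logical = reader.next_logical()
        if logical is None:
            return settings
        key, _, value = logical.partition(" ")
        settings[key] = value


text = """\
# a comment
hosts alpha beta
    gamma

    delta
port 25
"""
print(parse_config(io.StringIO(text)))
# {'hosts': 'alpha beta gamma delta', 'port': '25'}
```

The second level never sees indentation, comments, or blank lines, which is what keeps it simple. Whether a blank line should end a continued line instead of being skipped is a design choice; the sketch skips it.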
Written on 16 January 2013.