My favorite way of marking continued lines

January 16, 2013

One of the things you often want when designing configuration files and little domain-specific languages is some way of splitting a single long logical line into several physical ones. In other words, you want some way of marking line continuations. Over the years people have come up with a huge assortment of ways to do this: you can have a language with explicit terminators and just ignore newlines, you can put backslashes at the end of incomplete lines, and so on.

(Some languages have several different ways of continuing lines, depending on the specific context. Try not to do this in yours if you have a choice.)

As it happens I have a favorite way of doing this and I think it's the best way. It is the 'RFC 822' method (so named because it's how mail headers are handled), where a logical line is continued by indented physical lines. Here is an example:

This is a single
     logical line
     once everything is
     reassembled
This is a new logical line

The drawback of this approach is that it becomes harder to make indentation significant in your language for any other purpose. I'd argue that this is not an important drawback for configuration files or small DSLs, since you should avoid making indentation significant in general anyway; it makes your language's parser (much) harder to write.

The advantage of this approach to me is that it results in continued lines looking right, or at least looking obvious. It's a very common formatting convention to indent continued lines anyway (even or especially when not required by the language). Making the indentation significant for this means that you can't wind up with indented lines that aren't actually continued (because, for example, you accidentally left out a \ at the end of the previous line; I've done this more than once in things like Makefiles).

Sidebar: parsing lines in this approach

I believe that the simplest way to parse the resulting language is a two-level process. At the first level you read physical lines, strip blank lines and comments, fold multiple physical lines into a single logical line, and deliver that line to the second level. The second level then parses your actual language. This requires a little bit of care in your first level, and you'll need a little pushback stack for lines (since you're going to over-read by one physical line when reading a logical line, and that physical line won't always be something you can just discard).
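As a rough illustration of the first level, here is a minimal Python sketch. Everything in it is an assumption made for the example rather than something this entry specifies: the names `LineSource`, `read_logical_line` and `logical_lines` are made up, comments are assumed to be whole lines starting with `#`, folded pieces are joined with a single space, and an indented line with nothing before it is simply treated as the start of a logical line.

```python
from typing import Iterable, Iterator, List, Optional


class LineSource:
    """Hands out physical lines, with a small pushback stack."""

    def __init__(self, lines: Iterable[str]) -> None:
        self._lines = iter(lines)
        self._pushed: List[str] = []

    def next_line(self) -> Optional[str]:
        # Pushed-back lines are returned before any new input is read.
        if self._pushed:
            return self._pushed.pop()
        return next(self._lines, None)

    def push_back(self, line: str) -> None:
        self._pushed.append(line)


def is_skippable(line: str) -> bool:
    # Assumption: blank lines and whole-line '#' comments are dropped.
    stripped = line.strip()
    return not stripped or stripped.startswith("#")


def read_logical_line(src: LineSource) -> Optional[str]:
    """Return the next folded logical line, or None at end of input."""
    # Find the physical line that starts the logical line.
    while True:
        line = src.next_line()
        if line is None:
            return None
        if not is_skippable(line):
            break
    parts = [line.strip()]

    # Collect indented continuation lines.
    while True:
        line = src.next_line()
        if line is None:
            break
        if is_skippable(line):
            continue
        if line[0] in " \t":
            parts.append(line.strip())
        else:
            # Over-read by one line: it starts the next logical line.
            src.push_back(line)
            break
    return " ".join(parts)


def logical_lines(lines: Iterable[str]) -> Iterator[str]:
    """First level: turn physical lines into logical lines."""
    src = LineSource(lines)
    while True:
        logical = read_logical_line(src)
        if logical is None:
            return
        yield logical
```

The second level would then loop over `logical_lines(open(path))` (or any other iterable of lines) and parse each result without caring about continuations. In this sketch a single saved line would be enough; the stack is there only to mirror the description above.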

This is not quite a traditional lexer/parser split because your first level doesn't attempt to break up the logical lines into their components, but I try to avoid writing any sort of actual lexer for configuration files and small DSLs. If your situation is complex enough for a real lexer you probably want to handle the entire process in the lexer.


Comments on this page:

From 76.124.106.113 at 2013-01-18 01:14:17:

Comment-folding whitespace? Please, no. No. :( Comment-folding whitespace is the bane of people handling email. Please watch http://youtu.be/JENdgiAPD6c?t=6m25s for an explanation of why this is a bad idea and how it will make your life painful.

And as the video shows, it's not RFC 822-style, it's RFC 724-style. But either way, don't. Please. :(

From 87.79.236.202 at 2013-01-18 01:55:44:

Err, and the comment-folding part in Chris’ whitespace proposal is where exactly?

Aristotle Pagaltzis

By cks at 2013-01-18 02:32:21:

Having watched the headers portion of the video, I don't think that this is a problem. RFC 724/822/etc are more complex than a typical language because they allow nested comments (with some exceptions), but they are otherwise not unusual; you can split logical lines in exactly the same perverse way in any language that allows line continuations and C-like comments in the middle of lines.

(In some languages they are called 'statements' instead of 'logical lines'.)

If you want to parse email headers with regular expressions, you do have problems. But this is true for parsing any complex-grammar language. Simple line continuation rules like mine don't stop you from parsing a simple language with regexps; you just parse the reassembled logical lines instead of the physical lines.
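To make that concrete, here is a small Python sketch under stated assumptions: the `name = value` syntax, the regular expression, and the `fold` and `parse_setting` helpers are all hypothetical, invented for this example. The point is only that the regexp sees whole logical lines and never has to know about continuations.

```python
import re
from typing import List, Tuple

# Hypothetical syntax: one "name = value" setting per logical line.
SETTING_RE = re.compile(r"^(?P<name>[A-Za-z_][A-Za-z0-9_]*)\s*=\s*(?P<value>.*)$")


def fold(physical_lines: List[str]) -> List[str]:
    """Join indented physical lines onto the preceding logical line."""
    logical: List[str] = []
    for line in physical_lines:
        if not line.strip():
            continue  # drop blank lines
        if line[0] in " \t" and logical:
            logical[-1] += " " + line.strip()
        else:
            logical.append(line.strip())
    return logical


def parse_setting(logical_line: str) -> Tuple[str, str]:
    """Parse one reassembled logical line with a plain regexp."""
    m = SETTING_RE.match(logical_line)
    if m is None:
        raise ValueError("not a setting: %r" % logical_line)
    return m.group("name"), m.group("value").strip()


lines = ["hosts = alpha", "    beta", "    gamma", "timeout = 30"]
settings = [parse_setting(l) for l in fold(lines)]
```

Here `settings` ends up holding one `(name, value)` pair per logical line, with the three `hosts` physical lines folded into a single value.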

(I wrote more about some aspects of this in FavoriteLineContinuationII.)

By cks at 2013-01-18 02:43:54:

A quick note:

Err, and the comment-folding part in Chris’ whitespace proposal is where exactly?

If I'm understanding 'comment-folding' correctly, it's implied by my description of the line parsing process (where I mention stripping comments during the reassembly of logical lines). As I made explicit in FavoriteLineContinuationII (and should have covered in this entry), I do allow (and ignore) physical line comment lines in continued logical lines.

From 87.79.236.202 at 2013-01-18 07:43:28:

The disattraction of CFWS is that it allows (nested!) delimited comments anywhere that whitespace is permitted. Among other degenerate cases, this means a line which does not itself look like a comment may nevertheless be part of one.

That is not the case in your proposal. And even if you were to lift the restriction to full-line comments and allow for line-trailing comments, still it wouldn’t be the case.

There is no way to parse CFWS other than with a parser (or some post-regular expression dialect like Perl’s capable of real nested matching). More specifically and worse, you cannot express the folding in terms of a normalising pre-processing step in order to keep the next step simple – you must shove the ugliness into the grammar.

That is not the case in your proposal. (And even if you etc.) In fact your primary stated aim is for this to be possible.

Aristotle Pagaltzis

Written on 16 January 2013.


Last modified: Wed Jan 16 22:53:28 2013