More on my favorite way of marking continued lines
A commentator on my first entry on this correctly noted that I had mis-attributed the RFC that originated this style (I learned it from RFC 822, but it was originally invented in RFC 724). They also had some reactions to my idea, which means that I need to clarify it and add some additional comments. They wrote:
Comment-folding whitespace? Please, no. No. :( Comment-folding whitespace is the bane of people handling email.
In a sense, I entirely agree with this comment. Implementing full RFC 724/RFC 822 style parsing in your language is not what you want to do because it's too complex and perverse (mail headers have some crazy rules). But I was unclear in my original entry, especially about comments.
In a 'leading whitespace continues the logical line' environment, my usual approach to comments is that they occupy whole physical lines (ie you cannot have a line that is part-content and part-comment) and are silently removed in low-level parsing. As an example:
    # this is a comment
    abc # this is not
    this is
      # a comment
      some text
    this is some text
The last two things result in the same logical line ('this is some text') because the (indented) comment line is removed as part of assembling logical lines. There are many equally good variants on comment handling (eg disallow them in continued lines); I just find it convenient to be able to write comments for parts of anything that gets long enough to be split over multiple physical lines.
(As implied by the reassembled line, my approach is to replace all of the trailing whitespace, the newline, and the leading whitespace with a single space.)
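A minimal sketch of this low-level pass, in Python, might look like the following. The function name `logical_lines` is made up for illustration, and silently skipping blank lines is an assumption of the sketch rather than something this approach requires.

```python
def logical_lines(text):
    """Assemble logical lines from physical lines (illustrative sketch).

    Assumed rules:
      * a physical line whose first non-blank character is '#' is a
        comment and is dropped, whether or not it is indented
      * blank lines are dropped (an assumption of this sketch)
      * a physical line that starts with whitespace continues the
        current logical line
    """
    result = []
    current = None  # the logical line being assembled, if any
    for raw in text.splitlines():
        stripped = raw.strip()
        if not stripped or stripped.startswith("#"):
            # Whole-line comments and blank lines vanish here, so they
            # never interrupt a continued line.
            continue
        if raw[0].isspace() and current is not None:
            # Trailing whitespace, newline, and leading whitespace
            # collapse into a single space.
            current = current + " " + stripped
        else:
            if current is not None:
                result.append(current)
            current = stripped
    if current is not None:
        result.append(current)
    return result


sample = (
    "# this is a comment\n"
    "abc # this is not\n"
    "this is\n"
    "  # a comment\n"
    "  some text\n"
    "this is some text\n"
)
for line in logical_lines(sample):
    print(line)
# abc # this is not
# this is some text
# this is some text
```

Because a '#' only counts at the start of a physical line, 'abc # this is not' passes through untouched, and the parser proper never sees comments or continuation whitespace at all.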
As implied by how I prefer to handle comments, this is all designed for simple situations, for configuration files and small DSLs with grammars that are as simple as possible (often simply 'space separated words' with some meaning layered on top). It's my strong belief that all of these languages already want to avoid language features that might make this sort of line continuation a problem (although I'm not sure what they would be). Yes, people can break logical lines up in perverse ways with this, but they can do that with any line continuation scheme (and you still want a line continuation scheme).
(As I have found out the hard way repeatedly, line continuations are something you almost always want to have, much like comments.)
If you're doing this as part of a real lexer and tokenizer, you will have to decide what happens with a single token that gets split over multiple physical lines, such as:
    a = "some
         text"
Because I do this before any tokenization gets its hands on the result, my answer is 'what you see is what you get', ie the language tokenizer and parser get handed 'a = "some text"' and may do with it whatever they wish. This is not necessarily suitable for sophisticated languages, which may sometimes want to retain newlines and leading whitespace as actual elements of eg strings, but as I said this is a design for simple