Wandering Thoughts archives

2006-11-30

How Python parses indentation

One of the interesting things about Python for compiler geeks is that it is a partially implicit context language, in that line indentation is significant. Implicit context languages have been out of favour for a very long time, and most parsing techniques these days are geared towards stream oriented grammars.

(By 'implicit context languages' I mean ones where state changes like entering and exiting blocks are implicit in the difference between lines, such as different indentation levels. By contrast, stream oriented languages use explicit markers for such state changes, like { and } in C.)

Python deals with this in the tokenizer, which transforms changes in indentation level into synthetic INDENT and DEDENT tokens. One consequence of this is that the tokenizer is what enforces the rule that when you dedent you have to return to an existing previous indentation level, not to something that falls between two of them.
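A quick way to watch this happen is the standard library's tokenize module, which exposes a token stream from Python code. The sketch below is only an illustration: the helper names token_names and tokenizes_cleanly and the sample sources are made up for this example, and tokenize is assumed to behave like the interpreter's own tokenizer for these simple cases.

```python
import io
import tokenize

def token_names(source):
    """Return the token type names that tokenize reports for source."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type]
            for tok in tokenize.generate_tokens(readline)]

def tokenizes_cleanly(source):
    """True if the tokenizer accepts source, False if it raises."""
    try:
        token_names(source)
    except (SyntaxError, tokenize.TokenError):
        # IndentationError is a subclass of SyntaxError.
        return False
    return True

# One indented block: the tokenizer synthesizes an INDENT before 'y'
# and a DEDENT before 'z'.
good = "if x:\n    y = 1\nz = 2\n"

# The last line dedents to column 4, which matches neither column 0
# nor column 8, so no previous indentation level fits.
bad = "if x:\n        y = 1\n    z = 2\n"

print(token_names(good))
print(tokenizes_cleanly(bad))
```

The second source is rejected while tokens are being generated, before any grammar rule gets a chance to look at it.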

When I looked at Python's actual grammar (in human-readable form in Grammar/Grammar in the source distribution), I got a surprise: there is almost no mention of INDENT and DEDENT tokens (and NEWLINE). In particular, they're not mentioned in the definitions of potentially multi-line things like lists. It turns out that this is because the tokenizer silently swallows indentation changes and newlines while inside ('s, ['s, and {'s.
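The same tokenize module can be used to check the swallowing. In this sketch (the sample source and the token_names helper are invented for illustration), a list is spread over three oddly indented lines. Note that tokenize reports line breaks inside brackets as NL tokens, which are a separate, non-significant token type that the grammar does not use.

```python
import io
import tokenize

def token_names(source):
    """Return the token type names that tokenize reports for source."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type]
            for tok in tokenize.generate_tokens(readline)]

# A list literal continued over lines with inconsistent indentation.
source = "x = [1,\n        2,\n  3]\n"
names = token_names(source)

print(names)
print("INDENT" in names, "DEDENT" in names)
print(names.count("NEWLINE"), names.count("NL"))
```

Despite the indentation jumping around, no INDENT or DEDENT tokens come out, and the only NEWLINE is the one that ends the whole statement after the closing bracket.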

Note particularly that this applies to all occurrences of '('. The following is not a syntax error:

def foo(a, b,
           c, d,

 e, f):
  pass

(I can't say it's a good idea, though.)

This is not documented in the relevant bit of the Python language reference, so counting on it is unwise.
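For anyone who wants to verify the function definition above without pasting it into a file, a minimal check is to hand the same text to the built-in compile(). The string below reproduces the example's odd indentation and blank line; the "<example>" filename and the namespace dictionary are just placeholders for this sketch.

```python
# The example definition, with its irregular continuation lines
# and the blank line inside the parentheses.
source = (
    "def foo(a, b,\n"
    "           c, d,\n"
    "\n"
    " e, f):\n"
    "  pass\n"
)

# compile() raises SyntaxError if the text is not valid Python.
code = compile(source, "<example>", "exec")

# Run the definition in a scratch namespace and inspect the result.
namespace = {}
exec(code, namespace)
foo = namespace["foo"]
print(foo.__code__.co_varnames)
```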

python/PythonIndentationParsing written at 23:31:24

