How Python parses indentation
One of the interesting things about Python for compiler geeks is that it is a partially implicit context language, in that line indentation is significant. Implicit context languages have been out of favour for a very long time, and most parsing techniques these days are geared towards stream-oriented grammars.
(By 'implicit context languages' I mean ones where state changes like entering and exiting blocks are implicit in the difference between lines, such as different indentation levels. By contrast, stream-oriented languages use explicit markers for such state changes, like { and } in C.)
Python deals with this in the tokenizer, which transforms changes in indentation level into synthetic INDENT and DEDENT tokens. One consequence of this is that the tokenizer is what enforces the rule that when you dedent you have to return to an existing previous indentation level, not something between one and another.
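To see those synthetic tokens, here is a minimal sketch using the standard library's `tokenize` module, which I'm assuming mirrors the interpreter's own tokenizer closely enough for this purpose. The helper names `structural_tokens` and `tokenizes_cleanly` and the sample sources are made up for illustration.

```python
import io
import tokenize


def structural_tokens(source):
    """Return the names of the INDENT and DEDENT tokens in source, in order."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [
        tokenize.tok_name[tok.type]
        for tok in tokens
        if tok.type in (tokenize.INDENT, tokenize.DEDENT)
    ]


def tokenizes_cleanly(source):
    """Return True if the tokenizer accepts source, False if it rejects it."""
    try:
        structural_tokens(source)
    except (SyntaxError, tokenize.TokenError):
        # IndentationError is a subclass of SyntaxError.
        return False
    return True


# Two nested blocks, then a return straight to column 0.
nested = "if a:\n    if b:\n        c = 1\nd = 2\n"

# Indents to column 8, then dedents to column 4, which was never a level.
bad = "if a:\n        b = 1\n    c = 2\n"

print(structural_tokens(nested))
print(tokenizes_cleanly(nested), tokenizes_cleanly(bad))
```

The final line of `nested` produces two DEDENT tokens at once, one per block being closed, while `bad` is rejected during tokenization, before any grammar rule is consulted.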
When I looked at Python's actual grammar (in human-readable form in Grammar/Grammar in the source distribution), I got a surprise: there is almost no mention of INDENT and DEDENT tokens (and NEWLINE). In particular, they're not mentioned in the definitions of potentially multi-line things like lists. It turns out that this is because the tokenizer silently swallows indentation changes and newlines while inside ('s, ['s, and {'s.
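This swallowing can be observed with the standard library's `tokenize` module; the following is a rough sketch, and the helper name `token_names` and the sample list are invented for illustration. One caveat: `tokenize` reports the line breaks inside the brackets as `NL` tokens, which are a separate token type from the `NEWLINE` that the grammar refers to.

```python
import io
import tokenize


def token_names(source):
    """Return the token type names produced for source, in order."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tokenize.tok_name[tok.type] for tok in tokens]


# A list spread over several lines with deliberately ragged indentation.
multi_line_list = "x = [\n        1,\n  2,\n]\n"

names = token_names(multi_line_list)
print(names)
print("INDENT" in names, "DEDENT" in names, names.count("NEWLINE"))
```

Despite the indentation changing on every line, no INDENT or DEDENT tokens appear, and the only NEWLINE is the one that ends the whole statement after the closing bracket.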
Note particularly that this applies to all occurrences of '('.
The following is not a syntax error:

    def foo(a, b,
    c, d,
            e, f):
        pass
(I can't say it's a good idea, though.)
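As a quick check, a function definition whose parameter list is split across raggedly indented lines can be fed to the built-in `compile()`. The particular line breaks in `source` below are my own illustration, not necessarily the layout of the example above.

```python
# A parameter list split across lines with inconsistent indentation.
source = (
    "def foo(a, b,\n"
    "c, d,\n"
    "        e, f):\n"
    "    pass\n"
)

# compile() raises SyntaxError (or IndentationError) on invalid source.
code = compile(source, "<example>", "exec")

namespace = {}
exec(code, namespace)

# The function was defined with all six parameters.
print(namespace["foo"].__code__.co_argcount)
```

The continuation lines sit at columns 0 and 8, neither of which would be a legal indentation level for a statement at that point, yet the definition compiles because the tokenizer ignores indentation until the closing parenthesis.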
This is not documented in the relevant bit of the Python language reference, so counting on it is unwise.