How Python parses indentation

November 30, 2006

One of the interesting things about Python for compiler geeks is that it is a partially implicit context language, in that line indentation is significant. Implicit context languages have been out of favour for a very long time, and most parsing techniques these days are geared towards stream-oriented grammars.

(By 'implicit context languages' I mean ones where state changes like entering and exiting blocks are implicit in the difference between lines, such as different indentation levels. By contrast, stream oriented languages use explicit markers for such state changes, like { and } in C.)

Python deals with this in the tokenizer, which transforms changes in indentation level into synthetic INDENT and DEDENT tokens. One consequence of this is that the tokenizer is what enforces the rule that when you dedent you have to return to an existing previous indentation level, not something between one and another.
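You can watch this happen from Python itself. The following is a minimal sketch that uses the standard library's tokenize module, which is a Python-level view of tokenization rather than the exact tokenizer the parser uses; the token_names helper and the sample sources are made up for illustration.

```python
import io
import tokenize

# Two nested blocks, followed by a jump straight back to column 0.
GOOD = (
    "if x:\n"
    "    if y:\n"
    "        z = 1\n"
    "w = 2\n"
)

# The last line dedents to column 4, which was never an indentation level.
BAD = (
    "if x:\n"
    "        y = 1\n"
    "    z = 2\n"
)

def token_names(source):
    """Return the token type names produced for a source string."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type]
            for tok in tokenize.generate_tokens(readline)]

names = token_names(GOOD)
print(names.count("INDENT"), names.count("DEDENT"))

# Dedenting to a level that is not on the indentation stack is rejected
# during tokenization, before any grammar rule is consulted.
try:
    token_names(BAD)
    rejected = False
except IndentationError as exc:
    rejected = True
    print("rejected:", exc)
```

Note that the single dedent from the innermost block back to column 0 shows up as two DEDENT tokens, one for each block being closed.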

When I looked at Python's actual grammar (in human-readable form in Grammar/Grammar in the source distribution), I got a surprise: there is almost no mention of INDENT and DEDENT tokens (and NEWLINE). In particular, they're not mentioned in the definitions of potentially multi-line things like lists. It turns out that this is because the tokenizer silently swallows indentation changes and newlines while inside ('s, ['s, and {'s.
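Here is a small sketch of that behaviour, again using the standard library's tokenize module (the token_names helper and the sample list are just for illustration). One caveat: the tokenize module reports the line breaks inside brackets as NL tokens, which mark a newline that does not end a logical line; as far as I know these are not tokens the grammar ever deals with.

```python
import io
import tokenize

# The indentation of the continuation lines jumps around, but it all
# happens inside the square brackets.
SOURCE = (
    "items = [\n"
    "        1,\n"
    "    2,\n"
    "]\n"
)

def token_names(source):
    """Return the token type names produced for a source string."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type]
            for tok in tokenize.generate_tokens(readline)]

names = token_names(SOURCE)
print(names)

# No INDENT or DEDENT appears, and only the line break after the closing
# bracket counts as a real NEWLINE.
print("INDENT" in names, "DEDENT" in names, names.count("NEWLINE"))
```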

Note particularly that this applies to all occurrences of '('. The following is not a syntax error:

def foo(a, b,
           c, d,

 e, f):
    pass

(I can't say it's a good idea, though.)
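If you want to check this for yourself without creating a file, a quick sketch is to hand the source to compile() and call the result. The function body and the "<example>" filename here are my additions for the sake of having something runnable.

```python
# The parameter list wanders across several oddly indented lines and
# includes a blank line, all inside the parentheses.
SOURCE = (
    "def foo(a, b,\n"
    "           c, d,\n"
    "\n"
    " e, f):\n"
    "    return (a, b, c, d, e, f)\n"
)

namespace = {}
# compile() would raise SyntaxError (or IndentationError) if the layout
# of the parameter list mattered.
exec(compile(SOURCE, "<example>", "exec"), namespace)

result = namespace["foo"](1, 2, 3, 4, 5, 6)
print(result)
```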

This is not documented in the relevant bit of the Python language reference, so counting on it is unwise.
