Wandering Thoughts archives

2006-11-30

How Python parses indentation

One of the interesting things about Python for compiler geeks is that it is a partially implicit context language, in that line indentation is significant. Implicit context languages have been out of favour for a very long time, and most parsing techniques these days are geared towards stream oriented grammars.

(By 'implicit context languages' I mean ones where state changes like entering and exiting blocks are implicit in the difference between lines, such as different indentation levels. By contrast, stream oriented languages use explicit markers for such state changes, like { and } in C.)

Python deals with this in the tokenizer, which transforms changes in indentation level into synthetic INDENT and DEDENT tokens. One consequence of this is that the tokenizer is what enforces the rule that when you dedent you have to return to an existing previous indentation level, not something between one and another.
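
You can watch this happen from Python itself with the standard tokenize module (a Python-level version of the same tokenizer); here's a quick sketch, with a sample source string made up purely for illustration:

import tokenize
from StringIO import StringIO

src = "if x:\n    y = 1\nz = 2\n"
for tok in tokenize.generate_tokens(StringIO(src).readline):
    # Each token tuple is (type, string, start, end, line).
    print tokenize.tok_name[tok[0]], repr(tok[1])

This prints an INDENT token just before the 'y = 1' line and a DEDENT token just before 'z = 2'.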

When I looked at Python's actual grammar (in human-readable form in Grammar/Grammar in the source distribution), I got a surprise: there is almost no mention of INDENT and DEDENT tokens (and NEWLINE). In particular, they're not mentioned in the definitions of potentially multi-line things like lists. It turns out that this is because the tokenizer silently swallows indentation changes and newlines while inside ('s, ['s, and {'s.

Note particularly that this applies to all occurrences of '('. The following is not a syntax error:

def foo(a, b,
           c, d,

 e, f):
  pass

(I can't say it's a good idea, though.)

This is not documented in the relevant bit of the Python language reference, so counting on it is unwise.
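
You can also see the swallowing with the same tokenize trick; here's a sketch that feeds it a smaller version of the fragment above:

import tokenize
from StringIO import StringIO

src = "def foo(a, b,\n           c, d):\n  pass\n"
for tok in tokenize.generate_tokens(StringIO(src).readline):
    print tokenize.tok_name[tok[0]], repr(tok[1])

The line break inside the parentheses comes out as an NL token (which the parser never sees) instead of a NEWLINE, and the only INDENT and DEDENT are the ones for the pass statement's block; the odd indentation of the 'c, d' line never reaches the parser at all.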

PythonIndentationParsing written at 23:31:24

2006-11-29

Turning off HTTP basic authentication in urllib

Python's urllib module conveniently handles a great many bits of fetching URLs for you, including HTTP basic authentication. Unfortunately it does this by pausing everything to 'ask the user for the required information on the controlling terminal' (to more or less quote from the documentation). This is generally not the most useful behavior in the world, and can even be rather disconcerting.

(A more sensible default behavior would have been to either raise an exception or return a HTTP 'authentication required' status.)

As a result, all of my urllib-using programs start off by neutering this behavior, so that if I ask them to deal with a stray URL that requires HTTP basic authentication they'll just fail. To do this, you need to subclass FancyURLopener and supply your own get_user_passwd routine that does nothing:

from urllib import FancyURLopener
class MyOpener(FancyURLopener):
  def get_user_passwd(self, host, realm, clear_cache=0):
    # Never prompt for credentials; just report that we have none.
    return None, None

This is covered in the urllib documentation, sort of, but not even the urllib docstrings tell you what the return value should be. Apparently you are just supposed to read the source.

(Technically you can supply a do-nothing prompt_user_passwd routine instead, with the same effect, but I prefer to just neuter the whole thing.)
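
Using the MyOpener class from above is then no different from using FancyURLopener directly; a sketch, with a placeholder URL:

opener = MyOpener()
# A stray URL that wants HTTP basic authentication no longer stops the
# program to prompt on the terminal; the request just goes through
# unauthenticated and fails.
resp = opener.open("http://www.example.com/protected/page")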

This is not the only peculiar thing urllib does. For another example, it turns basically every problem into IOError exceptions, generally in blithe disregard of their standard format. (And of course it is now far too late to fix this, because it would break backwards compatibility for everyone who has carefully worked around this.)

There's the urllib2 module as an alternative, but the length of its documentation makes my eyes glaze over. However, on some testing it seems to be reasonably simple to use, and as a bonus it does the right thing with HTTP basic authentication. I suspect that I should start switching my code over, and certainly use it for anything new I do.
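
For comparison, a minimal urllib2 fetch is something like this (again with a placeholder URL):

import urllib2

try:
    resp = urllib2.urlopen("http://www.example.com/protected/page")
    print resp.read()
except urllib2.HTTPError, e:
    # A URL that wants HTTP basic authentication turns into an
    # ordinary HTTP error (a 401) instead of a terminal prompt.
    print "failed:", e.code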

(I still use urllib because I started with Python in the 1.5 days, when there was no urllib2. Technically we still have some machines with 1.5.2, but on those machines we just use a locally built version of 2.3.4.)

DisablingBasicAuth written at 23:03:12

2006-11-09

The importance of printable objects

I have a small defect in the Python code I produce: I rarely bother to make my classes printable or to give them a repr(). Most of the classes will never be printed, and the default repr value is good enough to distinguish two instances from each other.

But this is a mistake, nicely illustrated by my grump about assert's weakness as a debugging tool. Objects having a useful string value makes it much easier to dump out information about the state of things when a problem comes up. You can cope without it, and I usually have, but it means working harder than you should have to.

While the convention of making the repr value something that can be used to reproduce the object is nice, don't let it stop you from having a repr value of some sort. You're not really losing anything when the alternative is a '<foo.bar object at 0xdeadbeef>' thing, although you probably should make sure that you can still tell apart two instances that happen to have identical values.

(You can do without this if your objects have both an equality operator and a hash operator. With just an equality operator, you may someday wind up trying to figure out why an object is not found as the key in a dictionary when you can see that it's right there darnit.)
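
To be concrete, the sort of minimal repr I mean looks something like this; the Host class and its fields are purely made-up illustrations:

class Host(object):
    def __init__(self, name, addr):
        self.name = name
        self.addr = addr

    def __repr__(self):
        # Not something you could eval() to recreate the object, but it
        # says what the object is, and id() still tells apart two
        # instances that happen to have identical values.
        return "<Host %s (%s) at 0x%x>" % (self.name, self.addr, id(self))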

The default repr function for instances is more or less equivalent to:

"<%s.%s at 0x%x>" % (self.__class__.__module__, self.__class__.__name__, id(self))

The one wart is that id() should really return a Python long that's always positive, instead of an integer that's sometimes negative on 32-bit platforms. On 32-bit platforms you can mask id() with 0xffffffff to get the right value and avoid annoying warnings, but of course that hardcoded mask truncates ids on 64-bit platforms.
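
One dodge (my own hack, nothing official) is to build the mask from sys.maxint instead of hardcoding it:

import sys

# sys.maxint is 2**31 - 1 or 2**63 - 1 depending on the platform's
# native long size, so this mask is as wide as id() needs to be on the
# machines I deal with.
_idmask = 2 * sys.maxint + 1

def posid(obj):
    # An always-positive version of id(), possibly as a Python long.
    return id(obj) & _idmask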

PrintImportance written at 00:09:05

2006-11-05

An interesting Python garbage collection bug

Derived from here:

import sys
x = 10
del sys.modules['__main__']
print x

If you execute this at the interpreter prompt, you get a NameError that x is not defined. If you execute it as a file (or if you put the sys.modules and print bit in a function and run the function), you get None printed.
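
For concreteness, the 'in a function' variant I mean is something like this; run it as a file and the print comes out as None:

import sys

x = 10

def zap():
    del sys.modules['__main__']
    # x is still found in the module dictionary, but the module cleanup
    # described below has set its value to None.
    print x

zap()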

The interpreter situation makes a certain amount of sense, once you discover that the interpreter is holding no references to anything from prompt to prompt. The del drops the reference count on the real __main__ to zero, causing it to be garbage collected, and you are left with no namespace at all.

(Technically what happens is that the interpreter recreates a new __main__ at the next line, but it is empty. You can't do much with this, since you can't successfully import __builtins__ to start reconstructing an environment.)

The function case makes sense once you think about cyclic garbage collection. While the executing function has a reference to __main__, __main__ also holds a reference to it, making it a cyclic reference and thus not enough to keep __main__ alive. The peculiar result is because of how CPython cleans up modules that are getting garbage collected; in order to help out destructors, it follows a complicated dance of setting module-level names to None (in two passes, and excluding __builtins__, so that destructors can still get at builtin objects), instead of actually deleting them outright.

I call this a bug because I believe that executing code should be considered to be holding an external reference to its module, and that the interpreter should similarly hold an external reference to __main__ in general. Python may be doing what you told it to here, but it's not anywhere near what I think most people would expect.

InterestingGCBug written at 21:55:08

