2006-11-30
How Python parses indentation
One of the interesting things about Python for compiler geeks is that it is a partially implicit context language, in that line indentation is significant. Implicit context languages have been out of favour for a very long time, and most parsing techniques these days are geared towards stream oriented grammars.
(By 'implicit context languages' I mean ones where state changes like
entering and exiting blocks are implicit in the difference between
lines, such as different indentation levels. By contrast, stream
oriented languages use explicit markers for such state changes, like {
and } in C.)
Python deals with this in the tokenizer, which transforms changes in indentation level into synthetic INDENT and DEDENT tokens. One consequence of this is that the tokenizer is what enforces the rule that when you dedent you have to return to an existing previous indentation level, not something between one and another.
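In a modern Python 3 you can watch both behaviors directly through the standard tokenize module, which mirrors what the C tokenizer does; a quick sketch:

```python
import io
import tokenize

# Indentation changes show up as synthetic INDENT and DEDENT tokens
# in the token stream.
src = "if True:\n    x = 1\ny = 2\n"
names = [tokenize.tok_name[tok.type]
         for tok in tokenize.generate_tokens(io.StringIO(src).readline)]
print(names)

# A dedent that doesn't return to an existing outer indentation level
# is rejected by the tokenizer itself, before the parser sees anything.
bad = "if True:\n        x = 1\n    y = 2\n"
rejected = False
try:
    list(tokenize.generate_tokens(io.StringIO(bad).readline))
except IndentationError as exc:
    rejected = True
    print("tokenizer rejected it:", exc)
```

The token names printed for the first program include one INDENT (entering the if body) and one DEDENT (leaving it).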
When I looked at Python's actual grammar (in human-readable form in
Grammar/Grammar in the source distribution), I got a surprise: there
is almost no mention of INDENT and DEDENT tokens (and NEWLINE). In
particular, they're not mentioned in the definitions of potentially
multi-line things like lists. It turns out that this is because
the tokenizer silently swallows indentation changes and newlines
while inside ('s, ['s, and {'s.
Note particularly that this applies to all occurrences of '('.
The following is not a syntax error:
def foo(a, b,
        c, d,
        e, f):
    pass
(I can't say it's a good idea, though.)
This is not documented in the relevant bit of the Python language reference, so counting on it is unwise.
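The swallowing is visible from the tokenize module too: a physical newline inside unclosed brackets comes back as a non-logical NL token, which the grammar never sees, instead of a real NEWLINE. A sketch:

```python
import io
import tokenize

# The newline inside the parentheses is reported as NL, not NEWLINE,
# so the grammar never has to mention newlines inside lists and calls.
src = "x = (1 +\n     2)\n"
names = [tokenize.tok_name[tok.type]
         for tok in tokenize.generate_tokens(io.StringIO(src).readline)]
print(names)
```

Only the newline that ends the whole logical line is a NEWLINE token; the one in the middle of the parenthesized expression is an NL.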
2006-11-29
Turning off HTTP basic authentication in urllib
Python's urllib module conveniently handles a great many bits of fetching URLs for you, including HTTP basic authentication. Unfortunately it does this by pausing everything to 'ask the user for the required information on the controlling terminal' (to more or less quote from the documentation). This is generally not the most useful behavior in the world, and can even be rather disconcerting.
(A more sensible default behavior would have been to either raise an exception or return a HTTP 'authentication required' status.)
As a result, all of my urllib-using programs start off by neutering this
behavior, so that if I ask them to deal with a stray URL that requires
HTTP basic authentication they'll just fail. To do this, you need to
subclass FancyURLopener and supply your own get_user_passwd routine
that does nothing:
from urllib import FancyURLopener

class MyOpener(FancyURLopener):
    def get_user_passwd(self, h, r, c_c=0):
        return None, None
This is covered in the urllib documentation, sort of, but not even the urllib docstrings tell you what the return value should be. Apparently you are just supposed to read the source.
(Technically you can supply a do-nothing prompt_user_passwd routine
instead, with the same effect, but I prefer to just neuter the whole
thing.)
This is not the only peculiar thing urllib does. For another example, it
turns basically every problem into IOError exceptions, generally in
blithe disregard of their standard format. (And of course it is now far
too late to fix this, because it would break backwards compatibility for
everyone who has carefully worked around this.)
There's the urllib2 module as an alternative, but the length of its documentation makes my eyes glaze over. However, on some testing it seems to be reasonably simple to use, and as a bonus it does the right thing with HTTP basic authentication. I suspect that I should start switching my code over, and certainly use it for anything new I do.
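(For what it's worth, urllib2's interface survives as urllib.request in Python 3, where an unauthenticated 401 simply raises HTTPError instead of prompting. A sketch, with a hypothetical fetch helper:

```python
from urllib.error import HTTPError
from urllib.request import urlopen

def fetch(url):
    # urlopen raises HTTPError for things like a 401, rather than
    # pausing to prompt on the controlling terminal.
    try:
        return urlopen(url).read()
    except HTTPError as exc:
        print("failed with HTTP status", exc.code)
        raise

# HTTPError is a structured exception with a status code attached,
# not a vaguely formatted IOError.
print(issubclass(HTTPError, OSError))
```

You get an exception you can catch and inspect, which is more or less the sensible default behavior I wished for above.)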
(I still use urllib because I started with Python in the 1.5 days, when there was no urllib2. Technically we still have some machines with 1.5.2, but on those machines we just use a locally built version of 2.3.4.)
2006-11-09
The importance of printable objects
I have a small defect in the Python code I produce: I rarely bother to make my classes printable or to give them a repr(). Most of the classes will never be printed, and the default repr value is good enough to distinguish two instances from each other.
But this is a mistake, nicely illustrated by my grump about assert's weakness as a debugging tool. Objects having a useful string value makes it much easier to dump out information about the state of things when a problem comes up. You can cope without it and I usually have, but it's working harder than you should have to.
While the convention of making the repr value something that can be used to reproduce the object is nice, don't let it stop you from having a repr value of some sort. You're not really losing anything when the alternative is a '<foo.bar object at 0xdeadbeef>' thing, although you probably should make sure that you can still tell apart two instances that happen to have identical values.
(You can do without this if your objects have both an equality operator and a hash operator. With just an equality operator, you may someday wind up trying to figure out why an object is not found as the key in a dictionary when you can see that it's right there darnit.)
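A minimal sketch of what I mean, using a hypothetical Host class:

```python
class Host:
    """A hypothetical class with a minimal but useful repr."""

    def __init__(self, name, port):
        self.name = name
        self.port = port

    def __repr__(self):
        # Not something eval() could reconstruct, but far more useful
        # than the default '<foo.bar object at 0x...>'; including id()
        # keeps two equal-valued instances distinguishable.
        return "<Host %s:%d at %#x>" % (self.name, self.port, id(self))

a = Host("www", 80)
b = Host("www", 80)
print(repr(a))
print(repr(a) != repr(b))
```

Two instances with identical values still print differently, because the id() is part of the repr.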
The default repr function for instances is more or less equivalent to:
"<%s.%s at 0x%x>" % (self.__class__.__module__, self.__class__.__name__, id(self))
The one wart is that id() should really return a Python long that's
always positive, instead of an integer that's sometimes negative
on 32-bit platforms. On 32-bit platforms you can mask id() with
0xffffffff to get the right value and avoid annoying warnings, but of
course this blows up on 64-bit platforms.
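One portable way around this (sketched here in modern Python, where id() conveniently already happens to be non-negative) is to derive the mask from sys.maxsize instead of hard-coding it:

```python
import sys

def unsigned_id(obj):
    # 2 * sys.maxsize + 1 is an all-ones mask the width of the
    # platform's native word (0xffffffff on 32-bit builds,
    # 0xffffffffffffffff on 64-bit ones), so this works on both.
    return id(obj) & (2 * sys.maxsize + 1)

print(unsigned_id(object()))
```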
2006-11-05
An interesting Python garbage collection bug
Derived from here:
import sys
x = 10
del sys.modules['__main__']
print x
If you execute this at the interpreter prompt, you get a NameError
that x is not defined. If you execute it as a file (or if you put
the sys.modules and print bit in a function and run the function),
you get None printed.
The interpreter situation makes a certain amount of sense, once you
discover that the interpreter is holding no references to anything
from prompt to prompt. The del drops the reference count on the real
__main__ to zero, causing it to be garbage collected, and you are left
with no namespace at all.
(Technically what happens is that the interpreter recreates a new __main__ at the next line, but it is empty. You can't do much with this, since you can't successfully import __builtins__ to start reconstructing an environment.)
The function case makes sense once you think about cyclic garbage
collection. While the executing function has a reference to __main__,
__main__ also holds a reference to it, making it a cyclic reference and
thus not enough to keep __main__ alive.
The peculiar result is because of how CPython cleans up modules that are
getting garbage collected; in order to help out destructors, it follows
a complicated dance of setting module-level names to None (in two
passes, and excluding __builtins__, so that destructors can still get at
builtin objects), instead of actually deleting them outright.
I call this a bug because I believe that executing code should be considered to be holding an external reference to its module, and that the interpreter should similarly hold an external reference to __main__ in general. Python may be doing what you told it to here, but it's not anywhere near what I think most people would expect.