2005-09-30
Pinging weblogs.com in Python
In the weblogs world, to 'ping' somewhere is to send an automated notice to an indexing site that your blog has updated. Popular indexing sites include weblogs.com and technorati.com. Major blogging packages already do this out of the box, but if you've rolled your own (as I have, with DWiki), you're on your own.
In a modern blog world with millions of blogs and RSS and Atom syndication feeds, something like weblogs.com looks more than a little bit old fashioned. The reason to still care about this stuff is two words: Google Blogsearch.
To quote from their FAQ:
How do I get my blog listed?
If your blog publishes a site feed in any format and automatically pings an updating service (such as Weblogs.com), we should be able to find and list it. [...]
In other words: until Google creates a way to manually add blogs, pinging weblogs.com and similar sites is the best and possibly the only way to get into their index. Certainly my experience is that WanderingThoughts got much more prompt and up to date indexing in Google Blogsearch after I started pinging places. (I believe that your feed needs to be autodiscoverable from your blog's web page for this to work.)
Mechanically, pings are done through XML-RPC. It turns out that Python's xmlrpclib module makes doing XML-RPC calls very easy, once you understand the magic tricks; they turn into function calls on magic objects. Making it easier, all of the various indexing services use the same XML-RPC procedure call, just to different URLs. (XML-RPC calls have two parts: the actual RPC call, with procedure name and arguments and so on, and the target it's directed at.)
So all you actually need is three lines of code you can find here (which is where I got my start from). To save you the small effort of going there, here's some slightly more general code:
import xmlrpclib

bName = "..."
bUrl = "http://..."

def pingEm(url):
    s = xmlrpclib.Server(url)
    s.weblogUpdates.ping(bName, bUrl)

pingEm("http://rpc.weblogs.com/RPC2")
pingEm("http://rpc.technorati.com/rpc/ping")
Fill in appropriate values for bName and bUrl and you're ready
to go.
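If you want to be a bit more defensive about it, here's a sketch of a version that reports failures. (This is a sketch only: the 'flerror' field in the reply is the common weblogUpdates convention, but individual services may vary.)

import xmlrpclib
import socket

def pingChecked(url, name, blogurl):
    # Ping url and report whether the service claimed success.
    s = xmlrpclib.Server(url)
    try:
        r = s.weblogUpdates.ping(name, blogurl)
    except (xmlrpclib.Error, socket.error):
        return False
    # By the usual weblogUpdates convention the reply is a struct
    # whose 'flerror' member is true on failure.
    return not r.get('flerror', True)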
If you want a list of possible places to ping (and more Python code), here is one list that seems to date from 2004. Other big lists of possible places to ping can be found here, here, or here. Apparently, Ping-O-Matic will ping pretty much everything important for you with just one ping from you (you want 'http://rpc.pingomatic.com/' as the XML-RPC target); they're even sort of recommended by Google.
(Disclaimer: I found these lists by using Google Blogsearch and haven't tried any of the listed ping sites myself. I suggest checking out each one's regular site to see if it looks worthwhile to ping them. And my, there are a lot of blog indexing things.)
2005-09-29
Something C programmers should not write in Python code
In C, a common idiom for initializing multiple variables to the same value is serial assignment:
a = b = c = 0;
Python allows for the same thing; just remove the semicolon in the
above and it's valid Python syntax, and it even works. So when I was
starting Python, I wrote things like that. And then one day I wrote
almost the same thing, in an __init__ function in a class:
self.fooDict = self.barDict = {}
The resulting debugging experience was very educational.
A C programmer expects multiple assignment to work using what I'll call 'value semantics', where the result of an assignment is a value and that value is then copied into the next variable. Thus, I'd (subconsciously) expected the Python assignment to be the same as:
self.fooDict = {}
self.barDict = {}
However, Python operates using reference semantics; the result of an assignment is merely another reference to what was assigned, so the next variable just gets another reference to the same thing. In other words, what I was actually getting was the same as:
self.barDict = {}
self.fooDict = self.barDict
Since my code required the two dictionaries to be distinct, and used the same keys in both, the result didn't work too well. (It also caused me to spend some time trying to hunt down where various strange entries were getting added to each dictionary before the penny dropped.)
I was led down this garden path partly because this does 'work' in the case of multiple assignment of a lot of simple Python things (including numbers). This is because they're immutable. If the object one variable points at can't change, it doesn't matter how many other people also point at it; the value can't change out from underneath you.
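Here's a minimal interactive demonstration of the difference (Python 2 syntax, as befits 2005). The dict case aliases, while the integer case stays safe because rebinding one name never touches the other:

a = b = {}
a['x'] = 1
print b          # {'x': 1}: a and b are the same dictionary
print a is b     # True

i = j = 0
i = i + 1        # rebinds i to a new int; j is untouched
print i, j       # 1 0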
2005-09-27
Some hints on debugging memory leaks in Python programs
Programs written in garbage collected languages like Python aren't immune to memory leaks (except in a picky technical sense), just vastly less prone to them. Unfortunately, this rarity can leave you struggling with the problem should it come up.
In languages like C, memory leaks generally happen because you've forgotten about a piece of dynamically allocated memory. In Python, it's the opposite problem: you get memory leaks when you don't forget about objects, when your program holds references to objects that stop them from being garbage collected.
One easy way to see if you have a real Python-level memory leak is to use the code from GetAllObjects to count how many objects are in the system. If this keeps growing, you have a problem. (If your program's total memory usage keeps climbing, you also have a problem but it may or may not be due to a memory leak in your code.)
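As a sketch of what this looks like in practice, assuming the get_all_objects() function from GetAllObjects is available, you can periodically log the object count and watch for unbounded growth:

import time

def watchObjectCount(interval=60):
    # Log the live object count every interval seconds; a count
    # that climbs without bound suggests retained references.
    last = None
    while True:
        cur = len(get_all_objects())
        if last is not None:
            print "live objects:", cur, "delta:", cur - last
        last = cur
        time.sleep(interval)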
Given the nature of memory leaks in GC languages, a good first place to look for retained references is caches (which are there explicitly to remember things). Make sure your caches have aging policies and that they work right, and watch out for caching unexpectedly large objects.
(In long running programs, you need to make sure that long lived objects aren't unnecessarily large and don't hold too many references to other objects. This can call for things like slimmed-down variants of objects, or deliberately destroying some of the references on an object when it's going into long-life mode.)
Another thing to look at is cyclic data structures or groups of data structures with cross-references; they make it much easier to indirectly keep a large data structure live without noticing. A specific case of this is tree-like data structures where the 'flow' of references is bidirectional; for example, a tree where nodes hold references to parents as well as children. In such cases, a live reference to any node can keep the entire tree alive.
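As an illustration (my own sketch, not code from any particular program):

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

root = Node()
child = Node(root)
leaf = Node(child)
del root, child
# 'leaf' alone now keeps all three nodes reachable: leaf.parent is
# the middle node and leaf.parent.parent is the root.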
There are more obscure ways to hold references alive, including:
- a reference cycle involving an object with a __del__ method. As the gc module mentions, Python can't pick an order to destroy things in, so it just stuffs them into a holding list (gc.garbage) for your program to look at.
- as local variables in a function that hasn't exited yet. These references are logically dead (you'll never use them again), but Python doesn't know it yet. del is useful in these cases.
- in threads, especially if you have data structures that are periodically replaced with new versions. This is a more extended version of 'local variables in a not yet exited function'.
- the odd one: bound into function closures, through capturing the outer function's variables at the time the closure was created (see the sketch after this list).
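Here's the promised sketch of the closure case:

def make_callback(big):
    def cb():
        return len(big)     # cb's closure captures big
    return cb

cb = make_callback([0] * 1000000)
# The million-entry list stays alive for as long as cb does, even
# though make_callback() returned long ago.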
Sometimes the memory leaks aren't because you have more objects, but because the objects themselves are growing. One common case is an ever-growing string buffer, partly because strings are one of the few variable-sized non-container objects. Counting objects won't turn this up; to find it you'll need to check the total length of the strings you have.
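A sketch of such a check, again assuming get_all_objects() from GetAllObjects:

def total_string_size():
    # Sum the lengths of all live strings; run this periodically
    # and watch for unbounded growth.
    return sum(len(s) for s in get_all_objects()
               if isinstance(s, str))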
The gc module and the
code from GetAllObjects can be used to browse around your program's
object state to hunt for clues. Obvious starting points are questions
like 'how many objects of class X exist', but you can also do things
like use gc.get_referrers to backtrack from an object that should
now be dead to what is holding it alive. (sys.getrefcount() may also
be useful.)
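As a sketch of this kind of browsing (SomeClass here is a stand-in for whatever class you suspect):

import gc

def instances_of(cls):
    return [o for o in gc.get_objects() if isinstance(o, cls)]

# Typical session: find a suspect object, then walk backwards:
#   objs = instances_of(SomeClass)
#   for ref in gc.get_referrers(objs[0]):
#       print type(ref), repr(ref)[:60]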
Additional resources
- A rough size calculator for module object counts (the 'Rough Size Calculator').
- An introductory article on preventing memory leaks in long-running Zope instances.
- Tracking down memory leaks in Python is dated (as of Python 2.0, cycles don't cause memory leaks) but has good advice on avoiding non-obvious reference cycles.
- Here is a message from comp.lang.python with code to find the names of variables that refer to particular objects, which may be useful in tracking down what exactly points to now-unwanted objects.
The Zope project has a TrackRefs class that is part of their test program, but it apparently requires a debug build of the main Python interpreter. If this sounds interesting to you, visit their SVN repository, navigate to Zope/trunk, and pick up test.py. (I'd give a direct URL, but I'm not sure how to give a stable one into a SVN repository.)
Sidebar: Python before Python 2.0
If you're targeting a version of Python before 2.0, you need to more or less completely avoid circular references. Before 2.0, Python used only reference counting to collect garbage, causing any circular or cyclic references to make all of the objects involved immortal (as their reference counts would never go to zero because of the reference cycle).
Sidebar: the other cause of memory usage growth
The other way your program's memory use can keep growing is if your object usage pattern is fragmenting the interpreter's usage of system memory. One discussion of part of this issue is in this blog entry on Python memory management.
And all of this assumes that you're not having to deal with a compiled extension module that has memory management problems of its own. Some XML modules are apparently well known to leak memory if not used exactly right, and there's always outright bugs.
2005-09-20
Two Python import tricks
Here's an extreme example of import and namespace trickery, drawn
from the standard Python ftplib module:
try:
    import SOCKS
    socket = SOCKS
    del SOCKS
    from socket import getfqdn
    socket.getfqdn = getfqdn
    del getfqdn
except ImportError:
    import socket
(linebreaks added for clarity)
Modules are objects in Python, so a module's name binding can be
manipulated like anything else. This leads to the two things that
ftplib is doing here.
First, the ftplib module would like to use the SOCKS module if it
exists and otherwise use the standard socket module. Rather than
sprinkle conditionals throughout its own code, it just imports SOCKS
and renames it to socket, counting on SOCKS to have all the socket
routines it needs under the same names as socket does.
Second, SOCKS is apparently missing the getfqdn function. So
ftplib yanks a reference to it out of the real socket module and
stuffs it into the SOCKS module (that ftplib knows as 'socket').
This lets its code just call socket.getfqdn without having to
worry about whether socket is the real thing or SOCKS.
This works because of bonus trickery: unlike other operations, both
'import' and 'from <blah> import ...' take the names without
looking them up in the current namespace. So even though the name
'socket' points to the SOCKS module at this point, 'from socket
import getfqdn' goes off to the real socket module, instead of
trying to find getfqdn in the SOCKS module.
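To make the pattern easier to see outside of ftplib, here's a sketch of the same two tricks with a hypothetical fastjson module standing in for SOCKS, and the standard json module standing in for socket:

try:
    import fastjson                 # hypothetical drop-in replacement
    json = fastjson
    del fastjson
    from json import JSONDecoder    # goes to the real json module
    json.JSONDecoder = JSONDecoder  # grafted onto 'json' (fastjson)
    del JSONDecoder
except ImportError:
    import json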
Because modules are global, the side effect of this is that ftplib
has changed SOCKS's namespace for everyone that's imported it. If
another module somewhere in the program has also imported SOCKS,
it can refer to 'SOCKS.getfqdn' and it will work.
(There are actually practical uses of altering other modules' namespaces. I may cover some of them in a later entry.)
2005-09-19
Function definition order in Python
Thomas Boutell has just started serious Python work and wound up noticing (in the middle of here):
What's not so painless is discovering that "function" must be defined like this before I can call it.
Wait a minute. That's pretty standard behavior for C, yeah. But this ain't 1973. Perl and Java are both bright enough to read the rest of your code and find the function, allowing you to place functions in your source code where they feel right to you.
So: why is Python this way? The answer's deeper and more interesting than it might look.
There are three things going on here, and they all make sense when you understand them:
- Functions are found by name lookup, like pretty much everything else in Python. So the name has to be defined by the time something tries to call the function.
- Doing an 'import <module>' executes the file's code; the same thing happens when you do 'python file'. (What's left in the namespace after the code finishes running basically is the module.)
- def is an executable statement; when it runs it creates the name and points it to a function object, which has the function's bytecode and some trimmings. (class is also an executable statement.)
(Technically import only executes the file's code the first time
around. After that, it looks up the module in an internal table and
horks back a reference.)
Understanding the third is actually quite important, because it is how
and why a great many clever tricks with classes work (including
straightforward uses of staticmethod and classmethod decorators).
Consider:
class Foo:
def bar():
return 42
bar = staticmethod(bar)
This works because everything in the class statement is actually
being executed. The def creates and binds bar's value to a
function object, and then the next line rebinds bar's value to a
different object created by staticmethod. When the dust settles, the
class's namespace has only the new bar.
(staticmethod itself is an ordinary function (well, a type); you can
run it outside a class definition if you want to, although the objects
it creates are not really useful outside of classes.)
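A quick check of the result (interactive sketch):

print Foo.bar()      # 42: callable on the class, no instance needed
print Foo().bar()    # 42: and through an instance too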
This is also why function closures work, because they are recreated with the right bound variables every time around. Consider:
def foo(a, b):
def bar(z):
return a + b + z
return bar
Although this doesn't literally reparse bar's source each time,
it does make a new version of bar on each call. (You can see this
by using the dis module
to look at the generated bytecodes.)
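You can also see this directly (another interactive sketch):

f1 = foo(1, 2)
f2 = foo(1, 2)
print f1(10)         # 13: a=1 and b=2 are bound in the closure
print f1 is f2       # False: each call to foo() made a fresh bar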
So Python is not strict 'define before reference', the way a language
like C is, but 'define before use'. You can have function definitions
in any order and in any place, so long as they're all defined before
any executing code tries to use them. However, if you put code at the
top level of a file (where it will get run at import time) the
functions the code uses must be lexically before the code.
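A minimal sketch of 'define before use' in action:

def main():
    return helper()  # fine: helper is looked up when main() runs

def helper():
    return 42

print main()         # works: helper was defined before main() ran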
This approach to function definition is by no means unique to Python. LISP was probably the first language to do it, but you can find this in lots of others too.
Because there's nothing magical about function names, there's a number
of tricks that can be played with them. For example, as we saw with
'class Foo', function names can be rebound to other values. You can
also create new functions just by binding an appropriate value to a
name.
(One use of this is to decide on the fly which version of code will
implement a generic interface; an extreme example is the os module
(look for os.py). os.py is also a good illustration how far you
can go by running code during import.)
(This entry is adapted from my LiveJournal comments on here.)
2005-09-16
Getting a list of all objects in Python
One of the most 'interesting' issues in most garbage collected languages is memory usage analysis: figuring out why your program is using so much memory, and where. Often this winds up enmeshed in tricky issues of object lifetime, retained references, and so on.
One of the first steps in this sort of work is simply figuring out what live objects you have in memory and what they are. Jonathan Ellis recently ran into this issue and wound up asking an interesting question: how do you get a list of all live objects in a Python program?
Fortunately for my ability to look clever on short notice, I had already wrestled with this very question while developing a long running network daemon (our SMTP and NNTP frontend). (Memory usage analysis in Python is a big subject; I hope to write about other facets of it in later entries.)
Part of Python's good introspection support is the
gc module,
which pokes into the internal garbage collector. gc.get_objects()
looks like just what we want, but unfortunately (as Jonathan Ellis
found out)
it doesn't return a complete object list. Particularly, it seems to
skip objects that don't contain other objects (and not all container
objects are on it, either).
(Important update: I was wrong about several things to do with
gc.get_objects, including it not including all container objects.
See GetAllObjectsII for the corrections and qualifications to the above.)
To get a full list, you need to recursively expand the initial
gc.get_objects() list, while keeping track of what objects
you've already expanded to avoid duplicating things referred to from
multiple locations and circular reference loops. To save you the
effort of writing this code, here's my version:
import gc

# Recursively expand slist's objects into olist,
# using seen to track already processed objects.
def _getr(slist, olist, seen):
    for e in slist:
        if id(e) in seen:
            continue
        seen[id(e)] = None
        olist.append(e)
        tl = gc.get_referents(e)
        if tl:
            _getr(tl, olist, seen)

# The public function.
def get_all_objects():
    """Return a list of all live Python objects,
    not including the list itself."""
    gcl = gc.get_objects()
    olist = []
    seen = {}
    # Just in case:
    seen[id(gcl)] = None
    seen[id(olist)] = None
    seen[id(seen)] = None
    # _getr does the real work.
    _getr(gcl, olist, seen)
    return olist
(Disclaimer: this code is not going to be completely accurate in a threaded program unless you figure out how to stop all the other threads from modifying container objects while it runs. Even then it'll just be a snapshot of one moment when it started.)
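One natural first report to run on the result is a tally of live objects by type; here's a sketch (note that in older Pythons, instances of old-style classes all show up under the single type name 'instance'):

def count_by_type():
    counts = {}
    for obj in get_all_objects():
        key = type(obj).__name__
        counts[key] = counts.get(key, 0) + 1
    return counts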
2005-09-13
Concurrency is tricky
Concurrency may or may not be hard (I know people who disagree with me on that). But I am sure that it is tricky. As an illustration, I just fixed a DWiki concurrency bug that I first spotted in MyFirstCommentSpam.
For simplicity, DWiki stores each comment as a file in a directory
hierarchy that mirrors the page's DWiki path; if you comment on the
DWiki page /foo/bar, the comment will be a file in a /foo/bar/
directory (under a separate top-level directory). DWiki makes these
directory hierarchies on demand; the first time someone comments on a
DWiki page, DWiki makes all of the elements of the comment directory
hierarchy that don't already exist.
DWiki does this with code like this:
try:
if not os.path.isdir(loc):
os.makedirs(loc)
except EnvironmentError, e:
raise ... an internal error
(os.makedirs() conveniently makes all of the directories in one
shot, like 'mkdir -p'.)
This is perfectly ordinary code and I didn't think twice about
it. Except that there's a concurrency problem: if two (or more)
comments on the same page are posted at the same time, and this is the
first time the page has been commented on, they can race in this small
section. Both can see no directory, then start os.makedirs(), but
only one will actually make it; the other one will eventually try to
mkdir() a directory that already exists, which is an error.
The truly reliable cure requires much more complicated code, because you cannot just do:
try:
    os.makedirs(loc)
except EnvironmentError, e:
    pass
# ... do things
The problem is that os.makedirs() can fail due to intermediary
directories too. If processes A and B are both trying to make all
directories in /foo/bar/baz/:
- process A makes /foo/.
- process A is preempted.
- process B tries to make /foo/. Because it exists, os.makedirs() errors out.
- process B continues on, assuming that /foo/bar/baz/ now exists.
- incorrectness ensues.
The concurrency problem with the simple solution is not in my code,
it's in how os.makedirs() is implemented. To know about it, I have
to either examine the code or try experiments, and producing
concurrency races on demand is not trivial. (Fortunately, I can
examine the code in this situation.)
Concurrency is tricky because it's easy to overlook cases. And it's not just your code that matters, it's also all the library routines or standard modules that your code depends on. And the authors may not have so much overlooked cases as considered them outside the specification.
Sidebar: the concurrency safe makedirs()
The trick is to modify os.makedirs() slightly to only raise an error
after os.mkdir() if the target directory still isn't there, since
that's the condition that we really care about. The result:
import os

def makedirs(name):
head, tail = os.path.split(name)
if not tail:
head, tail = os.path.split(head)
if head and tail and not os.path.exists(head):
makedirs(head)
try:
os.mkdir(name)
except EnvironmentError:
if not os.path.isdir(name):
raise
2005-09-09
When Python classes are pointless
This LiveJournal posting wound up raising the question of when one could or should avoid using classes to implement something. In Python, my warning signs are some critical mass of:
- lots of NonMethodFunctions on the class.
- either every piece of data is part of the single object, or nothing is.
- only one object of the class is ever created by the program (as distinct from 'only one copy at any given time, but we throw it away and make a new one every so often').
At a certain point, using a class this way is just hiding the fact that you are writing plain functions that manipulate global variables by tarting things up so it is all 'object oriented'. I would rather that code be clear and honest about what's going on by just doing it. Not infrequently the result is also simpler.
Python gets a small out, since objects are the best way to implement structures, and sometimes a global structure simplifies namespace issues and the code. However, Python modules are also good ways to manage namespaces, so it's not too much of an out.
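As an illustrative sketch (the 'config' module and its names here are hypothetical), module-level state and functions give you single-instance behavior without the class ceremony:

# config.py - a hypothetical module standing in for a singleton class.
_options = {}

def set_option(name, value):
    _options[name] = value

def get_option(name, default=None):
    return _options.get(name, default)

Callers just do 'import config' and call config.set_option(); the module itself is the single instance.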
If you are guessing from this that I really dislike the 'Singleton' design pattern, you would be entirely correct.