Wandering Thoughts archives

2013-10-09

An interesting bug with module garbage collection in Python

In response to my entry on what happens when modules are destroyed, @eevee shared the issue that started it all:

@thatcks the confusion arose when the dev did `module_maker().some_function_returning_a_global()` and got None :)

In a subsequent exchange of tweets, we sorted out why this happens. What it boils down to is that a module is not the same as the module's namespace, and functions hold a reference only to the module's namespace, not to the module itself.

(Functions have a __module__ attribute but this is a string, not a reference to the module itself.)

So here's what is going on. When this chunk of code runs, module_maker() loads and returns a module as an anonymous object; the interpreter then uses that anonymous module object to look up the function. Since the function does not hold a reference to the module itself, the module object is unreferenced once the lookup has finished and is thus immediately garbage collected. This garbage collection destroys the contents of the module namespace dictionary, but the dictionary itself is not garbage collected because the function holds a reference to it and the interpreter holds a reference to the function. Then the function's code runs and uses its reference to the dictionary to look up a (module) global, which finds the name bound to None.

(You would get even more comedy if the module function tried to call another module-level function or create an instance of a module-level class; this would produce mysterious "TypeError: 'NoneType' object is not callable" errors, since the appropriate name is now bound to None instead of a callable thing.)
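To make the failure mode concrete, here's a minimal sketch of the pattern. This `module_maker` is my own stand-in, built with `types.ModuleType` and `exec` instead of the normal import machinery, so the module never lands in sys.modules and the caller holds the only reference to it; the names inside it are invented for illustration.

```python
import types


def module_maker():
    # Hypothetical stand-in for a dynamic module loader: builds and
    # returns a module object that is not stored in sys.modules, so
    # the caller holds the only reference to it.
    mod = types.ModuleType("transient")
    exec(
        "A_GLOBAL = 'hello'\n"
        "def some_function_returning_a_global():\n"
        "    return A_GLOBAL\n",
        mod.__dict__,
    )
    return mod


# The module object is unreferenced as soon as the attribute lookup
# finishes.  On CPython before the issue 18214 fix, its namespace dict
# was wiped to None values at that point and this call returned None;
# on fixed versions (3.4 and later) it returns 'hello' as expected.
result = module_maker().some_function_returning_a_global()
print(result)
```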

The workaround is straightforward; you just have to store the module object in a local variable before looking up the function, so that a reference to it persists across the function call and keeps it from being garbage collected.
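A sketch of the workaround, using an improvised in-memory module as a stand-in for whatever module_maker() would really return (all names here are invented):

```python
import types


def module_maker():
    # Improvised loader: returns a module that only the caller references.
    mod = types.ModuleType("transient")
    exec("VALUE = 42\ndef get_value():\n    return VALUE\n", mod.__dict__)
    return mod


# Keep the module alive in a local variable for the duration of the
# call; its namespace cannot be torn down while 'mod' still exists.
mod = module_maker()
result = mod.get_value()
print(result)   # 42 on any CPython version, bug or no bug
del mod         # only now does the module become garbage
```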

The good news is that this weird behavior did wind up being accepted as a Python bug; it's issue 18214 and is fixed in the forthcoming Python 3.4. Given the views of the Python developers, it will probably never be fixed in Python 2 and will thus leave people with years of having to work around it.

(It's hopefully obvious why this is a bug. Given that modules and module namespaces are separate things and that a module's namespace can outlive it for various reasons, a module being garbage collected should not result in its namespace dictionary getting trashed. This sort of systematic destruction of module namespaces should only happen when it's really necessary, namely during interpreter shutdown.)

ModuleGCBug written at 00:23:10

2013-10-07

What happens when CPython deletes a module

The first thing to know about what happens when Python actually deletes and garbage-collects a module is that it doesn't happen very often; in fact, you usually have to go out of your way to observe it. The two ways I know of to force a module to be deleted and garbage collected are to end your program (which causes all modules to eventually go through garbage collection) and to specifically delete all references to a module, including in sys.modules. People don't do the latter very often so mostly this is going to come up at program shutdown.

(Before I started investigating I thought that reloading a module might destroy the old version of it. It turns out that this is not what happens.)

As I mentioned in passing a long time back, CPython actually destroys modules in a complex multi-pass way. First, all module level names that start with a single underscore are set to None. This happens in some random order (actually dictionary traversal order) and if this drops an object's reference count to nothing it will be immediately garbage collected. Second, all names except __builtins__ are set to None (again in the arbitrary dictionary traversal order).

Note that object garbage collections are not deferred until all entries have been set to None; they happen immediately, on the fly. If you peek during this process, for example in a __del__ method on an object being cleaned up, you can see some or all of the module-level variables set to None. Exactly which ones you see set to None is effectively random and likely to vary from run to run.

There are two comments in the source code to sort of explain this. The first one says:

To make the execution order of destructors for global objects a bit more predictable, we first zap all objects whose name starts with a single underscore, before we clear the entire dictionary. We zap them by replacing them with None, rather than deleting them from the dictionary, to avoid rehashing the dictionary (to some extent).

Minimizing the amount of pointless work that's done when modules are destroyed is important because it speeds up the process of exiting CPython. Not deleting names from module dictionaries avoids all sorts of shuffling and rearrangement that would otherwise be necessary, so is vaguely helpful here. Given that the odd two-pass 'destructor' behavior here is not really documented as far as I know, it's probably mostly intended for use in odd situations in the standard library.

The other comment is:

Note: we leave __builtins__ in place, so that destructors of non-global objects defined in this module can still use builtins, in particularly 'None'.

What happens to object destructors in general during the shutdown of CPython is a complicated subject because it depends very much on the order that modules are destroyed in. Having looked at the CPython code involved (and how it differs between Python 2 and Python 3), my opinion is that you don't want to have any important object destructors running at shutdown time. You especially don't want them to be running if they need to finalize some bit of external state (flushing output or whatever) because by the time they start running, the infrastructure they need to do their work may not actually exist any more.

If you are very curious you can watch the module destruction process with 'python -vv' (in both Python 2 and Python 3). Understanding the odd corners of the messages you see will require reading the CPython source code.

(This entire excursion into weirdness was inspired by a couple of tweets by @eevee.)

Sidebar: The easy way to experiment with this

Make yourself a module that has a class with a __del__ method that just prints out appropriate information, set up some module level variables with instances of this class, and then arrange to delete the module. With suitable variables and so on you can clearly watch this happen.
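For instance, the probe module (saved as test.py) might look something like this sketch; the class and variable names are my own invention:

```python
# test.py -- a probe module for watching module destruction.
class Reporter:
    def __init__(self, name):
        self.name = name

    def __del__(self):
        # Runs when the instance is garbage collected; during module
        # destruction this lets you watch the cleanup order.
        print("__del__ of", self.name)


_private = Reporter("_private")   # single-underscore names are zapped first
public = Reporter("public")       # everything else is zapped second
```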

The easy way to delete your test module is:

import sys

import test
del test
del sys.modules["test"]

Under some circumstances you may need to force a garbage collection pass through the gc module.
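Putting the pieces together, here's a self-contained sketch of the whole experiment that builds a throwaway module in memory (with types.ModuleType) instead of from a file; the deaths list and all the names are my own invention. The __del__ methods record into a list held by the main script, so the result doesn't depend on what's left of the dying module's namespace.

```python
import gc
import sys
import types

deaths = []   # appended to by __del__ as instances are destroyed


class Noisy:
    def __init__(self, name):
        self.name = name

    def __del__(self):
        deaths.append(self.name)


# Build a throwaway module holding two Noisy instances and register it.
mod = types.ModuleType("gctest")
mod.a = Noisy("a")
mod._b = Noisy("_b")
sys.modules["gctest"] = mod

# Delete every reference to the module, then force a collection pass
# in case anything is caught up in a reference cycle.
del sys.modules["gctest"]
del mod
gc.collect()

print(sorted(deaths))   # both instances were cleaned up
```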

ModuleDestructionDetails written at 23:16:59

2013-10-06

What reloading a Python module really does

If you're like me, your initial naive idea of what reload() does is that it re-imports the module and then replaces the old module object in sys.modules with the new module object. Except that that can't be right, because that would leave references to the old module object in any other module that had imported the module. So the better but still incorrect vision of reloading is that it re-imports the module as a new module object then overwrites the old module's namespace in place with the new module's namespace (making all references to the module use the new information). But it turns out that this is still wrong, as is hinted in the official documentation for reload().

What seems to really happen is that the new module code is simply executed in the old module's namespace. As the new code runs it defines names or at least new values for names (including for functions and classes, since def and class are actually executable statements) and those new names (or values) overwrite anything that is already there in the module namespace. After this finishes you have basically overwritten the old module namespace in place with all of the new module's names and bindings and so on.

This has two consequences. The first is mentioned in the official documentation: module level names don't get deleted if they're not defined in the new version of the module. This includes module-level functions and classes, not just variables. As a corollary, if you renamed a function or a class between the initial import and your subsequent reload, the reloaded module will have both the old and the new versions.
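A hedged illustration of the stale-name behavior: this sketch writes a scratch module to a temporary directory (the file name and function names are invented), renames the function between versions, and reloads via importlib.reload, the modern spelling of the call.

```python
import importlib
import os
import sys
import tempfile

sys.dont_write_bytecode = True   # avoid stale .pyc confusion in this demo

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "relmod.py")
    with open(path, "w") as f:
        f.write("def old_name():\n    return 'v1'\n")
    sys.path.insert(0, d)
    import relmod

    # Version 2 renames the function; the reload executes it in the old
    # module's namespace, so old_name is never deleted.
    with open(path, "w") as f:
        f.write("def new_name():\n    return 'v2'\n")
    importlib.invalidate_caches()
    relmod = importlib.reload(relmod)
    sys.path.remove(d)

print(relmod.new_name())   # the new definition ran
print(relmod.old_name())   # but the stale old version survives too
```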

The second is that module reloads are not atomic in the face of some errors. If you reload a module and it has an execution error partway through, what you now have is some mix of new module state (everything that ran before the error happened) and old module state (everything after the error point). As before this applies to variables, to any initialization code that the module runs, and to class and function definitions.
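A sketch of this failure mode with an invented scratch module: version 2 rebinds X but blows up before it can rebind Y, leaving the module half updated.

```python
import importlib
import os
import sys
import tempfile

sys.dont_write_bytecode = True   # avoid stale .pyc confusion in this demo

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "halfmod.py")
    with open(path, "w") as f:
        f.write("X = 1\nY = 1\n")
    sys.path.insert(0, d)
    import halfmod

    # Version 2 rebinds X, then raises before it ever reaches Y.
    with open(path, "w") as f:
        f.write("X = 2\nraise RuntimeError('boom')\nY = 2\n")
    importlib.invalidate_caches()
    try:
        importlib.reload(halfmod)
    except RuntimeError:
        pass
    sys.path.remove(d)

print(halfmod.X, halfmod.Y)   # half new, half old state
```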

What I take away from this is that module reloading is not something that I want to ever try to use in live production code, however convenient it might be to have on the fly code updating in a running daemon. There are just too many ways to wind up with an overall program that works now but won't work after a full restart (or that will work differently).

(This behavior is the same in Python 2 and Python 3, although in Python 3 the reload() call is no longer a builtin and is now a module level function in the imp module.)

ReloadRealBehavior written at 21:59:12



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.