Wandering Thoughts archives

2013-10-07

What happens when CPython deletes a module

The first thing to know about what happens when Python actually deletes and garbage-collects a module is that it doesn't happen very often; in fact, you usually have to go out of your way to observe it. The two ways I know of to force a module to be deleted and garbage collected are to end your program (which causes all modules to eventually go through garbage collection) and to specifically delete all references to a module, including in sys.modules. People don't do the latter very often so mostly this is going to come up at program shutdown.

(Before I started investigating I thought that reloading a module might destroy the old version of it. It turns out that this is not what happens.)

As I mentioned in passing a long time back, CPython actually destroys modules in a complex multi-pass way. First, all module level names that start with a single underscore are set to None. This happens in some random order (actually dictionary traversal order) and if this drops an object's reference count to nothing it will be immediately garbage collected. Second, all names except __builtins__ are set to None (again in the arbitrary dictionary traversal order).

Note that object garbage collections are not deferred until all entries have been set to None; they happen immediately, on the fly. If you peek during this process, for example in a __del__ method on an object being cleaned up, you can see some or all of the module-level variables set to None. Which ones have already been set to None at the point you look is effectively arbitrary and likely to vary from run to run.

There are two comments in the source code to sort of explain this. The first one says:

To make the execution order of destructors for global objects a bit more predictable, we first zap all objects whose name starts with a single underscore, before we clear the entire dictionary. We zap them by replacing them with None, rather than deleting them from the dictionary, to avoid rehashing the dictionary (to some extent).

Minimizing the amount of pointless work that's done when modules are destroyed is important because it speeds up the process of exiting CPython. Not deleting names from module dictionaries avoids all sorts of shuffling and rearrangement that would otherwise be necessary, so is vaguely helpful here. Given that the odd two-pass 'destructor' behavior here is not really documented as far as I know, it's probably mostly intended for use in odd situations in the standard library.

The other comment is:

Note: we leave __builtins__ in place, so that destructors of non-global objects defined in this module can still use builtins, in particular 'None'.

What happens to object destructors in general during the shutdown of CPython is a complicated subject because it depends very much on the order that modules are destroyed in. Having looked at the CPython code involved (and how it differs between Python 2 and Python 3), my opinion is that you don't want to have any important object destructors running at shutdown time. You especially don't want them to be running if they need to finalize some bit of external state (flushing output or whatever) because by the time they start running, the infrastructure they need to do their work may not actually exist any more.

If you are very curious you can watch the module destruction process with 'python -vv' (in both Python 2 and Python 3). Understanding the odd corners of the messages you see will require reading the CPython source code.

(This entire excursion into weirdness was inspired by a couple of tweets by @eevee.)

Sidebar: The easy way to experiment with this

Make yourself a module that has a class with a __del__ method that just prints out appropriate information, set up some module level variables with instances of this class, and then arrange to delete the module. With suitable variables and so on you can clearly watch this happen.
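
For example, something along these lines will do (a hypothetical test.py; all of the names here are made up):

# test.py
class Witness(object):
    def __init__(self, name):
        self.name = name
    def __del__(self):
        # peek at the module-level names as the module is torn down
        print("del %s: a=%r _b=%r" % (self.name, a, _b))

a = Witness("a")
_b = Witness("_b")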

The easy way to delete your test module is:

import sys
import test
del test
del sys.modules["test"]

Under some circumstances you may need to force a garbage collection pass through the gc module.
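
In other words, after deleting the references you may need something like:

import gc
gc.collect()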

ModuleDestructionDetails written at 23:16:59; Add Comment

2013-10-06

What reloading a Python module really does

If you're like me, your initial naive idea of what reload() does is that it re-imports the module and then replaces the old module object in sys.modules with the new module object. Except that that can't be right, because that would leave references to the old module object in any other module that had imported the module. So the better but still incorrect vision of reloading is that it re-imports the module as a new module object then overwrites the old module's namespace in place with the new module's namespace (making all references to the module use the new information). But it turns out that this is still wrong, as is hinted in the official documentation for reload().

What really seems to happen is that the new module code is simply executed in the old module's namespace. As the new code runs it defines names, or at least new values for names (including for functions and classes, since def and class are actually executable statements), and those new names (or values) overwrite anything that is already there in the module namespace. After this finishes you have basically overwritten the old module namespace in place with all of the new module's names and bindings and so on.

This has two consequences. The first is mentioned in the official documentation: module level names don't get deleted if they're not defined in the new version of the module. This includes module-level functions and classes, not just variables. As a corollary, if you renamed a function or a class between the initial import and your subsequent reload the reloaded module will have both the old and the new versions.
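
As a concrete illustration, suppose a hypothetical mod.py originally defines frobnicate() and you later edit it to rename the function to frobnicate_all():

import mod               # version 1 of mod.py defines frobnicate()

# ... edit mod.py: rename frobnicate() to frobnicate_all() ...

reload(mod)              # imp.reload(mod) in Python 3

mod.frobnicate_all()     # works: the new definition
mod.frobnicate()         # also still works: the old name was never removed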

The second is that module reloads are not atomic in the face of some errors. If you reload a module and it has an execution error partway through, what you now have is some mix of the new module code (everything that ran before the error happened) and old module code (everything afterwards). As before this applies to variables, to any initialization code that the module runs, and to class and function definitions.

What I take away from this is that module reloading is not something that I want to ever try to use in live production code, however convenient it might be to have on the fly code updating in a running daemon. There are just too many ways to wind up with an overall program that works now but won't work after a full restart (or that will work differently).

(This behavior is the same in Python 2 and Python 3, although in Python 3 the reload() call is no longer a builtin and is now a module level function in the imp module.)

ReloadRealBehavior written at 21:59:12; Add Comment

2013-09-20

Nested conditional expressions in Python (and code golf)

Recently I had an occasion to use a nested (or chained) conditional expression. I haven't used conditional expressions much, so at first I just wrote out what struck me as the obvious way:

res = a.field1 if a.field1 else obj.field2 if obj else None

(The goal is to use a.field1 if it's got a value, obj.field2 if obj is there, and otherwise None.)

Then I paused to ask myself if this was going to have the right grouping of evaluation; testing said that it did, to my pleasant surprise. It's always nice when Python behaves the way I expected it to and my naive code works. That it happens so often is one of the reasons that I like Python so much.
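
A quick way to convince yourself of the grouping is to compare the expression against an explicitly parenthesized version; here's a small self-contained sketch with stand-in objects:

class Thing(object):
    pass

a = Thing(); a.field1 = ""              # falsy, so we fall through
obj = Thing(); obj.field2 = "fallback"

res1 = a.field1 if a.field1 else obj.field2 if obj else None
res2 = a.field1 if a.field1 else (obj.field2 if obj else None)
assert res1 == res2 == "fallback"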

While this nested conditional expression was the obvious way to write the expression (since I was converting it from what would otherwise be a nested if), it's possible to be more clever. The simple way is to get rid of the actual conditional expressions in favour of exploiting how or and and short-circuit and return their operands:

res = a.field1 or (obj and obj.field2) or None

(Given what I'm trying to do here this doesn't suffer from the usual problem of (ab)using and and or this way.)

Of course we can golf this code further:

res = a.field1 or getattr(obj, 'field2', None)

To my mind this is well over the line into excessively clever, partly because it mixes two different ways to refer to fields in the same expression. Even the first condensed version is not something I'm entirely happy with, partly because it's subtly different from the straightforward version using conditional expressions (if obj exists but obj.field2 is falsy, the condensed version gives you None instead of obj.field2's value). So my initial version is going to stay in my code.

(I think I've basically recanted on my views about avoiding conditional expressions in Python by now. Time moves on and I get used to things.)

NestedConditionalExprs written at 22:49:29; Add Comment

2013-09-17

The pain (or annoyance) of deploying a simple WSGI thing

It started on Twitter:

@eevee: it is time once again to set up a small innocuous wsgi thing and i am again reminded that this sucks and i want to fix it so bad

@thatcks: If only deploying a WSGI thing was as easy as PHP. And I wish I was making a joke there.

(Note that @eevee is the author of the infamous rant PHP: a fractal of bad design.)

Some number of people are now thinking that there's no problem here and that WSGI apps are pretty easy to deploy. After all there's lots of WSGI servers, many of them quite good, and it's not like it's hard to hook one up to your main (or frontend) web server. An entry here, an entry there, a new daemon process, and you're done. Maybe you even use Apache's mod_wsgi, which gets it down to a configuration entry (and a server restart, but you probably needed that anyways).

Well, here's the simple PHP deployment process: put a .php file in the appropriate spot in your web server's document root. You're done.

(As noted by @bobpoekert, CGIs also have basically this property.)

Yes, yes, of course there is a great pile of stuff behind the scenes to make that work. And of course it isn't as flexible and as scalable as the full bore WSGI version. But it demonstrates what a simple deployment actually is and frankly a simple basic deployment is all that 99% of all web apps need (the existence proof is all of the basic PHP apps). Even a relatively full-featured WSGI deployment should only require two files and nothing else (one actual .wsgi file and one file to describe things like what URL it connects to), with the pieces to make it work integrated with your web server.
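
For comparison, here's roughly all that the one actual .wsgi file would need to contain (a minimal hello-world sketch):

# hello.wsgi
def application(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return ['Hello, world!\n']   # would need to be a list of bytes in Python 3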

(The actual running of your WSGI app could be in a separate container daemon that's handled by a separate supervision process. That's an implementation detail that you shouldn't have to care about for a simple WSGI deploy process any more than you should have to care about the details of how your web server communicates with your WSGI app.)

As a side note, as a sysadmin I happen to think that standalone daemons are both a pain in the rear and utterly the wrong approach for a scalable deployment of lots of apps with unpredictable load. But that's another blog entry.

WSGIDeploymentPain written at 00:15:20; Add Comment

2013-09-13

Why I think dir() excludes metaclass methods

I was recently reading David Halter's Why Python's dir function is wrong which points out that dir() on classes and types excludes some valid attributes from the result (for example, __bases__). As it happens, I have a theory about why Python behaves this way. The short version is that it is a heuristic to make dir() more useful.

(Note that classes and types are the same thing. From now on I'm going to use 'class' to mean both.)
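
You can see the behavior being talked about directly; here's a quick sketch:

class C(object):
    pass

print(C.__bases__)                   # the attribute clearly exists
print('__bases__' in dir(C))         # False: dir() doesn't report it
print('__bases__' in dir(type(C)))   # True: it comes from the metaclass, type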

When you use dir() on a class there are at least two things you can be interested in, namely what attributes will be visible on an instance of that class (ie, what attributes are defined by the class and its inheritance hierarchy) and what attributes are visible on the class itself (more or less). My feeling is that almost all of the time people use dir() they are more interested in the former question; in fact, I'd expect that many people don't even know that you can have attributes visible on a class but not its instances.

Even ignoring the direct usability issue, doing dir() 'properly' has a couple of other significant drawbacks. First, you lose any good way to find out what attributes will be visible on instances of classes; you'll wind up wanting to add a special flag to dir() to return to the old behavior. Second, the result of dir(anyclass) is likely to be confusing in practice because it will mingle instance-visible and class-only attributes, including reasonably common special methods. Most obviously, every class has a __call__ special method from type() but it can only be called on the class itself.

It's probably worth mentioning the obvious thing, namely that dir() is very old and metaclasses didn't exist back when it was introduced. Effectively this set the behavior of dir() as excluding metaclasses; you can imagine the uproar if dir() had suddenly added a bunch of methods when used on classes (including giving __call__ to everyone) when metaclasses were introduced. This might well have been (correctly) regarded as a significant change from existing behavior.

(This seems especially likely as I believe that there is some code that uses dir() on classes for introspection.)

Also I've played fast and loose with something here, because dir() is actually not the list of what attributes will be visible on anything. dir() is the list of attributes that are defined on particular things, but additional attributes can be materialized in various ways and defined attributes can even be made inaccessible if you try hard enough. This may suggest why any change in dir()'s behavior has not been a high priority for Python developers; in a sense you usually shouldn't be using it in the first place. And (as pointed out on reddit) this dir() behavior is explicitly documented and it does even more magic than this.

PS: for more on metaclasses and many low level details about them, see my index to my metaclass series.

Sidebar: what additional attributes you would see on normal classes

You can get this list in a simple way:

sorted(set(dir(type)) - set(dir(object)))

Rather than put a big chunk of output here I think I'm going to leave it to my readers to run this for themselves if they're interested.

Classes with custom metaclasses would add any additional attributes from those custom metaclasses.
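
For example, an attribute defined on a custom metaclass is accessible on the class but doesn't show up in dir() today (using the Python 2 metaclass spelling; in Python 3 it would be 'class D(metaclass=Meta)'):

class Meta(type):
    def extra(cls):
        return "defined on the metaclass"

class D(object):
    __metaclass__ = Meta

print(D.extra())             # works: looked up via the metaclass
print('extra' in dir(D))     # False: dir() on the class omits it
print('extra' in dir(Meta))  # True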

OnDirAndMetaclasses written at 01:14:39; Add Comment

2013-08-28

I'm vaguely yearning for a simpler framework than Django

I was going to say that this entry was about my disenchantment with Django but that isn't really accurate. Django works fine for what it is (although there are rough edges), it's just that it has a bunch of complexity and a certain amount of magic and (more importantly) it feels like a vast overkill for what I periodically want to do. The drawback of batteries being included is that they weigh a lot.

(In a metaphorical sense. I'm not worried about Django's resource usage.)

In a way, the problem is that there is both too little and too much for a simpler framework to do. To really take a burden off the developer, you need:

  • templating.
  • form display, validation, and redisplay (including things like protections against CSRF).
  • URL routing.
  • some sort of database mapping layer.
  • some way to use database models outside of the web application itself, to enable cron jobs and commands and so on.
  • an optional layer to map database entries into forms and vice versa (which should be security smart).

(In a modern web framework you probably also want support for JSON service endpoints for frontend JavaScript or the like. Arguably that's like forms (and their database mappings), just without the complex HTML in the middle. Unfortunately I have very little idea what this should look like since I have quite little experience with frontend JavaScript.)

Given that this is most of what Django covers, I'm not sure that any framework (or set of framework components) that covers this is really going to be 'simple'. Especially unlikely to be simple is the bridge between database entries and forms, but it's also a very important component if you actually are working with a database.

(I've been thinking about this partly because my Django modularity design puzzle keeps rattling around in the back of my mind. It's quite possible that I won't need to use any of Django's powerful features in the eventual web app, which in theory means I could write it with a simpler framework instead of trying to fight how Django wants me to do everything.)

PS: pragmatically, I should go through eg Eevee's discussion of Python web development frameworks and read the documentation for the various frameworks to see how they cover these areas.

SimplerFrameworkDesire written at 00:38:03; Add Comment

2013-08-10

The importance of names, illustrated through my mistake

A while back I refactored chunks of DWiki's core DWikiText to HTML rendering code to fix a fundamental error that I'd made in the original design. Today I discovered (entirely by accident) that in the process I had accidentally broken most of DWiki's on-disk caching of that rendering. Well, 'broken' is a strong word and not really an adequate description. What I'd done was stop using it.

In the new refactored code there were (and are) three important functions. _render() is the low-level function that does the actual rendering and returns a render results object, _render_cached() layers caching on top of _render(), and finally _render_html() calls _render_cached() and then gives you the full HTML from the resulting RRO.

DWiki generates HTML by expanding templates that wind up invoking high-level HTML generation functions (you can see the huge list here); a number of them take DWikiText and generate various HTML forms of it (with or without the title, only the first paragraph, only the title, and so on). Several of those wanted the full HTML so they used _render_html() and got caching. Others wanted some subset of the HTML and so couldn't use _render_html(). What they should have used was _render_cached(); instead, I coded them all to use _render(). As a consequence none of them used any on-disk caching; every time they were used they re-rendered the wikitext from scratch.

(The really embarrassing one was wikitext:cache, which exists in large part simply to cache stuff.)

I can see more or less how I made the mistake. After all, the name of the function I had them call is exactly what I wanted when I was writing the code (and it even had a nice short and direct name). It's just that it has (or really lacks) some side effects and I completely overlooked those side effects because the name didn't shove them in front of me. In hindsight a much better name for _render() would actually have been _render_uncached(); if it had had that name I would have been directly confronted by what it didn't do any time I put it in code and as a result I wouldn't have done so.

Sometimes I learn my lessons the hard way. I hope that this one sticks. If I'm smart I'll take the time and annoyance to rename the functions to better names, lest I recreate this mistake the next time I revisit this area of DWiki code.

(This bugfix is not yet in the Github version but will be in a day or so, when it doesn't blow up here.)

Sidebar: what did and didn't get cached, and how I missed this initially

When I first did this refactoring everything that Wandering Thoughts used got cached (and I watched the caches get populated when I initially made the change here). But that was because I didn't immediately take advantage of the new features I'd also added. When I revised Atom feeds to omit the title from the generated HTML and changed several aspects of how entries look (eg, adding dates in between the entry title and its main text), I changed to using HTML generators that didn't cache. Of course by then I wasn't looking at the cache because I knew it all worked.

The net result was that the only thing that was using the on-disk cache was looking at Wandering Thoughts through the front page, the sub-pages for categories, or paging back through these indexes. Neither generating Atom feeds nor looking at individual entries was using the on-disk cache.

(It turns out that enough things (I think mostly web spiders) walk through Wandering Thoughts via the index pages that the on-disk cache got populated enough for me not to notice it being unusually empty.)

NameImportance written at 01:58:28; Add Comment

2013-07-31

A Python code structure problem: exception handling with flags

Here's a Python (2.x) code structure puzzle that I don't have a good answer for yet, except that maybe the answer is that my overall design is a bad fit for what I'm doing. To start with, suppose that you have a multi-level, multi-step process of processing lines from an input file. Any number of things can go wrong during the processing; when it does, you need to bubble this information up to the top level but keep on going to process the next line (if only so you can report all of the errors in the file in one pass). The obvious fit for this is to have errors communicated by raising exceptions which are then trapped at the top level.

Now let's suppose there are several different sorts of errors and you want to treat some of them specially based on command line flags. For example normally all errors are fatal and show error messages, but some can be suppressed entirely with a flag (they just cause the record to be silently skipped) and some can be made into warnings. How do you structure this in the code?

My first version looked something like this:

try:
   data = crunch(line)
   ....
except A, e:
   report(e)
   commit = False
except B, e:
   report(e)
   if not options.skipempty:
      commit = False
except C, e:
   if not options.skipdups:
      report(e)
      commit = False

All of the duplication here made me unhappy because it obscured the actual logic and makes it easy for one exception to drift out of sync with the handling for the others. I can aggregate everything together with 'except (A, B, C), e:' but then the question is how to write the single exception handler so that it's both clean and does everything necessary; so far I've thought of two approaches. The first approach is to use isinstance() on e to tell what sort of exception we have and then write out the conditions in if's, except that trying to do that makes for ugly long conditions.

(I started to write out the example above and basically exploded in irritation when I got to the commit logic, which I decided was a bad sign. It also looked like the result would be very hard to read, which means that errors would be easy to add.)

The second solution I've come up with is to add attributes to each exception class, call them report and nocommit. Then at the start of the code we do:

if options.skipempty:
   B.nocommit = False
if options.skipdups:
   C.report = False
   C.nocommit = False

In the main code we do:

try:
   ....
except (A, B, C), e:
   if e.report:
      report(e)
   if e.nocommit:
      commit = False

This avoids both duplication and lack of clarity at the expense of, well, kind of being a hack.

(You can also code a variant of this where report and nocommit are functions that are passed the options object; this puts all of the 'what turns this off' logic into the exceptions themselves instead of reaching into the exception classes to (re)set attributes. That might be somewhat cleaner although it's more verbose.)
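
A sketch of that variant might look something like this (hypothetical code; the exception names and option flags follow the examples above):

class A(Exception):
   def report(self, options):
      return True
   def nocommit(self, options):
      return True

class B(Exception):
   def report(self, options):
      return True
   def nocommit(self, options):
      return not options.skipempty

class C(Exception):
   def report(self, options):
      return not options.skipdups
   def nocommit(self, options):
      return not options.skipdups

The main code then becomes:

try:
   ....
except (A, B, C), e:
   if e.report(options):
      report(e)
   if e.nocommit(options):
      commit = False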

Given that all of the options have drawbacks I feel that there ought to be a clever Python code design trick that I'm missing here.

ExceptionHandlingWithFlags written at 00:38:13; Add Comment

2013-07-23

When Python regexp alternation is faster than separate regexps

In Eli Bendersky's Regex-based lexical analysis in Python and Javascript he wrote:

Shortly after I started using it, it was suggested that combining all the regexes into a single regex with alternation (the | regex syntax) and using named groups to know which one matched would make the lexer faster. [...]

This optimization makes the lexer more than twice as fast! [...]

At first I was going to write an entry explicitly noting that Bendersky had demonstrated that you should use regex alternation instead of multiple regexps. Then I reread my old entries on regexp performance and even reran my test program to see if Python 2 (or 3) had changed the picture and now I need to write a more complicated entry.

I will give you the quick summary: if you are using .match() or some other form of explicitly anchored search, using regexp alternation is faster than separate regular expressions. If you are using an unrestricted .search() the situation is much murkier; it really depends on how many regexps you have and possibly what you're searching for. If you have a decent number of alternates it's probably faster to use real regexp alternation. If you have only two or three it's quite possible that using separate regexps will be a win.

Eli Bendersky's lexer is faster with regexp alternation for both reasons; it uses .match() and has a number of alternatives.
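
To make the difference between the two styles concrete, here's a sketch of what a single .match() step looks like with separate regexps versus one combined regexp using alternation and named groups (the token patterns here are made up, not Eli Bendersky's):

import re

# separate anchored regexps, tried one after another
patterns = [('NUMBER', re.compile(r'\d+')),
            ('NAME',   re.compile(r'[A-Za-z_]\w*')),
            ('OP',     re.compile(r'[-+*/]'))]

def next_token_separate(s, pos):
    for name, rx in patterns:
        m = rx.match(s, pos)
        if m:
            return name, m.group(), m.end()
    return None

# one combined regexp with alternation and named groups
combined = re.compile(r'(?P<NUMBER>\d+)|(?P<NAME>[A-Za-z_]\w*)|(?P<OP>[-+*/])')

def next_token_combined(s, pos):
    m = combined.match(s, pos)
    if m:
        return m.lastgroup, m.group(), m.end()
    return None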

Current Python 2 and 3 regexp performance for .search() appears basically unchanged from my earlier entries (although Python 3.3.0 does noticeably worse than Python 2 on the same machine). In particular the number of regexp alternatives you use continues to have very little to no effect on the performance of the resulting regular expression. There is a performance penalty for .search() with regexp alternation, even if the first alternate matches (and even if it matches at the front of the string).

(It appears that you pay a small penalty for .match() if you use regexp alternation but this penalty is utterly dwarfed by the penalty of repeatedly calling .match() for separate regexps. If you need alternatives with .match() you should use regexp alternation.)

PS: my test program continues to be available if you want to run it in your own Python environment; see the link in this entry. I have to say that preserving test programs is a great idea that I should do much more often.

(Test programs are often ugly quick hacks. But being able to rerun the exact same tests much later can be extremely useful, as is being able to see exactly what the tests were.)

RegexpAlternationWhen written at 00:38:12; Add Comment

2013-07-19

A bit on the performance of lexers in Python

This all starts with Eli Bendersky's Hand-written lexer in Javascript compared to the regex-based ones (via) where he writes in part:

I was expecting the runtime [of the hand written lexer] to be much closer to the single-regex version; in fact I was expecting it to be a bit slower (because most of the regex engine work is done at a lower level). But it turned out to be much faster, more than 2.5x.

In the comments Caleb Spare pointed to Rob Pike's Regular expressions in lexing and parsing which reiterates the arguments for simple lexers that don't use regular expressions. Despite all of this, regular expression based lexers are extremely common in the Python world.

Good lexing and even parsing algorithms are both extremely efficient and very well known (the problem has been studied almost since the start of computer science). A good high performance lexer generally looks at each input character only once and runs a relatively short amount of focused code per character to tokenize the input stream. A good regular expression engine can avoid backtracking but is almost invariably going to run more complex code (and often use more memory) to examine each character. As covered in Russ Cox's series on regular expressions, garden variety regular expression engines in Python, Perl, and several other languages aren't even that efficient and do backtrack (sometimes extensively).
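
To illustrate the sort of lexer being talked about, here's a tiny hand-written one that looks at each input character once (a sketch with made-up token categories; real lexers are larger but have the same shape):

def lex(s):
    tokens = []
    i, n = 0, len(s)
    while i < n:
        ch = s[i]
        if ch.isspace():
            i += 1
        elif ch.isdigit():
            j = i
            while j < n and s[j].isdigit():
                j += 1
            tokens.append(('NUMBER', s[i:j]))
            i = j
        elif ch.isalpha() or ch == '_':
            j = i
            while j < n and (s[j].isalnum() or s[j] == '_'):
                j += 1
            tokens.append(('NAME', s[i:j]))
            i = j
        else:
            tokens.append(('OP', ch))
            i += 1
    return tokens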

So why don't people use hand-written lexers in Python? Because most Python implementations are slow. Your hand-written lexer may examine each character only once with simple code but that efficiency advantage is utterly crushed by the speed difference between the C-based regular expression engine and the slowness of interpreting your Python code. Hand written JavaScript lexers are comparatively fast because modern JavaScript interpreters have devoted a lot of effort to translating straightforward JavaScript code into efficient native code. Since the lexer actually is straightforward, a JIT engine is generally going to do quite well on it.

(Given this I wouldn't be surprised if a hand-written Python lexer that was run under PyPy was quite fast, either competitive with or even faster than a Python regex-based one. Assembling a test case and doing the benchmarking work is left as an exercise.)

PythonLexerPerformance written at 20:31:00; Add Comment

