Wandering Thoughts archives

2012-10-06

Python can execute zip files

One of my long-running little bits of unhappiness is that Python strongly encourages modular programming but makes it awkward to write little programs in a modular way. Modules have to be separate files, and once you have multiple files you have two problems: the main program has to be able to find those modules in order to load them, and you have to distribute multiple files and install them somehow instead of just giving people a self-contained file and telling them 'run this'. I recently found that there is a (hacky) way around this, although it's probably not news to people who are more plugged into Python distribution issues than I am.

The first trick is that Python can 'run' directories. If you have a directory with a file called __main__.py in it and you do 'python <directory>', Python will run __main__.py. Note that __main__.py is run directly, not imported as a module; this has various awkward consequences. Python will also do something similar with 'python -m <module>', but there the module must be on your Python search path and it will be imported before <module>/__main__.py is executed.
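
As a concrete sketch (the directory and file names here are made up for illustration; Python puts the directory at the front of sys.path when it runs it, which is why the plain import works):

  # myprog/__main__.py -- run with 'python myprog'
  import support
  support.main()

  # myprog/support.py
  def main():
      print("hello from a support module")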

The second trick is that Python will import things (ie load code) from zipfiles, basically treating them as (encoded) directories; the exact specifics of this are beyond the scope of this entry (see eg here). As an extension of the first trick, Python will 'run' zipfiles as if they were directories; if you do 'python foo.zip' and foo.zip contains __main__.py, it gets run.
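
As an illustration, here is one way to build such a zipfile with the standard zipfile module; the names match the sketch above and this is just one way to do it:

  # pack __main__.py plus its support module into a runnable zipfile
  import zipfile

  zf = zipfile.ZipFile("myprog.zip", "w")
  zf.write("myprog/__main__.py", "__main__.py")  # second argument is the name inside the zip
  zf.write("myprog/support.py", "support.py")
  zf.close()

  # then: python myprog.zip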

The third trick is that Python is smart enough to do this even when the 'zipfile' has a '#! ....' line at the start. In fact Python is willing to accept quite a lot of things before the actual zipfile; experimentally, it will skip lines that start with '#', blank lines, and lines that only have whitespace. In other words, you can take a zipfile that's got your __main__.py plus associated support modules and put a #!... line on the front to make it a standalone script (at least on Unix).

Since Python supports it, I strongly suggest also adding a second line with a '#' comment explaining what this peculiar thing is. That way people who try to look at your Python program won't get completely confused. Additional information is optional but possibly useful.
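
Putting the pieces together, here is a minimal sketch of wrapping a zipfile up as a standalone script; the shebang line, comment text, and file names are whatever suits your system:

  # glue a '#!' header onto myprog.zip to make a self-contained script
  import os

  header = (b"#!/usr/bin/env python\n"
            b"# This is a Python program packed into a zipfile.\n")

  with open("myprog.zip", "rb") as f:
      data = f.read()
  with open("myprog", "wb") as f:
      f.write(header + data)
  os.chmod("myprog", 0o755)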

(I believe that all of this has been in Python for some time. I've just been slow to discover it, although I vaguely knew that Python could import code from zipfiles.)

Sidebar: zipfiles and byte-compilation

First off, as always (C)Python will only load precompiled .pyc bytecode files when (and if) you import modules. Since your __main__.py is run directly instead of being imported, no bytecode version of it will ever be loaded, so you want to make it as small as possible. Second, Python doesn't modify a zipfile when it imports code from it, which means that if you don't include .pyc files in your zipfile, CPython will recompile all of your code to bytecode every time your program is run.

The solution is straightforward: run your program from its directory once (with some do-nothing arguments) so that CPython writes out the .pyc files, then pack everything (both the .py and the .pyc files) into the zipfile.
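
If you'd rather not rely on a run of the program, the standard compileall module (also available as 'python -m compileall') will generate the .pyc files directly; a sketch, assuming everything lives in the current directory:

  # compile every .py file under . to .pyc before zipping things up
  import compileall
  compileall.compile_dir(".", quiet=True)

(Do this with the same Python version that will run the zipfile, for the reason below.)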

Note that this makes zipfiles somewhat less generic than you might like. CPython bytecode is specific to (roughly) the Python version, so eg Python 2.7 will not load bytecode generated by Python 2.6 and vice versa. Your zipfile program may run unchanged on both, but one may have a startup delay.

python/RunningZipfiles written at 23:44:08

How averages mislead you

To follow up on my illustrated example of this, I wanted to talk about how averages mislead people. They do it in at least two different ways.

The first way that averages mislead is that they smooth out exceptions. The longer the amount of time you average across and the more activity you see, the more an average will hide exceptional activity (or rather, bury it under a mass of normal activity). You generally can't do much about the amount of activity, so if you want to spot exceptions using an average you need to look at your 'average' over very short time intervals. Our recent issue was a great example of this: exceptionally slow disk activity that wasn't really visible in a 60-second average did sometimes jump out in a one-second average. Of course the problem with such fast averages is that they generate a lot of results to go through (and the results are noisy).
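
To make this concrete, here is a sketch with entirely made-up latency numbers: ten minutes of one-second samples where a few seconds are pathologically slow.

  # 600 one-second latency samples (in ms): mostly 5 ms, three 500 ms stalls
  samples = [5.0] * 600
  for i in (113, 114, 350):   # the hypothetical slow seconds
      samples[i] = 500.0

  minute_avgs = [sum(samples[i:i+60]) / 60 for i in range(0, 600, 60)]
  print(max(minute_avgs))     # 21.5 -- 'a bit slow', nothing alarming
  print(max(samples))         # 500.0 -- the actual exception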

It's worth understanding that this is not a problem with averages as such. Since the purpose of averages is to smooth things out, using an average should mean that you don't care about exceptions. If you do care about exceptions you need a different metric; unfortunately, people don't always provide one. The corollary is that if you're designing the statistics that your system will report and you plan to report only averages, you should be really confident that exceptions either won't happen or won't matter. And you're probably wrong about both parts of that.

(Exceptional activity does affect even a long-term average, but it often doesn't affect it enough for things to be obviously wrong. Instead of saying 'this is crazy', you say 'hmm, things are slower than I was expecting'.)

The second way that averages mislead is that they hide the actual distribution of values. The usual assumption with averages is that you have a nice bell-shaped distribution centered around the average, but this is not necessarily the case. All sorts of distributions will give you exactly the same average and they have very different implications for how your system works. A disk IO system with a normal distribution centered on the average value is likely to feel very different from a disk IO system that has, say, two normal distributions superimposed on top of each other, one significantly faster than the average and one significantly slower.

(This is where my ignorance of most of statistics kicks in, because I don't know if there are some simple metrics that will give you a sense of what the actual distribution is, or if you really need to plot the distribution somehow and take a look at it.)
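
As an illustration of how different the hidden shapes can be, here is a sketch with synthetic numbers: a unimodal and a bimodal latency distribution with the same average, where even a few crude percentiles immediately pull them apart.

  import random
  random.seed(1)

  # unimodal: everything clustered around 30 ms
  uni = [random.gauss(30, 3) for _ in range(10000)]
  # bimodal: half fast (~5 ms), half slow (~55 ms); the average is still ~30 ms
  bi = [random.gauss(5, 1) if random.random() < 0.5 else random.gauss(55, 1)
        for _ in range(10000)]

  print(sum(uni) / len(uni), sum(bi) / len(bi))  # both roughly 30

  su, sb = sorted(uni), sorted(bi)
  for p in (10, 50, 90):                         # crude percentiles
      i = p * len(su) // 100
      print(p, round(su[i], 1), round(sb[i], 1))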

My illustrated example involved both ways. The so-so looking average was hiding significant exceptions and the exceptions were not random outliers; instead they were part of a distinct distribution. In the end it turned out that what looked like one distribution was in fact two distinct distributions stacked on top of each other, but that's another entry.

tech/MisleadingAveragesII written at 02:15:33

