2012-10-06
Python can execute zip files
One of my long-running little bits of unhappiness is that Python strongly encourages modular programming but makes it awkward to write little programs in a modular way. Modules have to be separate files, and once you have multiple files you have two problems: the main program has to be able to find those modules to load them, and you have to distribute multiple files and install them somehow instead of just giving people a self-contained file and telling them 'run this'. I recently found that there is a (hacky) way around this, although it's probably not news to people who are more plugged into Python distribution issues than I am.
The first trick is that Python can 'run' directories. If you have a directory with a file called __main__.py and you do 'python <directory>', Python will run __main__.py. Note that it does so directly, without importing the module; this has various awkward consequences. It will also do something similar to this with 'python -m <module>', but there the module must be on your Python search path and it will be imported before <module>/__main__.py is executed.
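As a concrete sketch (the names here are made up), a runnable directory might look like this:

    # layout:
    #   myprog/
    #       __main__.py
    #       support.py
    #
    # myprog/__main__.py:
    import support               # the directory is put on sys.path, so this works

    print("starting up")
    support.do_work()            # do_work() is a hypothetical function in support.py

    # run it with: python myprog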
The second trick is that Python will import things (ie load code) from zipfiles, basically treating them as (encoded) directories; the exact specifics of this are beyond the scope of this entry (see eg here). As an extension of the first trick, Python will 'run' zipfiles as if they were directories; if you do 'python foo.zip' and foo.zip contains __main__.py, it gets run.
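As a sketch (again with made-up file names), you can build such a zipfile with the standard zipfile module:

    import zipfile

    zf = zipfile.ZipFile("foo.zip", "w")
    zf.write("__main__.py")     # the entry point that Python will run
    zf.write("support.py")      # plus whatever support modules you have
    zf.close()

    # then: python foo.zip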
The third trick is that Python is smart enough to do this even when the 'zipfile' has a '#! ....' line at the start. In fact Python is willing to accept quite a lot of things before the actual zipfile; experimentally, it will skip lines that start with '#', blank lines, and lines that only have whitespace. In other words, you can take a zipfile that's got your __main__.py plus associated support modules and put a #!... line on the front to make it a standalone script (at least on Unix).
Since Python supports it, I strongly suggest also adding a second line with a '#' comment explaining what this peculiar thing is. That way people who try to look at your Python program won't get completely confused. Additional information is optional but possibly useful.
(I believe that all of this has been in Python for some time. I've just been slow to discover it, although I vaguely knew that Python could import code from zipfiles.)
Sidebar: zipfiles and byte-compilation
First off, as always (C)Python will only load .pyc precompiled bytecode files when (and if) you import modules. Your __main__.py will not have any bytecode version loaded, so you want to make it as small as possible. Second, Python doesn't modify a zipfile when it imports code from it, which means that if you don't include .pyc files in your zipfile, CPython will compile all your code to bytecode every time your program is run.
The solution is straightforward: run your program from its directory once (with some do-nothing arguments) before packing everything into a zipfile, so that the .pyc files get generated and then included alongside the source.
Note that this makes zipfiles somewhat less generic than you might like. CPython bytecode is specific to (roughly) the Python version, so eg Python 2.7 will not load bytecode generated by Python 2.6 and vice versa. Your zipfile program may run unchanged on both, but one may have a startup delay.
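A sketch of an alternative: instead of running the program once, you can byte-compile things explicitly with the stdlib compileall module and then pack both the .py and .pyc files. This assumes Python 2.x, where the compiled files land right next to their .py files:

    # byte-compile everything in the current directory first:
    #   python -m compileall .
    # then put both the source and the bytecode into the zipfile:
    import zipfile, glob

    zf = zipfile.ZipFile("foo.zip", "w")
    for name in glob.glob("*.py") + glob.glob("*.pyc"):
        zf.write(name)
    zf.close()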
How averages mislead you
To follow up on my illustrated example of this, I wanted to talk about how averages mislead people. They do it in at least two different ways.
The first way that averages mislead is that they smooth out exceptions. The longer the amount of time you average across and the more activity you see, the more an average will hide exceptional activity (well, bury it under a mass of normal activity). You generally can't do very much about the amount of activity, so if you want to spot exceptions using an average you need to look at your 'average' over very short time intervals. Our recent issue was a great example of this: exceptionally slow disk activity that wasn't really visible in a 60-second average did sometimes jump out in a one-second average. Of course the problem with fast averages is that you then generate a lot of results to go through (and they're also noisier).
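To make this concrete, here's a toy illustration with invented numbers; one two-second stall in a minute of otherwise fast IO barely moves the 60-second average:

    # 59 one-second samples of 10 ms IO times plus one sample of 2000 ms.
    samples = [10.0] * 59 + [2000.0]

    avg_60s = sum(samples) / len(samples)
    print("60-second average: %.1f ms" % avg_60s)             # about 43 ms: unremarkable
    print("worst one-second sample: %.1f ms" % max(samples))  # 2000 ms: obviously wrong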
It's worth understanding that this is not a problem with averages as such. Since the purpose of averages is to smooth things out, using an average should mean that you don't care about exceptions. If you do care about exceptions you need a different metric; unfortunately people don't always provide one, which is a problem. The corollary is that if you're designing the statistics that your system will report and you plan to only report averages, you should be really confident that exceptions either won't happen or won't matter. And you're probably wrong about both parts of that.
(Exceptional activity does affect even a long-term average, but it often doesn't affect it enough for things to be obviously wrong. Instead of saying 'this is crazy', you say 'hmm, things are slower than I was expecting'.)
The second way that averages mislead is that they hide the actual distribution of values. The usual assumption with averages is that you have a nice bell-shaped distribution centered around the average, but this is not necessarily the case. All sorts of distributions will give you exactly the same average and they have very different implications for how your system works. A disk IO system with a normal distribution centered on the average value is likely to feel very different from a disk IO system that has, say, two normal distributions superimposed on top of each other, one significantly faster than the average and one significantly slower.
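As a sketch of this (with purely made-up numbers), here are two sets of response times with essentially the same average but very different shapes:

    import random
    random.seed(1)

    # a single bell curve centered on 50 ms
    unimodal = [random.gauss(50, 5) for _ in range(10000)]
    # a mixture: half fast (around 10 ms) and half slow (around 90 ms)
    bimodal = ([random.gauss(10, 2) for _ in range(5000)] +
               [random.gauss(90, 2) for _ in range(5000)])

    print("unimodal mean: %.1f ms" % (sum(unimodal) / len(unimodal)))
    print("bimodal mean:  %.1f ms" % (sum(bimodal) / len(bimodal)))
    # both averages come out around 50 ms, but almost no sample in the
    # second set is actually anywhere near 50 ms.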
(This is where my ignorance of most of statistics kicks in, because I don't know if there are some simple metrics that will give you a sense of what the actual distribution is or if you really need to plot the distribution somehow and take a look at it.)
My illustrated example involved both ways. The so-so looking average was hiding significant exceptions and the exceptions were not random outliers; instead they were part of a distinct distribution. In the end it turned out that what looked like one distribution was in fact two distinct distributions stacked on top of each other, but that's another entry.