2005-12-31
A logical consequence of def being an executable statement
I've mentioned before that in Python, def is actually an executable
statement (in FunctionDefinitionOrder). A logical consequence of this
is that default values for function arguments are evaluated only once,
when the def runs.
I say this because expressions generally get evaluated when any
Python statement runs, so the expressions in things like 'def
foobar(a, b=greeble(c), d=None):' are no exception. The
exception would be if they were not evaluated then and were instead
preserved as little lambda expressions to be evaluated later.
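To make this concrete, here is a minimal sketch of the consequence (the timestamp is just an arbitrary expression with a visible side effect):

import time

# time.time() runs once, when the def statement is executed, so every
# call that omits 'when' sees the same default value.
def report(msg, when = time.time()):
    print msg, when

report("first")
report("second")    # prints the same timestamp as the first call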
On an interesting side note, setting default values for arguments is
one of the two places in Python where the same variable name can be in
two different scopes simultaneously; the other is invoking a function
with keyword arguments. Everywhere else you write 'a=a' the two
a's are the same, but in these two cases the a being assigned to
is in the new function's scope and the expression's a is in your
current scope.
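A quick sketch of the side note, with the name a deliberately reused:

a = 10
def frob(a = a):
    # inside frob, a is the parameter, which shadows the module-level a
    return a

frob()          # returns 10, the module-level value captured at def time
frob(a = 20)    # the keyword a here is frob's parameter, not the module's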
The result can be a little bit confusing, as you can see in StructsWithDefaults. (Which is one reason I like to avoid it.)
Sidebar: mutable default arguments
This means that mutable default arguments are usually not what you want, because if you change them they will stay mutated in subsequent invocations of the function. The usual pattern around this is something like:
def foobar(a, deflst=None):
    if deflst is None:
        deflst = []
    ....
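For contrast, a minimal sketch of what happens without that guard:

def buggy(a, deflst = []):
    deflst.append(a)
    return deflst

print buggy(1)    # [1]
print buggy(2)    # [1, 2] -- the same list object, still mutated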
2005-12-30
A Python surprise: the consequences of variable scope
The comment on my AClosureConfusion entry brought up a little Python surprise that I had known about in the back of my mind but never thought about fully before: the consequences of Python's scoping rules for variables.
Ignoring closures for the moment, Python only has two scopes for variables: global to the module and local to the entire function. (Closures introduce additional scopes for the variables of the 'outer' functions.)
Well, sure, you know that. But the consequence is that any variable
in a function is 'live' for the entire function, including variables
used only as the index in for loops and variables used only for
elements in list comprehensions. So when you write:
newlst = [x for x in lst if foobar(x)]
for the rest of the function (or the entire module) you will have an
x variable (in this case with the value of the last element of lst
at the time of the list comprehension).
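As a minimal sketch of the leak (the helper function here is made up):

def positives(lst):
    newlst = [x for x in lst if x > 0]
    # x is still bound here, to the last element of lst
    print x
    return newlst

positives([3, -1, 7])    # prints 7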
This is a little bit surprising, at least for me, because
intellectually I usually consider such variables dead the moment the
list comprehension or for loop is done. For example, I don't read
their value after that point.
In some languages, index variables really are dead after the loop finishes; references to them outside the loop will get some sort of 'no such variable' error. In some other languages, such as Perl, this sort of scope restriction is optional but is the recommended style.
2005-12-26
Thinking about a closure confusion
Consider the following Python code, which is a very simplified version of certain sorts of real code that people write:
ll = []
for i in range(1,10):
    ll.append(lambda z: i+z)
print ll[0](0), ll[8](0)
A lot of people writing Python code like this for the first time expect to see it print '1 9', when in fact it prints '9 9'.
What I think is going on here is that people are thinking of closures
as doing 'value capture' instead of what I will call 'slot capture'.
In value capture the closure would capture i's current value, and
things would work right. In 'slot capture' the closure captures i's
slot in the stack frame and uses it to fish out the actual value when
the closure runs. Since i always uses the same slot on every go
through the for loop, every lambda captures the same slot
and thus afterwards will evaluate to the same thing.
Slot capture is harder to think about because you have to know more about language implementation; in this case, you need to know what does and doesn't create a new, unique stack frame. For example, this slightly more verbose version of the code does work right:
def make(i):
    return lambda z: i+z

ll = []
for i in range(1,10):
    ll.append(make(i))
Here the make function is needed for nothing more than magically
forcing the creation of a new unique stack frame, with the net effect
of capturing the value of each i in the lambdas. Is it any wonder
that people scratch their heads and get this wrong every so often?
You can think about this not as stack frames but as scopes. This may
make the make() example clearer: functions have a different scope
from their callers, but the inside of a loop is in the same scope as
outside it. (There are some languages where this is not true, so you
can define variables inside a loop that aren't visible after it
finishes. Even then, you may or may not get a new stack frame every
time through the loop. Aren't closures fun?)
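As a side note, another common workaround (one I haven't discussed here) leans on default argument values being evaluated when the lambda expression runs, which gives you value capture without a helper function:

ll = []
for i in range(1,10):
    # i=i is evaluated now, so each lambda gets its own captured value
    ll.append(lambda z, i=i: i+z)
print ll[0](0), ll[8](0)    # prints '1 9'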
This sort of closure confusion is not restricted to Python; here is an example of the same issue coming up in Javascript, in a real version of my Python example.
Scope rules can get quite interesting and complicated, and of course they interact with closures in fun ways. For example, Javascript Closures has a long writeup of the Javascript scope rules, which are somewhat more exciting than the Python ones. (It also has nice examples of the (abstract) implementation details.)
2005-12-22
What I really want is error-shielding interfaces
Recently (for my version of 'recently'), the blogosphere had a little tiff about 'humane' versus 'minimalist' APIs, starting with Martin Fowler's article and continuing onwards (there's a roundup here). To caricature the positions, the minimalist side feels that APIs should have only the basic building blocks, and the humane side feels that APIs should have all of the commonly used operations.
(Part of the fun of the whole exchange is that it got framed as a Ruby versus Java issue, due to the examples picked.)
I come down somewhere in between, because I am nervous about the large APIs that humane design creates but I think that minimalist APIs offload too much work to the programmer. What I like is APIs with basic building blocks and routines to do the common operations that are easy to get wrong. Unfortunately I have no snappier name for this approach than 'error-shielding interfaces'.
For example, consider getting the last element of a list. Something like
'aList.get(aList.size - 1)' contains enough code that there is
room for error, so I much prefer either 'aList.last' or
'aList.get(-1)'. As a bonus, they clearly communicate your intent to
people reading your code without needing them to decode the effects of
the expression. (I have a vague preference for the aList.get(-1)
approach, because it clearly generalizes to the Nth-last element.)
Similarly, I think that Python's .startswith() and .endswith()
string methods are great additions to the API. They're common enough,
and they make sure no programmer will ever write a stupid off-by-one
error in the equivalent Python code. (I've written it. And eyeballed
it carefully to make sure I got it right.)
In Python, there's another reason to add common operations to
extension module APIs: an extension module can often implement a
common operation significantly more efficiently than Python code
can. For example, the Python equivalent of .endswith() pretty much
has to make a temporary string object.
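To illustrate both points, here is a plausible pure-Python equivalent of .endswith(); this is a sketch, not the actual implementation:

def endswith(s, suffix):
    # the slice is the temporary string object, and the index
    # arithmetic is where off-by-one errors like to creep in
    return s[len(s) - len(suffix):] == suffix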
(There's also Ian Bicking's contribution, more or less on interface design in general here, which is well worth reading and thinking about.)
2005-12-19
Initializing Python struct objects with optional defaults
Recently I was writing code to register names and their
attributes. There were enough attributes that I didn't want to specify
all of them all of the time, so I did the obvious Python thing: I made
the register() function take a bunch of keyword arguments that had
default values. The attributes are stored in a
struct object, because I wanted an
attrs.attribute syntax for accessing them.
The straightforward way to initialize the struct object was to write
'vi = ViewInfo(factory = factory, onDir = onDir, ...)', but that
sort of repetition is annoying, especially when I had a perfectly good
set of name/value pairs in the form of register()'s arguments. If
only I could get at them.
It turns out that you can use the locals() dictionary for this, if
you use it before you set any local variables in the function. So:
class ViewInfo(Struct):
    pass

def register(name, factory, onDir = False, \
             onFile = True, ...):
    vi = ViewInfo(**locals())
    view_dict[name] = vi
(I did not strictly need the name attribute in the ViewInfo data,
but it doesn't do any harm and it meant I could use locals()
straight.)
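A quick usage sketch (the view name and factory function here are made up):

register("html", make_html_view, onDir = True)
vi = view_dict["html"]
print vi.name, vi.onDir, vi.onFile    # -> html True True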
A similar pattern can be done directly in a struct class as:
class ViewInfo:
    def __init__(self, name, factory, \
                 onDir = False, ...):
        for k, v in locals().items():
            if k != "self":
                setattr(self, k, v)
(You really want to exclude self, since circular references make the
garbage collector work harder than necessary.)
2005-12-18
Emulating C structs in Python
One of the few data types from C that I miss when writing Python code
is structs. The simplest replacement is dictionaries, but that means
you have to write thing['field'] instead of thing.field. I can't
stand that (it's the extra characters).
If you want thing.field syntax in Python, you need an object. The
simplest C struct emulation is just to use a blank object and
set fields on it:
class MyStruct: pass

ms = MyStruct()
ms.foo = 10
ms.bar = "abc"
Some people will say that this is an abuse of objects, since they don't have any code, just data. I say to heck with such people; sometimes all I want is data.
(Avoid the temptation to just use 'ms = object()'; it doesn't actually
work, since instances of the base object type won't accept new attributes,
and it would also hurt your ability to tell different types of structs
apart via introspection.)
Initializing things this way is tedious, though. We can do it more easily and compactly by using keyword arguments when we create the object, with a little help from the class. Like so:
class Struct:
    def __init__(self, **kwargs):
        for k, v in kwargs.items():
            setattr(self, k, v)

class MyStruct(Struct):
    pass

ms = MyStruct(foo = 10, bar = "abc")
(And look, now our objects have some code.)
It's possible to write the __init__ function as
'self.__dict__.update(kwargs)', but that is fishing a little
too much into the implementation of objects for me. I would rather
use the explicit setattr loop just to be clear about what's going on.
(I am absolutely sure people have been using this idiom for years before I got here.)
Sidebar: dealing with packed binary data
If you need to deal with packed binary data in Python, you want the struct module.
This is a much better tool than C has, because structs are not good
for this (contrary to what some people think); structs do not
actually fully specify the memory layout. C compilers are free to
insert padding to make field access more efficient, which makes
struct memory layout machine and compiler dependent.
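As a quick sketch of what I mean (the format string is just an example):

import struct

# '<' asks for little-endian with no padding, so the layout is fully
# specified: a 2-byte unsigned short followed by a 4-byte unsigned int.
data = struct.pack("<HI", 3, 0x12345678)
length, value = struct.unpack("<HI", data)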
(I sometimes find it ironic that supposedly 'high level' languages like Python and Perl have better tools to deal with binary structures than 'low level' C.)
2005-12-15
Another introspection trick
Here's another example of Python's introspection and command interpreter being useful:
[x for x in dir(m) if isinstance(getattr(m, x), str) and 'localhost' in getattr(m, x)]
One of our Mailman lists had been accidentally set up thinking that
the machine's name was 'localhost', instead of the machine's actual
hostname, and this was causing problems. Mailman is written in Python
and offers access to the internals of list data via an interactive
Python interpreter (through the withlist program). This one-off bit of
introspection was basically a grep over all of the list's attributes.
I won't claim that this would be good style in a program, but as something I typed at the Python interpreter's command line it was very handy. In this, it's like shells and shell scripts; we do things on the command line that we'd never do in shell scripts.
(The isinstance() check is necessary to keep the next clause from
potentially throwing an exception on non-string attributes, which would
abort the entire list comprehension.)
2005-12-13
What Python threads are good for
Because of the sometimes much-maligned Global Interpreter Lock, pure Python code itself can't run simultaneously on multiple CPUs. So what should you use Python threads for?
The real use for Python threads is turning synchronous functions in
extension modules into asynchronous things that don't delay your main
program. Often these functions have no asynchronous equivalents (unlike
network IO), so it is either use threads or have your main program
delayed. This works for sufficiently compute-intensive functions as well
as functions, like socket.gethostbyname, that have to wait on outside
things.
Python threads are not a good way to do asynchronous network IO,
because it's inefficient overkill; use either select() or poll()
from the
select module
instead (along with non-blocking sockets and so on). If you need a
canned solution for this, consider
Twisted,
or asyncore and
asynchat from the
standard library.
Note that threads are the only way to make gethostbyname() and
gethostbyaddr() asynchronous, because they don't necessarily just do
DNS lookups. Exactly what data sources they consult and how is highly
system dependent; you really need to just be calling the platform C
library routines. This cuts both ways; if you want just DNS lookups,
do just DNS lookups via something like dnspython.
My thread-using Python programs wind up being built around completion
queues and thread pools; they hand off work to auxiliary threads and
then wait for things to finish. (Sometimes in conjunction with network
IO; see here for how I mix
work completion notification and select() et al.)
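As a rough sketch of that overall structure (the details here are made up, not lifted from any of my actual programs):

import threading, socket, Queue

work = Queue.Queue()
done = Queue.Queue()

def worker():
    # each worker turns blocking gethostbyname() calls into results
    # on the completion queue
    while True:
        host = work.get()
        try:
            done.put((host, socket.gethostbyname(host)))
        except socket.error, e:
            done.put((host, e))

for _ in range(4):
    t = threading.Thread(target = worker)
    t.setDaemon(True)
    t.start()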
(Someday I will have a general 'thread pool' module that I'm happy with. I probably need to write more thread-using programs first.)
2005-12-09
Security versus resilience
A while back I wrote this, about an exception created by the cgi module when crackers submitted XML-RPC calls instead of form POSTs. It makes a great example for discussing the difference between 'secure systems' and 'resilient systems'.
Put broadly, security is keeping people out, while resilience is continuing to operate when people attack you. The cgi module example shows that you can have one without the other. Sometimes this may even be deliberate; an exceptionally paranoid system could shut itself down any time it saw unexpected input, just to be sure. This would be quite secure but not at all resilient.
(There are real systems that are close to this paranoid, for example the PAL systems that try to prevent unauthorized use of nuclear weapons.)
The cgi module seems to be secure (and I say 'seems' only because I haven't personally analyzed the code). To a large extent Python makes it easy to be secure; you are protected from basic issues like buffer overruns, and exceptions force you to handle errors one way or another. Python code may fail, but it almost always fails safely. (This does leave design issues, where the code is right but the algorithm is horribly wrong, but no language can really help there.)
However, resilience is much harder and less common, as the cgi module example demonstrates (and there are a number of other ways to make programs using the cgi module unhappy). If this is sloppy programming on the part of the cgi module, then such sloppy programming is practically endemic; truly paranoid programming, even for network applications, is still rare. (And I'm not going to claim that I've managed it.)
I think that resilience is in general harder than security. Security is all about confining things and making sure that things don't happen, whereas resilience is about thinking about everything that could go wrong. This makes resilience much more of an open-ended problem than security, with many more things to think about and keep track of.
Because resilience is about 'what can go wrong?', it also needs you to go behind the convenient abstractions, like 'network IO is just a stream of bytes'. (It is, but it's a stream of bytes that may come very slowly or very fast, not come at all, or be incomplete. What happens to your program in each case?)
On a concrete level, I'm pretty confident in DWiki's security, and its design has a certain amount of thought put into the issues. I'm equally confident that DWiki is not resilient and that there are a bunch of ways (even without writing comments) to hammer it. (DWiki gets a certain amount of resilience from being run as a CGI-BIN by Apache, but this only goes so far.)
2005-12-03
How to do TCP keepalives in Python
TCP keepalives are do-nothing packets the TCP layer can send to see if a connection is still alive or if the remote end has gone unreachable (due to a machine crash, a network problem, or whatever). Keepalives are not default TCP behavior (at least not in any TCP stack that conforms to the RFCs), so you have to specifically turn them on. (There are various reasons why this is sensible.)
In Python you can do this with the .setsockopt() socket method,
using the socket.SO_KEEPALIVE option and setting an integer value
of 1. The only mystery is what the level parameter should be;
despite what you might guess, it is socket.SOL_SOCKET. So a
complete code example is:
import socket

def setkeepalives(sck):
    sck.setsockopt(socket.SOL_SOCKET, \
                   socket.SO_KEEPALIVE, 1)
Various sources recommend turning keepalives on as soon as possible after you have the socket.
(Keepalives are only applicable to TCP sockets, so one might expect
SOL_TCP or at least SOL_IP, but no; they are a generic socket
level option. Go figure.)
On Linux, you can control various bits of keepalive behavior by
setting the additional SOL_TCP integer parameters
TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT; Python
defines them all in the socket module. See the tcp(7) manpage for
details. The default values are found in /proc/sys/net/ipv4 in the
files tcp_keepalive_time, tcp_keepalive_intvl, and
tcp_keepalive_probes, and are fairly large.
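For example, here is a hedged sketch of tightening them on a Linux socket (the specific numbers are just illustrations):

# idle seconds before the first probe, seconds between probes, and
# how many unanswered probes before the connection is declared dead
sck.setsockopt(socket.SOL_TCP, socket.TCP_KEEPIDLE, 60)
sck.setsockopt(socket.SOL_TCP, socket.TCP_KEEPINTVL, 10)
sck.setsockopt(socket.SOL_TCP, socket.TCP_KEEPCNT, 5)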