2006-08-31
SIGCHLD versus Python: a problem of semantics
In the process of looking at my program's code again to write the last entry, I think I may have solved the mystery of how my impossible exception gets generated.
My program does a lot of forking and thus cleanups of now-dead children. The code that it generally dies on is:
def _delip(pid, ip):
    del ipmap[ip][pid]
    if len(ipmap[ip]) == 0:
        del ipmap[ip]
It takes a KeyError on the len(ipmap[ip]) operation and goes down.
(Because of previous fun, the main
thread forks all the children and waits for them, so this kills the
entire program.)
Clearly there is some concurrency problem, but my problem with the
exception was that I could never see where it could come from. The main
thread is the only thread that adds or removes things from the ipmap
dictionary, and the SIGCHLD handler that reaps children is only active
when the thread is idling in select() (partly to avoid just this
sort of concurrency issue).
To avoid various problems and keep things sane, Unix SIGCHLD
handlers are not reentrant by default; the kernel blocks SIGCHLD while
the handler is running, so even if more children die, you won't receive
a second SIGCHLD until you return from the signal handler. (This is
an interesting source of bugs if you bail out of the signal handler
without telling the kernel, and is one reason for the existence of
siglongjmp().)
And in thinking about all of this I came to a horrible realization:
those are Unix semantics, not Python semantics. Python does not
run your Python-level SIGCHLD handler from the actual C level signal
handler; it runs it from the regular bytecode interpreter. All the C
level SIGCHLD handler does is set a flag telling the interpreter to
run your SIGCHLD handler at the next bytecode, where it gets treated
pretty much as an ordinary function call.
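A minimal sketch of this behaviour, assuming a Unix system and a program running in its main thread. The names on_sigchld and busy_work are made up for the illustration. The handler records which Python function was executing when the interpreter got around to calling it, which shows that it runs on top of ordinary Python code rather than in a separate signal context:

```python
import os
import signal
import time

seen = []  # names of the Python functions the handler interrupted


def on_sigchld(signum, frame):
    # 'frame' is the Python frame that was executing when the interpreter
    # noticed the pending signal and called this handler.
    seen.append(frame.f_code.co_name)


def busy_work():
    # Send ourselves a SIGCHLD, then keep executing bytecode until the
    # Python-level handler has run (or we give up after five seconds).
    os.kill(os.getpid(), signal.SIGCHLD)
    deadline = time.monotonic() + 5.0
    while not seen and time.monotonic() < deadline:
        pass


previous = signal.signal(signal.SIGCHLD, on_sigchld)
try:
    busy_work()
finally:
    # Restore whatever was installed before; None means the old handler
    # was not installed from Python, so fall back to the default.
    signal.signal(signal.SIGCHLD,
                  previous if previous is not None else signal.SIG_DFL)

print(seen)
```

The handler sees busy_work as the interrupted function, just as if busy_work had called it directly.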
This would neatly explain my mysterious exceptions. When there are two
connections from an IP address and both of them die in quick succession,
if we are extremely unlucky the SIGCHLD for the second will be
processed between the first and second statements of _delip's body and
delete the ipmap[ip] dictionary entry out from underneath the first.
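The interleaving can be reproduced deterministically without any real signals. This is only a simulation under stated assumptions: the interrupt parameter is a hypothetical hook that stands in for the interpreter running the SIGCHLD handler between the two statements, and the IP address, pids, and values are invented.

```python
ipmap = {}


def _delip(pid, ip, interrupt=None):
    del ipmap[ip][pid]
    # Hypothetical hook: stands in for the interpreter running the
    # Python-level SIGCHLD handler between these two statements.
    if interrupt is not None:
        interrupt()
    if len(ipmap[ip]) == 0:
        del ipmap[ip]


# Two connections (children) from the same IP address.
ipmap["192.0.2.1"] = {101: "conn-a", 102: "conn-b"}


def second_child_dies():
    # What the nested handler invocation would end up doing.
    _delip(102, "192.0.2.1")


try:
    _delip(101, "192.0.2.1", interrupt=second_child_dies)
    result = "ok"
except KeyError as exc:
    result = "KeyError: %r" % (exc.args[0],)

print(result)
```

The nested call empties and removes ipmap[ip], so the outer call's len(ipmap[ip]) fails with exactly the KeyError described above.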
I personally believe that this is a bug in the CPython interpreter, but even if I can persuade the Python people of this, I still need to come up with a Python-level workaround for the meantime (ideally one that doesn't involve too much code reorganization).
2006-08-30
A problem with debugging threaded Python programs
I have a heavily threaded Python program that dies every now and then with a mysterious exception (that as far as I can see just shouldn't happen, which means that I don't completely understand my code).
My off and on attempts to debug this have pointed out a frustrating problem in Python's (lack of) support for debugging threaded Python programs: there's no such thing as a global exception, something that will freeze, backtrace, and terminate all threads when an error occurs. Instead, exceptions only capture the state of the thread they happen in, and only kill it.
(My program goes down because the exception is happening in the main thread, and all the other threads are short-lived.)
The irony is that Python's single-threaded bytecode interpreter would make this relatively easy to do. You could guarantee that the next bytecode-level action every other thread did was to throw an exception, and you'd be capturing pretty close to the exact state when the starting exception was created. (With some more work, you could make it an exact capture; just keep track of the last bytecode executed and generate the exception against it.)
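Short of a true global exception, CPython does offer a partial approximation: sys._current_frames() returns the current top frame of every thread, which is enough to print a backtrace for all of them at the moment an error is noticed. The sketch below is only that. It inspects the other threads but does not freeze or terminate them, and the helper name dump_all_threads and the worker thread are invented for the example.

```python
import sys
import threading
import traceback


def dump_all_threads():
    """Return {thread name: formatted stack} for every live thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    return {
        names.get(ident, str(ident)): "".join(traceback.format_stack(frame))
        for ident, frame in sys._current_frames().items()
    }


started = threading.Event()
release = threading.Event()


def worker():
    started.set()
    release.wait()  # park here so there is something to inspect


t = threading.Thread(target=worker, name="worker-1")
t.start()
started.wait()

stacks = dump_all_threads()  # snapshot taken while worker-1 is still alive

release.set()
t.join()

for name, stack in stacks.items():
    print("Thread %s:\n%s" % (name, stack))
```

Because the other threads keep running while the snapshot is taken, the stacks are only approximately simultaneous, which is exactly the gap a real global exception would close.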
2006-08-07
A problem in Python's implementation of closures
Python's implementation of closures for inner functions has a well known problem: you can't mutate the binding of captured outer variables. In other words, the following code does not work:
def cfunc(a):
    def _if(b):
        a = a + b
        return a
    return _if
There is nothing in the semantics of Python that requires this result.
Unlike the case of writing to a global variable in a function, what the
a variable refers to in the scope of _if is completely unambiguous
at all times.
I could make excuses for CPython, but the problem is pretty much
there deliberately; while there is a special bytecode instruction
(LOAD_DEREF) to read the value of captured variables in a
closure, the compiler never emits an instruction to write to them from
the inner function. In the
absence of the ability to do anything else, the interpreter does its
standard thing and treats any variable that
is stored to in the function as function local (barring a global
declaration).
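The effect is easy to observe. In this small sketch (the names add and outcome are just for the example), the assignment makes a a local variable of _if, so reading it on the right-hand side fails before it has ever been bound:

```python
def cfunc(a):
    def _if(b):
        a = a + b  # the store makes 'a' local to _if
        return a
    return _if


add = cfunc(10)

try:
    add(5)
    outcome = "no error"
except UnboundLocalError:
    outcome = "UnboundLocalError"

print(outcome)
# 'a' is compiled as a plain local of _if, not as a captured variable.
print(add.__code__.co_varnames)
print(add.__code__.co_freevars)
```

The code object confirms what happened: a shows up among the inner function's locals and nothing at all is captured from cfunc.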
The careful phrasing I have had to use in the first paragraph shows the
way around this problem: while you can't change the binding of captured
outer variables in a closure, you can mutate their value directly if the
type of their value allows this. The classic way to do this is to make
the desired variables into arrays (lists, in Python), and then mutate the
array contents.
So we would write cfunc as:
def cfunc(a):
    t = [a]
    def _if(b):
        t[0] = t[0] + b
        return t[0]
    return _if
This version does what we want it to, at the expense of a certain amount of ugliness.
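A short usage sketch (the variable names are only illustrative) shows that each closure keeps its own running total in its own one-element list:

```python
def cfunc(a):
    t = [a]  # one-element list holding the mutable state

    def _if(b):
        t[0] = t[0] + b  # mutates the list; 't' itself is never rebound
        return t[0]
    return _if


acc = cfunc(10)
first = acc(5)    # 10 + 5
second = acc(1)   # 15 + 1

other = cfunc(0)  # a separate closure with its own list
third = other(2)

print(first, second, third)  # 15 16 2
```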
(Credit where credit is due department: I think I first saw the array trick in the sample WSGI server code in its specification.)