2013-11-27
The difference between CPython and Python
Sometimes when I'm writing about Python things, I talk about 'CPython' (as I did yesterday). This is insider jargon; CPython is the term of art that's generally used when we're specifically referring to the behavior of the main implementation of Python (which is written in C, hence the 'CPython' coinage). This is the implementation that gets most of the publicity and a starring role on python.org. CPython is the Python that is 'version 2.7.6' and 'version 3.3.3' (as of right now) and is what the core Python developers work on. But it's not the only implementation of Python that exists. Today the most prominent other implementation of Python is probably PyPy; other implementations include Jython (Python in the JVM) and IronPython (Python in the CLR).
CPython is the original version of Python and for a long time it was the only Python that existed. It's still the authoritative version that everyone else is expected to be compatible with, because there is no comprehensive language specification for 'Python the language'. This is pretty common with all sorts of languages these days, which are generally implemented first and standardized later if at all.
(Among other reasons for this, writing a comprehensive language specification is a lot of work and then it is even more work to keep updating it as you change the language. And you don't really know if your specification was comprehensive enough until some crazy person attempts a second implementation purely from the specification without looking at how your language implementation behaves. If their implementation is fully compatible, your specification was a good one.)
I (and others) talk about CPython when we're talking about things that are specific to how CPython is implemented, that are specifically documented as implementation dependent, or that are simply likely to be implementation dependent rather than slavishly copied by everyone who does a Python implementation. For obvious reasons, pretty much all of the low level details of how CPython works fall into this general category; e.g., other Python implementations are unlikely to copy CPython's bytecode architecture. Where the boundary is between low level behavior and high level behavior is an interesting and sometimes debatable question (as is what is likely to wind up being implementation dependent).
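If you want to check which implementation your code is running on, Python exposes this directly; here's a minimal sketch (sys.implementation needs Python 3.3+, while platform.python_implementation() is older). The small-integer caching shown at the end is a classic example of CPython-specific behavior that the language itself does not promise:

```python
import platform
import sys

# Report which Python implementation is running this code.
print(platform.python_implementation())  # e.g. 'CPython' or 'PyPy'
print(sys.implementation.name)           # e.g. 'cpython' (Python 3.3+)

# An example of implementation-dependent behavior: CPython caches
# small integers, so identity checks on them can succeed, but no
# part of the language definition guarantees this.
a, b = 256, 256
print(a is b)  # True on CPython, but not promised by 'Python the language'
```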
(Note that all Python implementations have the 'Python 2 vs Python 3' issue, because the changes between Python 2 and Python 3 are general language changes.)
2013-11-25
From CPython bytecode up to function objects (in brief)
Python bytecode is the low level heart of (C)Python; it's what the CPython interpreter actually processes in order to run your Python code. The dis module is the main source of information on examining bytecode and on the bytecodes themselves. But CPython doesn't just run bytecode in isolation. In practice bytecode is always part of some other object, partly because bytecode by itself is not self-contained; it relies on various other things for context.
Bytecode by itself looks like this:
>>> fred.func_code.co_code
'|\x00\x00G|\x01\x00GHd\x00\x00S'

(That's authentic bytecode; you can feed it to dis.dis() to see what it means in isolation.)
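As a sketch of what that looks like in practice: the quoted bytecode above is from Python 2, where the code object lives in fred.func_code; in Python 3 the attribute is spelled fred.__code__, and the analogous function disassembles like this:

```python
import dis

# A Python 3 analogue of the Python 2 fred whose bytecode is quoted
# above (in Python 2 it would have been 'print a' and 'print b').
def fred(a, b):
    print(a)
    print(b)

raw = fred.__code__.co_code
print(type(raw))   # bytes in Python 3 (a str in Python 2)

# dis.dis() accepts functions, code objects, or raw bytecode strings.
dis.dis(fred)
```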
I believe that Python bytecode is always found embedded in a code
object. Code objects have two sorts of additional attributes; attributes
which provide the necessary surrounding context that the bytecode itself
needs, and attributes that just have information about the code that's
useful for debugging. Examples of context attributes are co_consts,
a tuple of constants used in the bytecode, and co_nlocals,
the number of local variables that the code uses. Examples of
information attributes are co_filename, co_firstlineno, and even
co_varnames (which tells you what local variable N is called).
Note that the context attributes are absolutely essential; bytecode
is not self-contained and cannot be run in isolation without them.
Many bytecodes simply do things like 'load constant 0'; if you
don't know what constant 0 is, you're not going to get far with the
bytecode. It is the code object that tells you this necessary stuff.
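You can see both sorts of attributes by poking at a code object directly; a minimal sketch (using the Python 3 __code__ spelling; Python 2 calls it func_code):

```python
def fred(a, b):
    x = a + b
    return x

code = fred.__code__   # fred.func_code in Python 2

# Context attributes: the bytecode cannot run without these.
print(code.co_consts)    # tuple of constants used by LOAD_CONST et al
print(code.co_nlocals)   # 3: the local variables a, b, and x

# Information attributes: useful for debugging and tracebacks.
print(code.co_filename)
print(code.co_firstlineno)
print(code.co_varnames)  # ('a', 'b', 'x'): the name of each local slot
```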
Most code objects are embedded in function objects (as the func_code
attribute). Function objects supply some additional context
attributes that are specific to using a piece of code as a function,
as well as another collection of information about the function
(most prominently func_doc, the function's docstring if
any). As it happens, all of the special function attributes are
documented reasonably well in the official Python data model, along with code
objects and much more.
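To illustrate, here's a small sketch of the function-level attributes (shown with the Python 3 spellings; Python 2 named these func_doc, func_defaults, and func_code respectively):

```python
def greet(name, punct="!"):
    "Return a greeting for name."
    return "hello, " + name + punct

# Python 3 attribute names; Python 2 used the func_* spellings.
print(greet.__doc__)       # the docstring
print(greet.__defaults__)  # ('!',): default argument values
print(greet.__code__)      # the embedded code object
```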
(Because I just looked it up, the mysterious func_dict property is
another name for a function's __dict__ attribute, which is used
to allow you to add arbitrary properties to a function. See PEP 232. Note that functions don't
actually have a dictionary object attached to func_dict until you
look at it or otherwise need it.)
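In use it looks like this (a minimal sketch, using the Python 3 __dict__ spelling; func_dict is the Python 2 alias):

```python
def f():
    pass

# Per PEP 232, setting an arbitrary attribute on a function stores
# it in the function's __dict__ (aka func_dict in Python 2).
f.hits = 1
print(f.__dict__)  # {'hits': 1}
print(f.hits)      # 1
```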
Function objects themselves are frequently found embedded in instance method objects, which are used for methods on classes (whether bound to an object that's an instance of the class or unbound). But that's as far up the stack as I want to go today and anyways, instance method objects only have three attributes and they're all pretty obvious.
(If you have a class A with a method function fred, A.fred is
actually an (unbound) instance method object. The fred function itself
is A.fred.im_func, or if you want, A.__dict__["fred"].)
Note that not all code objects are embedded in function objects. For
example, if you call compile() what you get back is a bare code
object. I suspect that module level code winds up as a code object
before getting run by the interpreter, but I haven't looked at the
interpreter source to see so don't quote me on that.
(This entry was inspired by reading this introduction to the CPython interpreter (via Hacker News), which goes at things from the other direction.)
2013-11-17
Sending and receiving file descriptors in Python
On some but not all modern Unix systems, file descriptors (the
underlying operating system level thing behind open files, sockets, and
so on) can be passed between cooperating processes using Unix domain
sockets and special options to sendmsg() and recvmsg(). There are a
number of uses for this under various circumstances; the one that I'm
interested in is selectively offloading incoming network connections
from one process to another one that is better suited to handle some
particular connections.
In Python 3.3 and later, doing this is simple because
it is directly supported by the socket module. The documentation
even includes code examples for both sendmsg() and recvmsg(),
which is handy because they don't exactly have the most Pythonic
of interfaces; instead they're basically thin covers over the system
call data structures. If you are receiving file descriptors that
are sockets you're still left with the socket .fromfd() problem.
(I was encouraged to report the socket fd problem as an actual Python bug, where it has quietly been neglected just as I expected.)
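For the Python 3.3+ case, here's a minimal sketch of fd passing with sendmsg() and recvmsg() over SCM_RIGHTS ancillary data (modeled on the approach in the socket module documentation; the socketpair-and-pipe demo here keeps everything in one process for simplicity, where normally the two ends would be in different processes):

```python
import array
import os
import socket

def send_fd(sock, fd):
    # The fd travels as SCM_RIGHTS ancillary data; you must also
    # send at least one byte of ordinary data alongside it.
    sock.sendmsg([b"x"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                           array.array("i", [fd]))])

def recv_fd(sock):
    fds = array.array("i")
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[:fds.itemsize])
            return fds[0]
    raise RuntimeError("no file descriptor received")

left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
rpipe, wpipe = os.pipe()
os.write(wpipe, b"hello")
send_fd(left, rpipe)
newfd = recv_fd(right)       # a duplicate of rpipe's descriptor
payload = os.read(newfd, 5)
print(payload)               # b'hello'
```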
Unfortunately Python 2 does not have any support for this in the
socket module, thereby
creating yet another gratuitous Python 2 to Python 3 difference.
Fortunately a number of people have written add-on modules to support
this; the ones I found in a casual Internet search are python-passfd, python-fdsend, and sendmsg (which is notably lacking
in documentation). Of these, python-fdsend seems to have the
best API (and is packaged for Debian and Ubuntu); I expect that
it's what I'll use if (or when) I need this feature in my Python 2
code. Note that it doesn't solve the socket .fromfd() problem.
If you're sending sockets to another process, remember that it is safe
to call .close() on them afterwards but it is not safe to call
.shutdown() on them; as I discovered, shutdown() is a global
operation on a socket and applies to
all file descriptors for it, including ones now held by other processes.