2014-05-22
Why Python uses bytecode (well, probably): it's simpler
A while back I read 'Why are there so many Pythons?', which in passing talks about Python's internal use of bytecode and says:
In very brief terms: machine code is much faster, but bytecode is more portable and secure.
If you replace 'is' with 'can be', this is true. But it's not the reason that the main implementation of Python (hereafter 'CPython') uses bytecode. One clue that portability isn't the reason is that the .pyc files of bytecode are not all that portable; they can differ between Python versions and possibly even between different types of machines with the same version of CPython.
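One way to see the version dependence for yourself is to look at the 'magic number' that CPython writes at the start of every .pyc file. This is only a small sketch and it assumes a modern Python 3, where the value is exposed as importlib.util.MAGIC_NUMBER (in the Python 2 of this entry's era the equivalent was imp.get_magic()). The variable names are mine.

```python
import importlib.util
import sys

# The first bytes of every .pyc written by this interpreter.
# The value changes whenever the bytecode format changes.
magic = importlib.util.MAGIC_NUMBER

# The first two bytes are a little-endian version word; the last
# two are always b'\r\n'.
version_word = int.from_bytes(magic[:2], "little")

print("Python %d.%d magic: %r (version word %d)"
      % (sys.version_info[0], sys.version_info[1], magic, version_word))
```

A .pyc whose magic number doesn't match the running interpreter is not used as-is; CPython recompiles from the source instead.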
Put simply, CPython almost certainly uses bytecode because creating and then interpreting bytecode is a common implementation technique for writing reasonably complex interpreters. All interpreters need to parse their source language and then interpret (and execute) something, but it's often simpler (and faster) to transform the initial source language into some simpler format before interpreting it. A common 'simpler format' is some form of abstract bytecode, often extremely specialized to the language being interpreted and also how data is stored inside the interpreter.
(On modern CPUs, another advantage of transforming things from an abstract syntax tree of the parsed code to bytecode is that the bytecode can be made linear in memory and thus more cache friendly. Modern CPUs really don't like bouncing around all over to follow pointers; the less you do it the better.)
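To make the general technique concrete, here is a toy sketch of it: parse an arithmetic expression into a tree, flatten the tree into a linear list of instructions, and then interpret that list with a small stack. None of this is CPython's code. The opcode names and the compile_expr and run functions are invented for illustration, and it assumes Python 3.8 or later (for ast.Constant).

```python
import ast
import operator

# Hypothetical toy opcodes; these are not CPython's.
PUSH, ADD, SUB, MUL = "PUSH", "ADD", "SUB", "MUL"

BINOPS = {ast.Add: ADD, ast.Sub: SUB, ast.Mult: MUL}
IMPL = {ADD: operator.add, SUB: operator.sub, MUL: operator.mul}

def compile_expr(node, code):
    """Flatten an expression tree into a linear list of (opcode, arg) pairs."""
    if isinstance(node, ast.Expression):
        compile_expr(node.body, code)
    elif isinstance(node, ast.Constant):
        code.append((PUSH, node.value))
    elif isinstance(node, ast.BinOp) and type(node.op) in BINOPS:
        # Operands first, then the operator: postfix order.
        compile_expr(node.left, code)
        compile_expr(node.right, code)
        code.append((BINOPS[type(node.op)], None))
    else:
        raise ValueError("unsupported syntax: %r" % (node,))
    return code

def run(code):
    """Interpret the linear instruction list with a value stack."""
    stack = []
    for op, arg in code:
        if op == PUSH:
            stack.append(arg)
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(IMPL[op](left, right))
    return stack.pop()

program = compile_expr(ast.parse("1 + 2 * 3", mode="eval"), [])
result = run(program)
```

The interpreter loop never has to walk the tree or chase child pointers; it just steps through a flat list, which is the property the aside about caches is talking about.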
CPython's bytecode is just this sort of abstract bytecode. In theory you could describe it as a simple stack machine, but in practice the stack is only used for storing temporary values while computing expressions and the like. Actual Python variables are accessed through a whole series of specialized bytecode instructions. The bytecode also has special instructions for things like creating instances of standard types such as lists and for dealing with iterators, none of which could be described as general-purpose outside of Python or as anything like the instructions of a real computer.
(And sometimes the exact details of this bytecode matter.)
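You can look at these specialized instructions with the standard dis module. This is a small sketch assuming Python 3.4 or later (for dis.get_instructions); make_pair is just a made-up function. The exact opcode names vary between CPython versions, which is itself a reminder of how version-specific the bytecode is.

```python
import dis

def make_pair(a):
    # One local variable load, one list construction, one local store.
    b = [a, 1]
    return b

# Collect the opcode names CPython generated for the function.
opnames = [ins.opname for ins in dis.get_instructions(make_pair)]

# Print the full human-readable listing.
dis.dis(make_pair)
```

In the listing you should see a LOAD_FAST style instruction for reading the local variable 'a' and a BUILD_LIST instruction for creating the list, neither of which looks much like anything a real CPU would offer.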
As for security, Python bytecode is not necessarily all that secure by itself. While it doesn't allow you to perform random machine operations, I wouldn't be surprised if hand-generated crazy instruction sequences could crash CPython (in fact, I'm pretty confident that doing this is relatively trivial) and even lead to arbitrary code execution. The CPython bytecode interpreter is not intended as a general interpreter but instead as an interpreter for bytecode generated by CPython itself, which is guaranteed to obey the rules and not do things like attempt to get or set nonexistent function local variables.
Or to put it directly: it is not safe at all to have CPython run untrusted bytecode, even in a theoretically relatively captive environment. This is completely independent of what access the bytecode might have (or be able to contrive) to standard library functions like file and network access. Untrusted bytecode doesn't need access to stuff like that to wreak havoc.
(I can't be absolutely sure that this is why CPython uses bytecode because I haven't asked the Python developers about it, but I would be truly surprised if it was any other reason. Compiling to bytecode and interpreting the bytecode is a classic and standard interpreter implementation technique and CPython itself is a pretty classic implementation of it.)
2014-05-09
Some uses for Python's 'named' form of string formatting
I expect that every Python programmer is familiar with Python's
normal way of formatting strings with % and 'printf' style format
specifications. Let's call this normal way of formatting things a
'positional' way, because it's based on the position of the arguments
given to be formatted. But as experienced Python programmers know,
this is not the only way you can set up your formatting strings;
you can also set them up so that they pick out what to format where
based on name instead of argument position. Of course to do this
you need to somehow attach names to the arguments, which is done by
giving % a dictionary instead of its usual tuple.
Here's what this looks like, for people who haven't seen it before:
print "%(fred)d %(barney)d" % {'fred': 1, 'bob': 2, 'barney': 3}
Note that not all keys in the dictionary need to be used in the format string, unlike with positional arguments.
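Here is a small sketch of both sides of this: extra keys are ignored, but a name that the format string asks for and the dictionary lacks raises KeyError. It avoids the print statement so that it runs unchanged on both Python 2 and Python 3; the 'wilma' key is made up for the demonstration.

```python
values = {'fred': 1, 'bob': 2, 'barney': 3}

# Extra keys ('bob' here) are silently ignored.
line = "%(fred)d %(barney)d" % values

# A name that is missing from the dictionary raises KeyError.
try:
    "%(wilma)d" % values
    missing = None
except KeyError as e:
    missing = e.args[0]
```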
There are two general uses for named string format specifications,
both of which usually start in a situation where the format
specification itself is variable. The simple and straightforward
use is rearranging the order of what gets printed, which can really
come in handy for things like translating messages into different
languages (this is apparently a sufficiently common need that it
got its own feature in Python 3's new string formatting stuff). The
more complex use is to print only a subset of information from a
larger collection of available information. Effectively this makes
'%' string formatting into a little templating system.
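As a sketch of both uses at once, here is one dictionary of information formatted through two templates. The second template puts the fields in a different order, and both use only a subset of what's available. The data and the template keys ('en' and 'reordered') are made up for illustration.

```python
info = {'user': 'fred', 'count': 3, 'host': 'bedrock'}

# Hypothetical message templates. Each one picks its own order and its
# own subset of the available fields; 'host' is never used.
templates = {
    'en': "%(user)s has %(count)d new messages",
    'reordered': "%(count)d new messages for %(user)s",
}

messages = dict((name, fmt % info) for name, fmt in templates.items())
```

With positional formatting, the second template would have required the calling code to pass its arguments in a different order.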
My uses of this have tended to be towards full blown templating where the person configuring my program is trusted to write the formatting strings (note that this can at least throw exceptions if they get it wrong). I can see uses for this in simpler setups, for example to log a number of different messages with somewhat different information depending on some combination of things. Rather than write full blown and repetitive code to explicitly emit N variations of the same logging call, you could just select different name-based formatting strings based on the specific circumstances.
(I'll have to remember to experiment with this idea the next time I have this need. It feels like this might be an interesting new approach to deal with the whole issue of verbosity and including or not including certain bits of information and so on, which can otherwise clutter up the code something awful and be annoying to program.)
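A minimal sketch of what this might look like, with everything in it (the FORMATS table, the describe function, the event fields) invented for illustration:

```python
# Hypothetical format strings, keyed by how much detail is wanted.
FORMATS = {
    'terse': "%(action)s %(path)s",
    'normal': "%(action)s %(path)s by %(user)s",
    'verbose': "%(action)s %(path)s by %(user)s from %(host)s (%(size)d bytes)",
}

def describe(event, level='normal'):
    """Format one event dictionary with the format string for this level."""
    return FORMATS[level] % event

event = {'action': 'upload', 'path': '/tmp/x', 'user': 'fred',
         'host': 'bedrock', 'size': 1024}

terse = describe(event, 'terse')
verbose = describe(event, 'verbose')
```

The code that gathers the information always builds the same full dictionary; only the choice of format string varies, so there is one formatting call instead of N slightly different ones.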
PS: Python 3's string formatting does this differently. Following my current policy on Python 3 I'm not thinking about it at all.