Why Python uses bytecode (well, probably): it's simpler

May 22, 2014

A while back I read "Why are there so many Pythons?", which in passing talks about Python's internal use of bytecode and says:

In very brief terms: machine code is much faster, but bytecode is more portable and secure.

If you replace 'is' with 'can be', this is true. But it's not the reason that the main implementation of Python (hereafter 'CPython') uses bytecode. One clue that portability isn't the reason is that the .pyc files that hold the bytecode are not all that portable; they can differ between Python versions and possibly even between different types of machines running the same version of CPython.
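One way to see the version dependence for yourself is to look at how CPython tags its .pyc files. The following is a minimal sketch, not a statement about how any particular version lays out its files; the file name "example.py" is a made-up placeholder and nothing is read or written on disk.

```python
import importlib.util
import sys

# Every .pyc file starts with a magic number that identifies the bytecode
# format it was written in. CPython refuses to load a .pyc whose magic
# number does not match the running interpreter's.
magic = importlib.util.MAGIC_NUMBER
print("magic number:", magic)

# The cache tag (something like 'cpython-NNN') is baked into .pyc file
# names, so different versions keep separate .pyc files side by side.
tag = sys.implementation.cache_tag
print("cache tag:", tag)

# Where this interpreter would cache the bytecode for a hypothetical
# source file called example.py.
pyc_path = importlib.util.cache_from_source("example.py")
print("pyc path:", pyc_path)
```

Running this under two different CPython versions should print different magic numbers and different cache tags, which is the opposite of what you would design if portable bytecode were the goal.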

Put simply, CPython almost certainly uses bytecode because creating and then interpreting bytecode is a common implementation technique for writing reasonably complex interpreters. All interpreters need to parse their source language and then interpret (and execute) something, but it's often simpler (and faster) to transform the initial source language into some simpler format before interpreting it. A common 'simpler format' is some form of abstract bytecode, often extremely specialized to the language being interpreted and also how data is stored inside the interpreter.
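To make the 'transform, then interpret' split concrete, here is a small sketch using the standard compile() builtin and the dis module. The source string and variable names are invented for illustration, and the exact instructions printed will vary between CPython versions.

```python
import dis

# Step 1: parse the source and transform it into a code object. The
# bytecode itself lives in co_code as a single bytes object.
source = "total = price * quantity + shipping"
code = compile(source, "<example>", "exec")
print(type(code.co_code), len(code.co_code), "bytes of bytecode")

# Show the bytecode in human-readable form (output is version dependent).
dis.dis(code)

# Step 2: have the interpreter execute the already-compiled bytecode.
namespace = {"price": 3, "quantity": 4, "shipping": 5}
exec(code, namespace)
print(namespace["total"])  # 17
```

The parsing work happens once, in compile(); exec() only has to walk the simpler bytecode form.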

(On modern CPUs, another advantage of transforming things from an abstract syntax tree of the parsed code to bytecode is that the bytecode can be made linear in memory and thus more cache friendly. Modern CPUs really don't like bouncing around all over to follow pointers; the less you do it the better.)

CPython's bytecode is just this sort of abstract bytecode. In theory you could describe it as a simple stack machine, but in practice the stack is only used for storing temporary values while computing expressions and so on. Actual Python variables and so on are accessed through a whole series of specialized bytecode instructions. The bytecode also has special instructions for things like creating instances of standard types such as lists and for dealing with iterators. None of these are general-purpose outside of Python, and none of them look anything like what a real computer does.
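A quick way to see these specialized instructions is to disassemble a small function. This is only a sketch: pair_sum is a made-up function, and the exact opcode names differ between CPython versions (for example, the instructions for loading local variables have been renamed and fused in recent releases).

```python
import dis

def pair_sum(a, b):
    items = [a, b]      # building a list gets its own instruction
    total = 0
    for x in items:     # so does stepping an iterator
        total += x
    return total

# Collect just the instruction names, in order.
opnames = [ins.opname for ins in dis.get_instructions(pair_sum)]
print(opnames)

# Pick out the Python-specific instructions: local variable access,
# list construction, and iteration.
special = [name for name in opnames
           if "FAST" in name or name.startswith("BUILD_LIST") or "ITER" in name]
print(special)
```

Local variables go through their own family of 'fast' load and store instructions rather than through the stack machine's general machinery, and both list creation and loop iteration are single high-level instructions.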

(And sometimes the exact details of this bytecode matter.)

As for security, Python bytecode is not necessarily all that secure by itself. While it doesn't allow you to perform arbitrary machine operations directly, I wouldn't be surprised if hand-generated crazy instruction sequences could do things like crash CPython (in fact, I'm pretty confident that doing this is relatively trivial), and a crash like that can potentially lead to arbitrary code execution. The CPython bytecode interpreter is not intended as a general interpreter but instead as an interpreter for bytecode generated by CPython itself, which is guaranteed to obey the rules and not do things like attempt to get or set nonexistent function local variables.

Or to put it directly: it is not safe at all to have CPython run untrusted bytecode, even in a theoretically relatively captive environment. This is completely independent of what access the bytecode might have (or be able to contrive) to standard library functions like file and network access. Untrusted bytecode doesn't need access to stuff like that to wreak havoc.

(I can't be absolutely sure that this is why CPython uses bytecode because I haven't asked the Python developers about it, but I would be truly surprised if it was any other reason. Compiling to bytecode and interpreting the bytecode is a classic and standard interpreter implementation technique and CPython itself is a pretty classic implementation of it.)

Written on 22 May 2014.
Last modified: Thu May 22 02:27:53 2014