Wandering Thoughts archives

2013-04-28

My sysadmin view of Python virtualenvs

It all started with a tweet from Matt Simmons:

Dear #Python devs: I'm reading this: (link) - How are virtualenvs not a security nightmare?

There are certainly many things that can go wrong with virtualenvs, but there are also many things that can go wrong with servers and OS packages (as I tweeted, you can have an obscure one-off server just as easily as you can have an obscure one-off virtualenv). My views on this are that there are both drawbacks and advantages to virtualenvs and to lesser solutions (like installing your own copies of packages outside of the system Python area).

There are three drawbacks of virtualenvs and similar setups. First and foremost, you (the person building the virtualenv) have just become not a sysadmin but an OS distribution vendor in that it is now your job to track security issues and bugs in everything in use in the virtualenv, from the version of Python on up. If you are not plugged into all of these, Matt Simmons is correct and your virtualenv may be a ticking time bomb of security issues.

The second drawback is common to anything that installs packages outside of the standard packaging system: the lack of system-wide visibility into what packages (and what versions of them) are installed and in use on the system. If someone hears that there is an important issue with version X of package Y, having a horde of virtualenvs means that there is no simple way to answer the question 'are we running that?' Related to this is the issue that you can't update every virtualenv at once just by installing a system package update.

(It follows from these two issues that developers absolutely cannot just bundle up a virtualenv, throw it over the wall to operations, and then forget about it. If you do that you're begging for bad problems down the line.)
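
One partial way to claw back some of that system-wide visibility is a little scripting. Here is a hedged sketch that assumes, purely as a made-up local convention, that all of your virtualenvs live under a single directory such as /srv/venvs; it simply asks each virtualenv's own pip what it has installed.

import os
import subprocess
import sys

# Made-up local convention for where the virtualenvs live; adjust to taste.
VENV_ROOT = '/srv/venvs'

for name in sorted(os.listdir(VENV_ROOT)):
    pip = os.path.join(VENV_ROOT, name, 'bin', 'pip')
    if not os.path.exists(pip):
        continue
    # 'pip freeze' reports each installed package with its exact version.
    pkgs = subprocess.check_output([pip, 'freeze'])
    sys.stdout.write('%s:\n%s\n' % (name, pkgs.decode()))

Of course this only works if your virtualenvs really are gathered in one known place, which is part of why having such conventions matters.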

The final issue is that if you depend on virtualenvs you may run into problems integrating your software into environments that basically must use the system version of Python. One example is if you develop in a virtualenv and then decide that you want to deploy with Apache's mod_wsgi (perhaps because it is unexpectedly good). Presumably if you start down the virtualenv path you've already thought about this.
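
For completeness, mod_wsgi is compiled against one particular Python, so a virtualenv can only be used with it if the virtualenv was built from that same interpreter. Assuming that is the case, one approach that has commonly been used is to run the activate_this.py helper that virtualenv puts in its bin directory at the top of your WSGI script; the path below is made up for illustration.

# Hypothetical path to a virtualenv's activation helper; virtualenv ships
# bin/activate_this.py for embedded situations like this.
activate_this = '/srv/venvs/myapp/bin/activate_this.py'
with open(activate_this) as f:
    exec(f.read(), dict(__file__=activate_this))

# From here on, imports resolve against the virtualenv's packages rather
# than the system Python's site-packages.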

Set against this are two significant advantages. The first advantage is that you get the version of everything that you want without having to fight against the system package management system (which leads to serious problems). This is especially useful if you're using one of the OS distributions with long term support, which in practice means that they have obsolete versions of pretty much everything. The second advantage is that you are not at risk of a package update from your OS distribution blowing up your applications. How much of a real risk you consider this depends on how much trust you place in your OS distribution vendor and what sort of changes they tend to make. Some OSes will happily do major package version changes as the 'simplest' way to fix security issues (or just because a new major version came out and should be compatible); some are much more conservative. With virtualenvs you're isolated from this and you can also take a selective, per-application approach to updates, where some applications are okay with the new version (or are sufficiently unimportant that you'll take the risk) and other applications need to be handled very carefully with a lot of testing.

(I haven't used a full-blown virtualenv, but our single Django app uses a private version of Django because the version of Ubuntu LTS we originally deployed it on had a too-old system version. And yes, tracking Django security updates and so on is kind of a pain.)
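
The mechanics of such a private copy are simple enough. As a hedged sketch (with a made-up path), the application just puts its own copy of the package ahead of the system one on sys.path before importing it:

import sys

# Hypothetical directory holding our private copy of Django; the real
# location is whatever your deployment happens to use.
sys.path.insert(0, '/srv/ourapp/private-packages')

import django
# Confirm we picked up the private copy, not the too-old system version.
print(django.get_version())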

SysadminVirtualenvView written at 00:11:07

2013-04-14

Python's data structures problem

Python has a problem with data structures. Well, actually it has two, which are closely related to each other.

The first problem is what I illustrated yesterday, namely that there is generally no point in building nice sophisticated data structures in your Python code because they won't perform very well. Priority heaps, linked lists, all sorts of trees (B+, red-black, splay, etc), they're all nifty and nice but generally there's no point in even trying. Unless your problem is relatively unusual, you'll be just as fast (if not faster) and write less code by just using and abusing Python's native data types. So what if dictionaries and lists (and a few other things that have been implemented at the C level) aren't quite an exact fit for your problem? They're the best you're going to get.

(I've written about this before, although that was more the general version instead of a Python-focused one.)
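
To make that concrete, here's a small hedged illustration of the 'use the native types' style: finding the most common words in some text. A hand-built balanced tree or priority heap could do the top-N selection, but a plain dict plus sorted() is shorter and, in CPython, almost certainly faster.

def top_words(words, n=10):
    # Count with a plain dict; both the dict and sorted() run at C speed.
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    # Sort (word, count) pairs by count, largest first, and take the top n.
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]

print(top_words("the quick brown fox jumps over the lazy brown dog the end".split(), 3))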

In theory it might make sense to implement your own data structures anyway because they can efficiently support unusual operations that are important to you. In practice my impression is that the performance difference is generally assumed to be large enough that people don't bother doing this unless simpler, more brute-force versions are clearly inadequate.

The second problem is that this isn't really true. Data structures implemented in Python code under CPython are slow but other Python implementations can and do make them fast, sometimes even faster than a similar data structure kludged together with native types. But almost everyone writes for CPython and so they're not going to create these alternate data structures that (eg) PyPy could make fast. In fact sometimes they may kludge up data structures that PyPy et al have a very hard time making go fast; they're fine for CPython but pessimal for other environments.

My view is that this matters if we want Python to ever get fast, because getting fast is going to call for data structures that are amenable to optimization instead of what I've seen called hash-map addiction. But I have no (feasible) ideas for how to fix things and I'm pretty sure that asking people to write data structures in pure Python isn't feasible until there's a benefit to it even in CPython.

(This is in part a slow reaction to Jason Moiron's What's Going On.)

PythonDataStructuresProblem written at 01:55:35

2013-04-13

Classic linked lists versus Python's list (array) type

For reasons beyond the margins of this entry, let's consider a classic linked list implemented in Python. Because I feel like a traditionalist today we'll build it out of Lisp-style cons cells, using about the most minimal and lightweight implementation we can manage:

class Cons(object):
    # __slots__ avoids a per-instance __dict__, which keeps each cell small.
    __slots__ = ('car', 'cdr')
    def __init__(self, car, cdr):
        self.car = car
        self.cdr = cdr
    def __str__(self):
        return '(%s, %s)' % (self.car, self.cdr)

Now let's ask a question: how does the memory use and performance of this compare to just using a Python list (which is not a linked list but instead an array)? I'm going to look purely at building a 1,000 element list element-by-element and I'm going to allow each implementation to append in whatever order is fastest for it. The code:

from itertools import repeat

def native(size):
    # Build up a size-element native list by appending at the end, the
    # cheapest place to add to an array-based list.
    l = []
    for _ in repeat(None, size):
        l.append(0)
    return l

def conslst(size):
    # Build up a size-element linked list by prepending at the head, the
    # cheapest place to add to a cons-cell list.
    h = None
    for _ in repeat(None, size):
        h = Cons(0, h)
    return h

(itertools.repeat is the fastest way to do this loop.)

On a 32-bit machine the 1,000 element native list takes 4,512 bytes. A Cons cell takes 28 bytes (not counting the size of what it points to for either car or cdr) and so 1,000 of them takes 28,000 bytes, a factor of six or so worse than the list.
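
If you want to reproduce this sort of number yourself, sys.getsizeof is one way to do it; note that it reports only an object's own size, not anything the object refers to, which is the same accounting as above. This sketch assumes the Cons class and native() from earlier are already defined.

import sys

# Size of the list object itself (header plus its array of element
# pointers); the elements are not counted.
print(sys.getsizeof(native(1000)))

# Size of a single Cons cell, again not counting what car and cdr point to.
print(sys.getsizeof(Cons(0, None)))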

As for timings, the Cons-based list construction for a thousand elements is about a factor of five worse than Python native lists on my test machine (if I have GC running). Creating the Cons objects appears to be very cheap and what matters for the runtime is all of the manipulation that goes on around them. Creating shorter lists is somewhat better, creating longer ones is worse.
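
For what it's worth, here is a sketch of how such timings can be taken with the standard timeit module, again assuming native() and conslst() are defined in the same file. One wrinkle is that timeit turns the garbage collector off by default, so the setup re-enables it to match the 'with GC running' numbers above.

import timeit

setup = 'import gc; gc.enable(); from __main__ import native, conslst'
# Time 10,000 constructions of a 1,000-element list with each approach.
print(timeit.timeit('native(1000)', setup=setup, number=10000))
print(timeit.timeit('conslst(1000)', setup=setup, number=10000))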

(Since I checked just to be sure, I can tell you that a version of Cons that doesn't use __slots__ runs somewhat more slowly.)

Right now some of my readers are rolling their eyes and telling me that of course the Cons version is worse and always was going to be worse; that's why everyone uses native Python lists. That is sort of the point of this exercise, but that's another entry.

Sidebar: how to make the Cons version go fast

The short answer is to use PyPy. Run under PyPy, conslst() is clearly somewhat faster than native(), even with larger lists. Both versions run drastically faster than the plain Python version (which is what you'd hope for, since both ought to be highly optimizable). Unfortunately there are plenty of environments that are not PyPy-enabled and probably won't be for some time (for example, all embedded web server environments like mod_wsgi and UWSGI).

LinkedListCost written at 01:36:18

