Wandering Thoughts

2013-06-16

Python 3 has very little benefit for ordinary Python programmers

Sometimes an incompatible transition is strongly justified. In some cases the old code and the old ways were actively dangerous to people because they were too easy to misuse (or were actually basically impossible to use safely); in other cases the baggage of the old was making it essentially impossible to add important new features that people actively wanted.

The Python 3 transition is not one of these. It was almost entirely about removing warts in the language, and here is the thing: ordinary programmers don't really care about language warts. Every language has some warts and in practice those warts rarely get in the way of doing work in the language; people work around them if necessary and often don't even notice them. Removing these warts from Python was (as far as I can tell) not required to make other progress in the language or the standard library. They were just things about the language that irritated the core Python developers.

(Hence, among other things, the comparison of Python 3 to XHTML.)

The big exception to this is also the most prominent and consequential change in Python 3, that of making strings into Unicode by default. But as Python 2's 'from __future__ import unicode_literals' demonstrates, this did not have to be an incompatible change; it could have been put in place in stages.

(In fact anything that is now covered in 'from __future__ import ...' could have been implemented in stages, just as many past Python transitions have been managed.)

Note that this is not the same thing as saying that Python 3 has not brought new and worthwhile things to Python programmers. It certainly has. But as far as I can tell the reason they are only in Python 3 is a choice on the part of the Python developers, not a requirement.

(This idea is not unique to me by any means and I've touched on it in passing before, but today I want to state it explicitly.)

Python3NoBenefit written at 01:53:47

2013-06-14

The core issue in the Python 3 transition (from my view)

In response to my entry about how Python 3 has always made me kind of angry, a commentator asked an interesting question:

You're right. Python developers aren't the first people to deprecate a heavily-used product. Is there anyone who has made an incompatible transition like this whom you can point to as a good example to follow?

[...]

So, what do you do as a developer? What's a better way to shed the cruft of some bad choices while not making people angry and not having to keep adding new features to the code that has given you headaches for so long?

As I read it, this question contains a hidden assumption: that you are going to make an abrupt and thus incompatible transition. I don't think that there's any good way to do this in a language and I don't know of any languages that have managed it gracefully once they got a significant number of users. An incompatible transition by definition creates not one language but two closely related languages, possibly somewhat translatable.

(It's theoretically possible to successfully do a transition like this; what you need is a tool that mechanically rewrites old code to be new, working code. Go actually has such a thing for many language and library changes. 2to3 was not such a tool.)

Such transitions are almost always the result of choices, or really one choice, that of the developers choosing to walk away from the old code. If you refuse to do any work to have a graceful transition then of course you don't get one. This is more or less what I remember the Python developers doing, at least initially; Python 2.7 people had a few limited bones thrown to them but not really all that many. In theory I think it would have been perfectly possible to do a much more graceful transition between Python 2 and Python 3. It just would have taken more time and required adding more things to Python 2 (and probably to its standard library).

(For 'code' one should really read 'language design'. I don't think that the actual CPython code base underwent any particularly major upheavals and rewrites between Python 2.7 and Python 3 and all of the issues that the Python developers say prompted Python 3 were about historical warts in the language.)

There's more I could say on this but in the end none of it matters. The Python developers made a decision that they were not interested in doing further work on Python 2.7 and users of Python could more or less lump it. If the developers are not interested in a graceful transition, you do not get one.

Python3TransitionIssue written at 01:57:58

2013-05-11

The consequences of importing a module twice

Back when I wrote about Python's relative import problem, I mentioned that only actually importing a module once can be important due to Python's semantics. Today I feel like discussing what these are and how much they can matter.

The straightforward thing that goes wrong if you manage to import a module twice (under two different names) is that any code in the module gets run twice, not once. Modules that run active code on import assume that this code is only going to be run once; running it again may result in various sorts of malfunctions.

At one level, modules that run code on import are relatively rare because people understand it's bad form for a simple import to have big side effects. At another level, various frameworks like Django effectively run code on module import in order to handle things like setting up models and view forms and so on; it's just that this code isn't directly visible in your module because it's hiding in framework metaclasses. But this issue is a signpost to the really big thing: function and class definitions are executable statements that are run at import time. The net effect is that when you import a module a second time the new import has a completely distinct set of functions, classes, exceptions, sentinel objects, and so on. They look identical to the versions from the first import but as far as Python is concerned they are completely distinct; fred.MyCls is not the same thing as mymod.fred.MyCls.

(This is the same effect that you get when you use reload() on a module.)
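As a rough illustration of the effect, here is a small sketch for Python 3. The temporary directory layout is made up for the demo; only the names mymod, fred and MyCls come from the example above. It puts both a directory and its parent on the Python path and then imports the same file under two names.

```python
import importlib
import os
import sys
import tempfile

# Throwaway layout (an assumption for this demo): <top>/mymod/__init__.py
# and <top>/mymod/fred.py, with fred.py defining MyCls.
top = tempfile.mkdtemp()
pkg = os.path.join(top, "mymod")
os.mkdir(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "fred.py"), "w") as f:
    f.write("class MyCls(object):\n    pass\n")

# The bad step: both the parent directory and the package directory
# end up on the Python path.
sys.path[:0] = [top, pkg]
importlib.invalidate_caches()

import fred          # found through <top>/mymod on the path
import mymod.fred    # found through <top> on the path

same_module = fred is mymod.fred
same_class = fred.MyCls is mymod.fred.MyCls
crosses_over = isinstance(fred.MyCls(), mymod.fred.MyCls)

print(same_module, same_class, crosses_over)
```

All three values come out False: there are two module objects, two MyCls classes, and an instance of one class is not an instance of the other.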

However, my guess is that this generally won't matter. Most Python code uses duck typing and the two distinct classes are identical as far as that goes. Use of things like specific exceptions, sentinel values, and imported classes is probably going to be confined to the modules that directly imported the dual-imported module and thus mostly hidden from the outside world (for example, it's usually considered bad manners to leak exceptions from a module that you imported into the outside world). In many cases even the objects from the imported module are going to be significantly confined to the importing module.

(One potentially bad thing is that if the module has an internal cache of some sort, you will get two copies of the cache and thus perhaps twice the memory use.)

DualImportProblems written at 22:16:08

2013-05-07

Python's relative import problem

Back in this entry I bemoaned the fact that Python's syntax for relative imports ('from . import fred') is only valid inside modules. The reason to have it valid outside modules is fairly straightforward; it would allow you to import and run the same Python code whether or not you were doing 'import module.thing' from outside the module's directory or sitting inside the module's directory doing 'import thing'. The way things are in Python today, once you start using relative imports in your code it can only be used as a module (which has implications for it being somehow on your Python path and so on even while you're coding).

Unfortunately for me, I suspect that this restriction is not arbitrary. The problem that Python is probably worrying about is importing the same submodule twice under different names. The official Python semantics are that there is only one copy of a particular (sub)module and its module level code is run only once, even if the module is imported multiple times; imports after the first one simply return a cached reference.

(These semantics are important in a number of situations that may not be obvious, due to Python's execution model.)
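The 'run once, then return a cached reference' behaviour can be seen with a small sketch (Python 3). The module name onceonly_demo and the ONCEONLY_RUNS environment variable are hypothetical, invented here so that the module's top-level code leaves a visible trace each time it runs.

```python
import importlib
import os
import sys
import tempfile

# A hypothetical module whose top-level code counts how often it runs.
moddir = tempfile.mkdtemp()
with open(os.path.join(moddir, "onceonly_demo.py"), "w") as f:
    f.write("import os\n"
            "count = int(os.environ.get('ONCEONLY_RUNS', '0')) + 1\n"
            "os.environ['ONCEONLY_RUNS'] = str(count)\n")

os.environ.pop("ONCEONLY_RUNS", None)
sys.path.insert(0, moddir)
importlib.invalidate_caches()

import onceonly_demo
first = onceonly_demo
import onceonly_demo   # the second import is only a lookup by name

runs = int(os.environ["ONCEONLY_RUNS"])          # how often the module code ran
cached = sys.modules["onceonly_demo"] is first   # the cache is keyed by module name
```

Here runs ends up as 1 and cached is True. The lookup key is the module's name in sys.modules, not the file it was loaded from.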

However, Python has opted to do this based on the apparent (full) module name, not based on (say) remembering the file that a particular module was loaded from and not reloading the file. When you do a relative import inside a module, Python knows the full name of the new submodule you're importing (because it knows the full, module-included name of the code doing the relative import). When you do a relative import outside a module, Python has no such knowledge but it knows that in theory this code is part of a module. This opens up the possibility of double-importing a submodule (once under its full name and once under whatever magic name you make up for a non-module relative import). Python opts to be safe and block this by refusing to do a relative import unless it can (reliably) work out the absolute name.

(There are still plenty of ways to import a module twice but they all require you to actively do something bad, like add both a directory and one of its subdirectories to your Python path. Sadly this is quite easy because Python will automatically add things to the Python path for you under some common circumstances.)
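The refusal itself is easy to provoke. The sketch below runs a relative import in a namespace that has a name but no package information, which is roughly the position of code run from inside the module's directory. The exact exception type has varied between Python versions, so the sketch catches several rather than assuming one.

```python
def try_relative_import():
    # Pretend to be top-level code that is not part of any package:
    # the namespace has a __name__ but no package information.
    namespace = {"__name__": "__main__"}
    try:
        exec("from . import fred", namespace)
    except (ImportError, ValueError, SystemError) as e:
        # Which of these is raised depends on the Python version.
        return type(e).__name__, str(e)
    return None

result = try_relative_import()
print(result)
```

Python never gets as far as looking for a fred; it gives up as soon as it cannot work out what '.' refers to.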

RelativeImportProblem written at 00:54:18

2013-04-28

My sysadmin view of Python virtualenvs

It all started with a tweet from Matt Simmons:

Dear #Python devs: I'm reading this: (link) - How are virtualenvs not a security nightmare?

There are certainly many things that can go wrong with virtualenvs, but there are also many things that can go wrong with servers and OS packages (as I tweeted, you can have an obscure one-off server just as easily as you can have an obscure one-off virtualenv). My views on this are that there are both drawbacks and advantages to virtualenvs and to lesser solutions (like installing your own copies of packages outside of the system Python area).

There are three drawbacks of virtualenvs and similar setups. First and foremost, you (the person building the virtualenv) have just become not a sysadmin but an OS distribution vendor in that it is now your job to track security issues and bugs in everything in use in the virtualenv, from the version of Python on up. If you are not plugged into all of these, Matt Simmons is correct and your virtualenv may be a ticking time bomb of security issues.

The second drawback is common to anything that installs packages outside of the standard packaging system; it is the lack of system-wide visibility into what packages (and what versions of them) are installed and in use on the system. If someone hears that there is an important issue with version X of package Y, having a horde of virtualenvs means that there is no simple way to answer the question of 'are we running that?' A related issue is that you can't just update everyone at once by installing a system package update.

(It follows from these two issues that developers absolutely cannot just bundle up a virtualenv, throw it over the wall to operations, and then forget about it. If you do that you're begging for bad problems down the line.)

The final issue is that if you depend on virtualenvs you may run into problems integrating your software into environments that basically must use the system version of Python. One example is if you develop in a virtualenv and then decide that you want to deploy with Apache's mod_wsgi (perhaps because it is unexpectedly good). Presumably if you start down the virtualenv path you've already thought about this.

Set against this are two significant advantages. The first advantage is that you get the version of everything that you want without having to fight against the system package management system (which leads to serious problems). This is especially useful if you're using one of the OS distributions with long term support, which in practice means that they have obsolete versions of pretty much everything. The second advantage is that you are not at risk of a package update from your OS distribution blowing up your applications. How much of a real risk you consider this depends on how much trust you place in your OS distribution vendor and what sort of changes they tend to make. Some OSes will happily do major package version changes as the 'simplest' way to fix security issues (or just because a new major version came out and should be compatible); some are much more conservative. With virtualenvs you're isolated from this and you can also take a selective, per-application approach to updates, where some applications are okay with the new version (or are sufficiently unimportant that you'll take the risk) and other applications need to be handled very carefully with a lot of testing.

(I haven't used a full-blown virtualenv, but our single Django app uses a private version of Django because the version of Ubuntu LTS we originally deployed it on had a too-old system version. And yes, tracking Django security updates and so on is kind of a pain.)

SysadminVirtualenvView written at 00:11:07

2013-04-14

Python's data structures problem

Python has a problem with data structures. Well, actually it has two, which are closely related to each other.

The first problem is what I illustrated yesterday, namely that there is generally no point in building nice sophisticated data structures in your Python code because they won't perform very well. Priority heaps, linked lists, all sorts of trees (B+, red-black, splay, etc), they're all nifty and nice but generally there's no point in even trying. Unless things are relatively unusual in your problem you'll be just as fast (if not faster) and write less code by just using and abusing Python's native data types. So what if dictionaries and lists (and a few other things that have been implemented at the C level) aren't quite an exact fit for your problem? They're the best you're going to get.

(I've written about this before, although that was more the general version instead of a Python-focused one.)
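One common instance of 'using and abusing native types' is keeping a priority queue in a plain list with the standard library's heapq module (which CPython accelerates in C), instead of writing a heap class in Python. A minimal sketch, with made-up queue contents:

```python
import heapq

# A priority queue kept in a plain list; heapq maintains the heap ordering.
queue = []
heapq.heappush(queue, (3, "low priority"))
heapq.heappush(queue, (1, "urgent"))
heapq.heappush(queue, (2, "normal"))

order = []
while queue:
    priority, item = heapq.heappop(queue)   # always pops the smallest tuple
    order.append(item)
```

The entries come back out as urgent, normal, low priority. The list of tuples is not an exact fit for a priority queue, but it is the sort of structure that is hard to beat from pure Python code under CPython.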

In theory it might make sense to implement your own data structures anyways because they can efficiently support unusual operations that are important to you. In practice my impression is that the performance difference is generally assumed to be large enough that people don't bother doing this unless simpler and more brute force versions are clearly inadequate.

The second problem is that this isn't really true. Data structures implemented in Python code under CPython are slow but other Python implementations can and do make them fast, sometimes even faster than a similar data structure kludged together with native types. But almost everyone writes for CPython and so they're not going to create these alternate data structures that (eg) PyPy could make fast. In fact sometimes they may kludge up data structures that PyPy et al have a very hard time making go fast; they're fine for CPython but pessimal for other environments.

My view is that this matters if we want Python to ever get fast, because getting fast is going to call for data structures that are amenable to optimization instead of what I've seen called hash-map addiction. But I have no (feasible) ideas for how to fix things and I'm pretty sure that asking people to code data structures in CPython isn't feasible until there's a benefit to it even in CPython.

(This is in part a slow reaction to Jason Moiron's What's Going On.)

PythonDataStructuresProblem written at 01:55:35

2013-04-13

Classic linked lists versus Python's list (array) type

For reasons beyond the margins of this entry, let's consider a classic linked list implemented in Python. Because I feel like a traditionalist today we'll build it out of Lisp-style cons cells, using about the most minimal and lightweight implementation we can do:

class Cons(object):
    __slots__ = ('car', 'cdr')
    def __init__(self, car, cdr):
        self.car = car
        self.cdr = cdr
    def __str__(self):
        return '(%s, %s)' % (self.car, self.cdr)

Now let's ask a question: how does the memory use and performance of this compare to just using a Python list (which is not a linked list but instead an array)? I'm going to look purely at building a 1,000 element list element-by-element and I'm going to allow each implementation to append in whatever order is fastest for it. The code:

from itertools import repeat
def native(size):
    l = []
    for _ in repeat(None, size):
        l.append(0)
    return l

def conslst(size):
    # Start from the empty list (None) so the result has exactly 'size' cells.
    h = None
    for _ in repeat(None, size):
        h = Cons(0, h)
    return h

(itertools.repeat is the fastest way to do this loop.)

On a 32-bit machine the 1,000 element native list takes 4,512 bytes. A Cons cell takes 28 bytes (not counting the size of what it points to for either car or cdr) and so 1,000 of them takes 28,000 bytes, a factor of six or so worse than the list.
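One way to get at sizes like these under CPython is sys.getsizeof(); whether the figures above were measured this way is an assumption. The sketch below repeats the Cons definition so that it stands alone. The byte counts it produces depend on the Python version and on whether the build is 32-bit or 64-bit, so they will not match the figures above exactly.

```python
import sys

class Cons(object):
    __slots__ = ('car', 'cdr')
    def __init__(self, car, cdr):
        self.car = car
        self.cdr = cdr

size = 1000

# Build the native list by appending, as in native().
lst = []
for _ in range(size):
    lst.append(0)

# getsizeof() reports the object itself, not what it points to.
list_bytes = sys.getsizeof(lst)
cell_bytes = sys.getsizeof(Cons(0, None))
cons_bytes = cell_bytes * size
ratio = float(cons_bytes) / list_bytes

print(list_bytes, cell_bytes, cons_bytes, ratio)
```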

As for timings, the Cons-based list construction for a thousand elements is about a factor of five worse than Python native lists on my test machine (if I have GC running). Creating the Cons objects appears to be very cheap and what matters for the runtime is all of the manipulation that goes on around them. Creating shorter lists is somewhat better, creating longer ones is worse.

(Since I checked just to be sure, I can tell you that a version of Cons that doesn't use __slots__ runs somewhat more slowly.)
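A timing comparison along these lines can be sketched with the standard timeit module. This is an illustrative harness, not the one used for the numbers above. Note that timeit turns garbage collection off while timing unless told otherwise, so the sketch re-enables it through the setup argument. The definitions are repeated so the block stands alone, and the repeat count of 200 is arbitrary.

```python
import gc
import timeit
from itertools import repeat

class Cons(object):
    __slots__ = ('car', 'cdr')
    def __init__(self, car, cdr):
        self.car = car
        self.cdr = cdr

def native(size):
    l = []
    for _ in repeat(None, size):
        l.append(0)
    return l

def conslst(size):
    h = None   # the empty list
    for _ in repeat(None, size):
        h = Cons(0, h)
    return h

runs = 200
# setup=gc.enable re-enables the collector, which timeit disables by default.
native_time = timeit.timeit(lambda: native(1000), setup=gc.enable, number=runs)
cons_time = timeit.timeit(lambda: conslst(1000), setup=gc.enable, number=runs)

print("native: %.4fs  cons: %.4fs" % (native_time, cons_time))
```

The absolute numbers mean little; what is interesting is the ratio between the two and how it shifts as the list size changes.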

Right now some of my readers are rolling their eyes and telling me that of course the Cons version is worse and always was going to be worse; that's why everyone uses native Python lists. That is sort of the point of this exercise, but that's another entry.

Sidebar: how to make the Cons version go fast

The short answer is to use PyPy. Run under PyPy, conslst() is clearly somewhat faster than native(), even with larger lists. Both versions run drastically faster than the plain Python version (which is what you'd hope for, since both ought to be highly optimizable). Unfortunately there are plenty of environments that are not PyPy-enabled and probably won't be for some time (for example, all embedded web server environments like mod_wsgi and UWSGI).

LinkedListCost written at 01:36:18

2013-03-22

The problem with trying to make everything into a Python module

One of the reasons for Django's unpleasant project restructuring is that they want your website directory (ie the directory that your project sits in) to be a module that can be imported. This in fact seems to be somewhat of a general trend; all sorts of things rather want you to have not just a collection of files in a directory but an actual module. I wish they'd stop. Modules are not the be all and end all in Python, at least not as currently implemented, and not everything needs or wants to be a module.

The general reason for making things into modules is namespaces for imports. If you're sitting in your project's directory and do 'import fred', in theory this is ambiguous; you might mean your fred.py or you might mean some global fred module installed in Python. The absolute form of 'import mystuff.fred' is more or less unambiguous.

(This preference for modules also goes with the fact that the relative import syntax, 'from . import fred', is only valid in an actual module. I think that this is a terrible mistake, but no one asked me for my opinion.)

I have no problem with modules as such. The problem I have is how you get a directory to be a module, namely that you add the directory's parent to the Python search path (in one of a number of ways), and then the directory becomes a module (or technically I think a package) called its directory name. This is bad in at least two ways. It tightly couples together the directory name and the module name and it also makes everything else in the directory's parent available as a potential module. What both of these have in common is undesired name collisions. For example, you cannot be working on two versions of a 'fred' module that are sitting in a directory as, say, src/fred-1 and src/fred-2, not unless you want to have a src/fred symlink that you keep changing back and forth.

(The natural structure seems to be to isolate each module in its own artificial parent directory (eg src/fred-1/fred) or to ignore the whole issue, put everything in src/, and assume you will never have any collisions or be developing a new version of fred that you don't want src/bob getting when it does an 'import fred'.)

What would make this situation okay is a simple way to tell Python 'directory X is module Y', where 'X' might be '.' (the current directory). This should be available both on the Python command line and from inside Python code. Sadly I don't expect this to arrive any time soon.
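For what it's worth, the importlib machinery in later Python 3 versions allows something close to 'directory X is module Y' from inside Python code, although not as a simple command line switch. The sketch below assumes the directory has an __init__.py. load_dir_as_package() is a hypothetical helper name, and the fred-1 directory with its util.py is made up for the demo.

```python
import importlib.util
import os
import sys
import tempfile

def load_dir_as_package(directory, name):
    """Load 'directory' as a package called 'name', whatever the directory is called."""
    init = os.path.join(directory, "__init__.py")
    spec = importlib.util.spec_from_file_location(
        name, init, submodule_search_locations=[directory])
    module = importlib.util.module_from_spec(spec)
    # Register before executing so that submodule imports can find the parent.
    sys.modules[name] = module
    spec.loader.exec_module(module)
    return module

# Demo layout: <src>/fred-1/__init__.py and <src>/fred-1/util.py
src = tempfile.mkdtemp()
fred_dir = os.path.join(src, "fred-1")
os.mkdir(fred_dir)
with open(os.path.join(fred_dir, "__init__.py"), "w") as f:
    f.write("VERSION = 1\n")
with open(os.path.join(fred_dir, "util.py"), "w") as f:
    f.write("def hello():\n    return 'hello from fred-1'\n")

fred = load_dir_as_package(fred_dir, "fred")
import fred.util   # resolved through the package's search path, not sys.path

print(fred.VERSION, fred.util.hello())
```

Nothing is added to the Python search path here, so nothing else in src/ becomes importable as a side effect and the directory name is decoupled from the module name.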

(This stuff irritates me for reasons that are hard to pin down. Partly it just feels wrong (eg '/src' or wherever isn't a directory of modules, so why am I telling Python that it is?).)

EverythingModuleProblem written at 00:18:26

2013-03-17

Argument validation using functions

There's a pattern (or perhaps an anti-pattern) that I keep inventing in my programs. I start out with a bunch of commands (or macros or template text renderers or the like) that can take arguments (registered somehow), and I have all of the functions do their own argument count validation. But this is repetitive, so I start having the central dispatching code do some checks on the argument count. But there are always special cases (one command might take exactly N arguments, another takes M to N, another takes at least N but maybe more, and so on), so pretty soon I start trying to encode all of this in increasingly baroque special meanings for various sorts of argument counts ('if it's negative, it means...').

In thinking about this recently (as part of some DWiki changes I'm thinking about) I've realized another approach, hopefully a better one. Instead of trying yet another crazy encoding scheme, I can use functions to validate the argument count. Instead of registering the argument count, register a function that validates the argument count. These functions (or callable objects) will of course be created by argument count validation factories, so I will write code like:

register("fred", fredfunc, noMoreThan(3))
register("brad", bradfunc, betweenCnt(2, 4))
register("barney", barnfunc, anyOf(0, 1, 3, 5))

The great attraction of this approach to me is that it completely decentralizes the encoding scheme for argument validation (and thus the complexity of argument validation entirely). The central dispatch function simply calls the validation function and doesn't care any further; all of the huge variety of possible arguments necessary is delegated to the code that creates any particular validation function. I can have any sort of validation ranging from very generic to completely custom, whatever makes the most sense, and none of the complexity of that shows up outside of code that actually uses it.
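A minimal sketch of what the factories and the central dispatcher might look like follows. The factory names come from the register() calls above; the bodies, the _commands table, dispatch(), and the choice of ValueError are all assumptions made for illustration.

```python
def noMoreThan(n):
    def check(args):
        return len(args) <= n
    return check

def betweenCnt(lo, hi):
    def check(args):
        return lo <= len(args) <= hi
    return check

def anyOf(*counts):
    allowed = frozenset(counts)
    def check(args):
        return len(args) in allowed
    return check

_commands = {}

def register(name, func, validator):
    _commands[name] = (func, validator)

def dispatch(name, args):
    # The dispatcher knows nothing about how counts are encoded;
    # it just asks the validator for a yes or no.
    func, validator = _commands[name]
    if not validator(args):
        raise ValueError("bad argument count for %s: %d" % (name, len(args)))
    return func(*args)

# Hypothetical command used only for this demo.
register("brad", lambda *args: len(args), betweenCnt(2, 4))
result = dispatch("brad", ["a", "b", "c"])
```

All that dispatch() needs from a validator is that it be callable on the argument list, so a one-off lambda or a fully custom function works just as well as anything built by a factory.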

This is also completely expandable. New forms of argument validation just need new functions, they don't need any changes in the central dispatch system to understand and handle yet another special case. This is an attractive property for me since I never know just what sort of arguments I'm going to need until I actually write a particular command (or whatever) handler.

Obviously, this can be extended to also validate various properties of the arguments (for example, you might know that the first argument of a particular command has to be a file). When you reach this sort of extended argument validation I start to think that you want something like an ArgValidator class which you instantiate and then start adding restrictions to (otherwise you have a rapidly exploding number of combinations of various options; basically you want some way of easily composing separate restrictions together instead of having to hard code them).
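One possible shape for such a class is sketched below. Only the name ArgValidator comes from the text; the method names count_between() and arg_matches() are invented here. Each method adds one independent restriction and returns the validator so that calls can be chained, and the instance is itself callable so it can be registered like any other validation function.

```python
import os
import tempfile

class ArgValidator(object):
    """Collects independent restrictions; an argument list passes if all of them hold."""
    def __init__(self):
        self._checks = []

    def count_between(self, lo, hi):
        self._checks.append(lambda args: lo <= len(args) <= hi)
        return self   # returning self allows chaining

    def arg_matches(self, index, predicate):
        # An absent argument passes here; count restrictions cover absence.
        self._checks.append(
            lambda args: index >= len(args) or predicate(args[index]))
        return self

    def __call__(self, args):
        return all(check(args) for check in self._checks)

# One or two arguments, the first of which must be an existing file.
validate = ArgValidator().count_between(1, 2).arg_matches(0, os.path.isfile)

with tempfile.NamedTemporaryFile() as tmp:
    ok = validate([tmp.name])
    too_many = validate([tmp.name, "x", "y"])
not_a_file = validate(["/no/such/file/here"])
```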

ArgCheckingViaFunctions written at 01:05:48

2013-02-28

A decorator for decorators that accept additional arguments

Once you're happy with functions returning other functions (technically closures), basic ordinary decorators in Python are easy enough to use, write, and understand. You get things like:

from functools import wraps
def trace(func):
    @wraps(func)
    def _d(*args, **kwargs):
        print "start", func.__name__
        try:
            return func(*args, **kwargs)
        finally:
            print "end", func.__name__
    return _d

@trace
def jim(a, b):
    print "in jim"
    return a + b

(If you're decorating functions that all have the same argument signature, you can make the _d() closure take the specific arguments and pass them on instead of using the *args and **kwargs approach.)

But sometimes you have decorators that want to take additional arguments (over and above the function that they get handed to decorate). The syntax for doing this when you're declaring functions that will get decorated is easy and obvious:

@tracer("barney")
def fred(a, b, c):
    print "in fred:", a
    return b - c

Unfortunately the niceness stops there; an implementation of tracer() is much more complicated than trace(). Because of how Python has defined things, tracer() is no longer a decorator but a decorator factory, something that when called creates and returns the decorator that will actually be applied to fred(). You wind up with functions returning functions that return functions, or with tracer() actually being a class (so that calling it creates an instance that when called will actually do the decorating).
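For comparison, the conventional three-level version of tracer() looks something like the sketch below. It uses print() calls with a single argument so that it runs under both Python 2 and Python 3; the inner function names are arbitrary.

```python
from functools import wraps

def tracer(note):                  # called with the extra arguments
    def decorator(func):           # called with the function to decorate
        @wraps(func)
        def _d(*args, **kwargs):   # called in place of the decorated function
            print("-> %s: start %s" % (note, func.__name__))
            try:
                return func(*args, **kwargs)
            finally:
                print("-> %s: end %s" % (note, func.__name__))
        return _d
    return decorator

@tracer("barney")
def fred(a, b, c):
    return b - c

result = fred(1, 5, 3)
```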

What we would like is for tracer() to be a regular decorator that just has extra arguments. Well, we can do that; all we need is another decorator that we will use on tracer() itself. Like so:

from functools import partial, wraps
def decor(decorator):
    @wraps(decorator)
    def _dd(*args, **kwargs):
        return partial(decorator, *args, **kwargs)
    return _dd

@decor
def tracer(note, func):
    fname = func.__name__
    @wraps(func)
    def _d(*args, **kwargs):
        print "-> %s: start %s" % (note, fname)
        try:
            return func(*args, **kwargs)
        finally:
            print "-> %s: end %s" % (note, fname)
    return _d

(You can make tracer()'s arguments have the function to be decorated first, but then you have to do more work because you can't use functools.partial(). While I think that func belongs as the first argument, I don't quite feel strongly enough to give up partial().)

The one nit with this is that positional arguments really are positional and keyword arguments really are keywords. You can't, for example, write:

@tracer(note="greenlet")
def fred(....):
    ....

(The only way around this is changing the order of arguments to tracer() so that func is first, which means giving up the convenience of just using partial().)
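A sketch of that func-first variant follows, with a small closure standing in for partial(). The name decor_kw is made up here to keep it distinct from decor() above.

```python
from functools import wraps

def decor_kw(decorator):
    @wraps(decorator)
    def _dd(*args, **kwargs):
        def _apply(func):
            # func goes first; the extra arguments follow as given.
            return decorator(func, *args, **kwargs)
        return _apply
    return _dd

@decor_kw
def tracer(func, note):
    fname = func.__name__
    @wraps(func)
    def _d(*args, **kwargs):
        print("-> %s: start %s" % (note, fname))
        try:
            return func(*args, **kwargs)
        finally:
            print("-> %s: end %s" % (note, fname))
    return _d

@tracer(note="greenlet")
def fred(a, b, c):
    return b - c

@tracer("barney")
def wilma(a, b):
    return a + b

results = (fred(1, 5, 3), wilma(2, 3))
```

Both the keyword and the positional forms work here, because func is always supplied as the first positional argument and so never collides with a keyword meant for note.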

I've been thinking about decorators lately (they're probably the right solution for a code structure problem) and had this issue come up in my tentative design, so I felt like writing down my solution for later use. I'm sure that regular users of decorators already know all of these tricks.

(Decorators are one of the Python things I should use more. I don't for complex reasons that involve my history with significant Python coding.)

DecoratorDecorator written at 23:25:57

These are my WanderingThoughts
(About the blog)

This is part of CSpace, and is written by ChrisSiebenmann.
Twitter: @thatcks

This is a DWiki.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.