2013-04-14
Python's data structures problem
Python has a problem with data structures. Well, actually it has two, which are closely related to each other.
The first problem is what I illustrated yesterday, namely that there is generally no point in building nice sophisticated data structures in your Python code because they won't perform very well. Priority heaps, linked lists, all sorts of trees (B+, red-black, splay, etc), they're all nifty and nice but generally there's no point in even trying. Unless things are relatively unusual in your problem you'll be just as fast (if not faster) and write less code by just using and abusing Python's native data types. So what if dictionaries and lists (and a few other things that have been implemented at the C level) aren't quite an exact fit for your problem? They're the best you're going to get.
(I've written about this before, although that was more the general version instead of a Python-focused one.)
In theory it might sense to implement your own data structures anyways because they can efficiently support unusual operations that are important to you. In practice my impression is that the performance difference is generally assumed to be large enough that people don't bother doing this unless simpler and more brute force versions are clearly inadequate.
The second problem is that this isn't really true. Data structures implemented in Python code under CPython are slow but other Python implementations can and do make them fast, sometimes even faster than a similar data structure kludged together with native types. But almost everyone writes for CPython and so they're not going to create these alternate data structures that (eg) PyPy could make fast. In fact sometimes they may kludge up data structures that PyPy et al have a very hard time making go fast; they're fine for CPython but pessimal for other environments.
My view is that this matters if we want Python to ever get fast, because getting fast is going to call for data structures that are amenable to optimization instead of what I've seen called hash-map addiction. But I have no (feasible) ideas for how to fix things and I'm pretty sure that asking people to code data structures in CPython isn't feasible until there's a benefit to it even in CPython.
(This is in part a slow reaction to Jason Moiron's What's Going On.)