Why you should support specifying comparison keys (and functions)

January 31, 2013

I've recently been taking another pass at optimizing DWiki's code a bit. In the process of profiling things, I noticed some code that was doing linear searches through an ordered list (for reasons that made sense at the time). What my code was doing just cried out for a better bisection-based approach, and it looked like I was in luck; not only does Python have a bisect module in the standard library but CPython actually has a C-level implementation for the best efficiency you could ask for.

Unfortunately I can't use it. You see, the people who created this module hard-coded everything to use direct comparisons so the code is full of expressions like 'if x < a[mid]: ...', and what I want to bisect through is not comparable this way. Like many people, I have a list where each element is itself a tuple and I want to search through the list based on one field of those tuples.

The bisect module has two failures here. The direct failure is that it missed a fundamental rule of general Python code: any time you write a searching or sorting routine, you should support specifying the key's field. Ideally you'd support a general comparison function as well. The .sort() method on lists is a good example (although Python 3 drops the comparison function).

The indirect failure is that this code has misunderstood what the fundamental data structures of Python are. The bisect module would sort of make sense if the fundamental data structure was an object; then your bisectable lists would be made up of objects and your objects would have comparisons defined on them and everything would be fine. But this is not the case. The fundamental data structures in Python are the primitive types; lists, tuples, and dictionaries. Almost no one bothers to make objects where one of these three will do. Any generic code that deals with ordered comparisons (searching, sorting, inserting in order in some data structure like a B-tree, and so on) should expect to be routinely handed these primitive types and for people want to order things based on a particular field in them; failing to handle this very common situation is a failure in your code.

(Well, this is half the reason to support specifying the key's field. The other half is that it's perfectly reasonable to want to order the same objects in several different ways depending on what you're doing with them at the moment.)

Sidebar: getting around the bisect module with a club

Suppose that you are very irritated with the bisect module's limitation here and are willing to commit evil hacks. Then here is basic code for something that I will call a virtual Schwartzian transformation:

class VirtualList(object):
    def __init__(self, lst, key):
        self.lst = lst
        self.key = key

    def __len__(self):
        return len(self.lst)

    def __getitem__(self, idx):
        return self.lst[idx][self.key]

Take your list of whatever-it-is you have, make a virtual list with 'vlst = VirtualList(lst, KEY)', and then bisect vlst; the index position you get back is valid for lst. KEY does not have to be a number; it works perfectly fine to have a list of dictionaries (or more complex objects). This only really makes sense for bisection, binary searching, and so on, where you can have a large list but should only ever look at a few elements from it.

You can extend this approach to support custom comparison functions by using a dynamic class and materializing instances on demand, but I'll leave that as an exercise for the sufficiently interested and perverse reader.

Written on 31 January 2013.
« Thinking about FreeBSD versus Illumos for our ZFS fileservers
The shifting SBL, as experienced here »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Thu Jan 31 02:19:02 2013
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.