Why you should support specifying comparison keys (and functions)
I've recently been taking another pass at optimizing DWiki's code a bit. In the process of profiling things, I noticed some code that was doing linear searches through an ordered list (for reasons that made sense at the time). What my code was doing just cried out for a better bisection-based approach, and it looked like I was in luck; not only does Python have a bisect module in the standard library but CPython actually has a C-level implementation for the best efficiency you could ask for.
Unfortunately I can't use it. You see, the people who created this
module hard-coded everything to use direct comparisons so the code is
full of expressions like 'if x < a[mid]: ...
', and what I want to
bisect through is not comparable this way. Like many people, I have a
list where each element is itself a tuple and I want to search through
the list based on one field of those tuples.
The bisect module has two failures here. The direct failure is that it
missed a fundamental rule of general Python code: any time you write
a searching or sorting routine, you should support specifying the key's
field. Ideally you'd support a general comparison function as well.
The .sort()
method on lists is a good example (although Python 3
drops the comparison function).
The indirect failure is that this code has misunderstood what the fundamental data structures of Python are. The bisect module would sort of make sense if the fundamental data structure was an object; then your bisectable lists would be made up of objects and your objects would have comparisons defined on them and everything would be fine. But this is not the case. The fundamental data structures in Python are the primitive types; lists, tuples, and dictionaries. Almost no one bothers to make objects where one of these three will do. Any generic code that deals with ordered comparisons (searching, sorting, inserting in order in some data structure like a B-tree, and so on) should expect to be routinely handed these primitive types and for people want to order things based on a particular field in them; failing to handle this very common situation is a failure in your code.
(Well, this is half the reason to support specifying the key's field. The other half is that it's perfectly reasonable to want to order the same objects in several different ways depending on what you're doing with them at the moment.)
Sidebar: getting around the bisect module with a club
Suppose that you are very irritated with the bisect module's limitation here and are willing to commit evil hacks. Then here is basic code for something that I will call a virtual Schwartzian transformation:
class VirtualList(object): def __init__(self, lst, key): self.lst = lst self.key = key def __len__(self): return len(self.lst) def __getitem__(self, idx): return self.lst[idx][self.key]
Take your list of whatever-it-is you have, make a virtual list with
'vlst = VirtualList(lst, KEY)
', and then bisect vlst
; the index
position you get back is valid for lst
. KEY does not have to be a
number; it works perfectly fine to have a list of dictionaries (or more
complex objects). This only really makes sense for bisection, binary
searching, and so on, where you can have a large list but should only
ever look at a few elements from it.
You can extend this approach to support custom comparison functions by using a dynamic class and materializing instances on demand, but I'll leave that as an exercise for the sufficiently interested and perverse reader.
|
|