Using constant Python hash functions for fun and no real profit

December 23, 2020

In one of the examples of wtfpython, the author uses a constant __hash__ function in order to make a version of plain dicts and ordered dicts that can be put in a set. When I saw this, I had some reactions.

My first reaction was to wonder if this was safe. With a lot of qualifications, the answer is yes. Two important qualities of a __hash__ function are that it always return the same result for a given object and that it returns the same hash for any two objects that potentially compare the same (see also understanding hashing in Python). Returning a constant (here '0') makes both trivially true, provided that your objects cannot be equal to anything other than other instances of your class (or classes). Returning a constant hash for instances that aren't going to compare equal is safe, as object hashes don't have to be unique.

(This doesn't mean that you can safely mutate instances of your classes in ways that affect their equality comparison. Doing so is a great way to get two copies of the same key in a dict or a set, which is likely to be bad.)

My second reaction was to wonder if this was useful, and I think the answer is generally not really. The problem with a constant hash function is that it's going to guarantee dictionary key collisions for any such objects that you add to the dict or set. If you put very many objects with the same key into a dict (or a set), checking for a given key turns into doing an equality check on all of the other keys you've already added. Adding an entry, getting an entry, checking whether an entry is there, whatever, they all become a linear search.

If you don't have very many objects in total in a dict this is probably okay. A linear search through ten or twenty objects is not terrible (hopefully the equality check itself is efficient). Even a linear search through a hundred might be tolerable if it's important enough. But after a certain point you're going to see visible and significant slowdowns, and it would be more honest to use a list instead of a dict or set (since you're effectively getting the performance of a list).

If you need to do better, you probably want to go all of the way to implementing some sort of proper hash function that implements the rules of hashing in Python. If you're willing to live daringly, you don't have to make your objects literally immutable once created, you just have to never mutate them while they're in a dict or a set.


Comments on this page:

By nknight at 2020-12-24 18:36:40:

But after a certain point you're going to see visible and significant slowdowns, and it would be more honest to use a list instead of a dict or set (since you're effectively getting the performance of a list).

I'd think far worse, practically speaking. hash+deref is strictly more work than a standalone dereference, and dict inevitably has additional memory overhead.

I can't really think of any case where this has material advantages over saner alternatives. Desperately needing a dict in a set is unusual enough that I'd rather just create an immutable dict-like class specific to the need with a real hashing strategy and proper documentation. It's no great feat, and whoever comes after me would be less likely to accidentally blow their foot off maintaining my code.

Written on 23 December 2020.
« The legibility of different versions of ZFS
In CPython, types implemented in C actually are part of the type tree »

Page tools: View Source, View Normal, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Wed Dec 23 00:27:47 2020
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.