Understanding hashing in Python
I've talked about the __hash__ special method a fair bit in past entries, but today it's time to actually write enough background so that people can understand why you might want to write one, what you need to do, and what the whole story is.
The first and biggest question is simple: why is it interesting to give objects what I've called a 'hash identity'?
The motivation for hash identities is what I'll call the dictionary key problem. Ignoring the actual implementation of dictionaries for a moment, the core operation in dictionaries is looking up an entry based on a key. If the only thing dictionaries could do with keys was an equality check (is the user-supplied key the same as the key for this entry), such lookups would be extremely inefficient; for example, you would have to do an equality check on every key in the dictionary to find out that you didn't have an entry for a particular key. Hash identities are the answer. Instead of checking for equality with every entry's key, dictionaries only have to check the keys with the same hash identity. If things work out (ie, there are only a few keys in the dictionary with the same hash identity), this will be a lot fewer entries and thus the lookup will run much faster.
(Fast lookups are useful for other data structures in addition to dictionaries. For example, sets use hash identities for this purpose, although I don't know if it's explicitly documented.)
That's nicely theoretical, but there's more. Python dictionaries are hash tables, which means that they use hash identities for more than speeding up key lookups; hash identities are how the internal data structures are organized. Hash identities are literally how dictionaries find entries. This means that hash identities for keys in a dictionary can never change. If they do change, the dictionary probably can't find the entry any more. This is one of the few ways of making a Python dictionary malfunction.
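As a concrete (and deliberately broken) illustration, here is a hypothetical class whose hash identity follows a mutable attribute. Mutating a key after it goes into a dictionary makes the dictionary lose track of the entry, exactly as described above:

```python
# A deliberately bad class: its hash identity follows a mutable field.
# (Hypothetical example; real code should never do this.)
class BadKey(object):
    def __init__(self, val):
        self.val = val

    def __eq__(self, other):
        return isinstance(other, BadKey) and self.val == other.val

    def __hash__(self):
        return hash(self.val)

k = BadKey(1)
dct = {k: "found"}
k.val = 2                  # the key's hash identity just changed
print(k in dct)            # False: the dict probes the wrong slot
print(list(dct.items()))   # yet the entry is still in there
```

The entry hasn't vanished; the dictionary simply looks for it where hash(k) now points, which is not where it was stored.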
(If this happens, the only way to get the item back is with the
.items() method. You can't use the key to delete or retrieve the item,
even if you get the key from the dictionary with .keys().)
This leads to the two rules of hash identities:

1. hash(a) should never change.
2. if a and b should be considered the same item, a == b and hash(a) == hash(b) should both be true.

(The simplified but often accurate version of this is that a == b should usually mean that hash(a) == hash(b).)
As covered earlier, plain equality comparisons are kind of a minefield for hash identities. This is because there are really three sorts of objects: unhashable objects, objects with the default hash identity, and objects with custom hash identities.
Unhashable objects have a __hash__ that is None:

    class A(object):
        __hash__ = None
Such objects can't be used as keys in dictionaries, put into sets, or anything else that wants to use hash identities; if you try, you'll get an exception about it being unhashable. Because they're unhashable, these objects can have whatever equality comparison they want to.
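As a small sketch, trying to use such an object as a dictionary key fails immediately with a TypeError:

```python
class A(object):
    __hash__ = None  # explicitly unhashable

a = A()
try:
    dct = {a: 1}     # attempting to use a as a dictionary key
except TypeError as e:
    print("unhashable:", e)
```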
(It looks like defining some of the comparison methods but not a
__hash__ can also make your objects unhashable, but I'm not
clear about the exact circumstances here. The documentation
claims that defining __cmp__ but not __hash__
is enough to make your objects unhashable, but this is not what I
see in Python 2.7.2.)
Objects with the default hash identity take it from
their unique (current) memory address. If you're using only such objects
as keys in a dictionary, it's guaranteed that no two keys will have the
same hash identity and so it doesn't matter if two keys would compare
equal to each other (the dictionary will never make that equality
check). It is thus generally safe to make such objects have 'broad'
equality comparisons, comparisons that are willing to say that the
object is equal to, say, a number or a string.
(Even in non-CPython implementations, Python is going to guarantee that an object's default hash identity doesn't change over the lifetime of your program. Doing otherwise breaks too many things.)
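You can see this directly in CPython, where the default hash identity is derived from the object's id() (an implementation detail, not a language guarantee): two distinct live objects always get distinct hash identities.

```python
a = object()
b = object()
# Distinct live objects have distinct ids, and in CPython the default
# hash is derived from the id, so the hashes differ too.
print(hash(a) == hash(b))  # False in CPython
print(hash(a) == hash(a))  # always True: a hash identity never changes
```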
Of course the problem with objects with the default hash identity is that they often aren't really useful as dictionary keys. For useful dictionary keys, we usually want this to work:
    a = AnObject(1, 2)
    dct[a] = 10
    b = AnObject(1, 2)
    assert(a is not b and a == b)
    print dct[b]
To make this work,
AnObject needs a custom hash identity so that a and
b hash to the same value. A lot of normal Python types have custom
hash identities; for example, numbers and strings of all sorts have
custom hash identities (numbers hash to themselves, for example). The
moment you use things with custom hash identities as dictionary keys,
you can have hash identity collisions. This is where you want to be very
certain that your equality comparisons really mean 'a is the same thing
as b', which means that you should not have an equality comparison that
is willing to declare your object equal to an object of another class
(including to numbers, strings, and so on).
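To make the earlier snippet work, AnObject might be sketched like this (the field names are assumptions; shown in modern print() syntax). The important points are that __eq__ and __hash__ use the same fields, and that __eq__ refuses to declare equality with objects of other classes:

```python
class AnObject(object):
    def __init__(self, a, b):
        self._a = a
        self._b = b

    def __eq__(self, other):
        if not isinstance(other, AnObject):
            return NotImplemented  # never equal to numbers, strings, etc
        return (self._a, self._b) == (other._a, other._b)

    def __hash__(self):
        return hash((self._a, self._b))

dct = {}
a = AnObject(1, 2)
dct[a] = 10
b = AnObject(1, 2)
assert a is not b and a == b
print(dct[b])  # prints 10: found through b's identical hash identity
```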
(Note that this holds even for classes with a default hash identity, if you mix them as keys in a dictionary with keys that have a custom hash identity.)
The rule of thumb for what goes into a custom hash identity is clear:
it's whatever makes
a be the same thing as
b. Given that the
hash identity of objects can't change over their lifetime (otherwise
dictionaries explode), this implies that those parts that create
equality can't change either. This creates the usual Python rule
that objects with custom hash identities should be immutable once
they're created.
(There are a number of ways to violate this rule without having your dictionaries explode, but I think that all of them are going to result in peculiar results. So don't do that.)
When writing a custom hash function it's important to follow the usual rule for hash functions: the output values should be evenly distributed across the output range (given the expected input values, whatever they are for your objects). In Python, your hash values should be evenly distributed across at least 31-bit integer values.
Sidebar: hash identities and long integers
__hash__ is documented as returning an integer. Given the
unification between integers and long integers, you might wonder if
(C)Python really means 'a machine integer' or 'any integer, even a long
integer'. The answer is that as of Python 2.5, CPython will accept
a __hash__ that returns a long integer but silently converts
the returned value to some smaller value. Since the conversion is
unpredictable, I think that in practice you should confine the values
__hash__ returns to a safe range; today, the safest range
is probably 31-bit integers.
(Note that the conversion is not as simple as truncating your long integer to its lowest N bits, where N is 32 or 64 or whatever depending on the CPython platform.)
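One simple way to stay inside that safe range is to mask whatever you compute down to a non-negative 31-bit value before returning it. A sketch (the class is hypothetical, and this is one approach among several):

```python
class BigId(object):
    # Hypothetical class keyed by a potentially huge integer.
    def __init__(self, ident):
        self._ident = ident

    def __eq__(self, other):
        return isinstance(other, BigId) and self._ident == other._ident

    def __hash__(self):
        # Mask down to a non-negative 31-bit value so we never hand
        # the interpreter a long integer to convert unpredictably.
        return hash(self._ident) & 0x7fffffff

h = hash(BigId(2**200 + 7))
print(0 <= h < 2**31)  # True: always within the safe range
```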