2012-04-03
Python's two versions of equality, with a long digression on hash()
Python has three two versions of equality; in order of
pickiness, they are ==
(plain equality) and is
identity. When I
started writing this entry I was going to say that there were three and
the third was hash identity, but I am wrong about that. We'll talk about
hash identity at the end.
Plain equality is actually surprisingly complicated if you want it
to be. A class can control how all of the comparison operators work
with special methods (aka 'magic methods'), and it can make them
behave differently if it wants to. Or you can just define an old style
__cmp__
method and be done with it. ==
can compare different types
of objects and return useful results if desired. Two different objects
can and often do compare as equal to each other.
Actual is
identity, as in 'a is b
', tests whether the two things
are in fact the exact same object. This is not something that the
class can control using special methods; it's directly implemented
inside the interpreter (as is 'id()
'). It's legitimate and common for
'a == b
' to be true but 'a is b
' to be false. On the other hand,
sometimes it's true even if you got the objects from different places. Two different objects can never be is-identical
to each other, pretty much by definition.
What I have called 'hash identity' is the value that 'hash(obj)
'
returns, if it returns anything. Hash identity is used in dictionaries
and similar types to find things inside the dictionary, both to retrieve
them and to check to see if the key you are trying to set already
exists in the dictionary. An object's hash identity doesn't have to
be unique and in fact isn't necessarily unique, which means that you
can't directly use hash identity as an equality check (although, like
dictionaries, you can use it to see if a full equality check is worth
doing). A class can control the hash value of objects by defining a
__hash__
special method, but you need to be very careful about how
it's defined.
(I wouldn't be surprised if hash identities are also used in things like
sets in order to enable relatively efficient 'a in set
' checks and the
like.)
You might ask why hash methods are necessary; the answer is that
they enable your objects to be useful keys in dictionaries in some
circumstances. Consider a complex number class (one where the numbers
are immutable once created). Two complex number objects with the same
real and imaginary parts should compare equal, and more than that if you
do 'dct[Complex(1, -1)] = 10; print dct[Complex(1, -1)]
' you should
clearly get 10 instead of 'no such key'. In order to do this you need a
custom hash function so that all Complex instances with the same real
and imaginary values have the same hash value.
(Although it's commonly said that custom hash functions should only be defined for immutable objects, this is not quite the true requirement. You can have mutable objects with custom hash functions provided that the mutation you can do does not change either the hash function's result or the equality comparison. You may get some odd results from dictionary operations, but presumably you know what you're doing.)
Now we get into here be dragons territory: you should be very careful about combining custom hash functions with custom equality functions that are willing to do comparisons with things outside your class. The easiest way to show what can go wrong is with a simple example:
class Boom(object): def __init__(self, x): self.x = x def __eq__(self, other): if other == "YES": return True else: return other.x == self.x def __hash__(self): return self.x d = {} d[Boom(hash("YES"))] = 10 print d["YES"]
This should get a KeyError, because we sure didn't add a "YES" to the
dictionary. But because our Boom()
object both has the same hash value
as "YES" and compares equal to it the dictionary thinks that it is the
same key and says 'sure, here you go', and our example prints 10
.
(You can argue that Python dictionaries should also check the class of
the key objects they are comparing and only do the ==
comparison if
they are of the same class, but the plain fact is that CPython doesn't
behave this way today. There are probably obscure legitimate uses for
this.)
PS: yes, this is one of the places where duck typing might go badly
wrong (cf the limits of duck typing). Despite my
usual opposition to isinstance()
checks, this is a case where I might
accept one. Remember to put in a comment about it.
(For more on the various special methods involved in this and what you can do with them, you can see the official documentation.)