Python's two versions of equality, with a long digression on hash()

April 3, 2012

Python has three two versions of equality; in order of pickiness, they are == (plain equality) and is identity. When I started writing this entry I was going to say that there were three and the third was hash identity, but I am wrong about that. We'll talk about hash identity at the end.

Plain equality is actually surprisingly complicated if you want it to be. A class can control how all of the comparison operators work with special methods (aka 'magic methods'), and it can make them behave differently if it wants to. Or you can just define an old style __cmp__ method and be done with it. == can compare different types of objects and return useful results if desired. Two different objects can and often do compare as equal to each other.

Actual is identity, as in 'a is b', tests whether the two things are in fact the exact same object. This is not something that the class can control using special methods; it's directly implemented inside the interpreter (as is 'id()'). It's legitimate and common for 'a == b' to be true but 'a is b' to be false. On the other hand, sometimes it's true even if you got the objects from different places. Two different objects can never be is-identical to each other, pretty much by definition.

What I have called 'hash identity' is the value that 'hash(obj)' returns, if it returns anything. Hash identity is used in dictionaries and similar types to find things inside the dictionary, both to retrieve them and to check to see if the key you are trying to set already exists in the dictionary. An object's hash identity doesn't have to be unique and in fact isn't necessarily unique, which means that you can't directly use hash identity as an equality check (although, like dictionaries, you can use it to see if a full equality check is worth doing). A class can control the hash value of objects by defining a __hash__ special method, but you need to be very careful about how it's defined.

(I wouldn't be surprised if hash identities are also used in things like sets in order to enable relatively efficient 'a in set' checks and the like.)

You might ask why hash methods are necessary; the answer is that they enable your objects to be useful keys in dictionaries in some circumstances. Consider a complex number class (one where the numbers are immutable once created). Two complex number objects with the same real and imaginary parts should compare equal, and more than that if you do 'dct[Complex(1, -1)] = 10; print dct[Complex(1, -1)]' you should clearly get 10 instead of 'no such key'. In order to do this you need a custom hash function so that all Complex instances with the same real and imaginary values have the same hash value.

(Although it's commonly said that custom hash functions should only be defined for immutable objects, this is not quite the true requirement. You can have mutable objects with custom hash functions provided that the mutation you can do does not change either the hash function's result or the equality comparison. You may get some odd results from dictionary operations, but presumably you know what you're doing.)

Now we get into here be dragons territory: you should be very careful about combining custom hash functions with custom equality functions that are willing to do comparisons with things outside your class. The easiest way to show what can go wrong is with a simple example:

class Boom(object):
  def __init__(self, x):
    self.x = x
  def __eq__(self, other):
    if other == "YES":
      return True
    else:
      return other.x == self.x
  def __hash__(self):
    return self.x

d = {}
d[Boom(hash("YES"))] = 10
print d["YES"]

This should get a KeyError, because we sure didn't add a "YES" to the dictionary. But because our Boom() object both has the same hash value as "YES" and compares equal to it the dictionary thinks that it is the same key and says 'sure, here you go', and our example prints 10.

(You can argue that Python dictionaries should also check the class of the key objects they are comparing and only do the == comparison if they are of the same class, but the plain fact is that CPython doesn't behave this way today. There are probably obscure legitimate uses for this.)

PS: yes, this is one of the places where duck typing might go badly wrong (cf the limits of duck typing). Despite my usual opposition to isinstance() checks, this is a case where I might accept one. Remember to put in a comment about it.

(For more on the various special methods involved in this and what you can do with them, you can see the official documentation.)


Comments on this page:

From 204.197.193.163 at 2012-04-03 14:23:59:

I'll note that this is hardly unique to python. Java has the same situation with its equals() and hashCode() methods, and it's explicitly stated that the contract for these methods is that it must be true that a.hashCode() == b.hashCode() whenever a.equals(b).

It's a classic screw-up in Java to override one of those methods and not the other.

By cks at 2012-04-04 15:54:27:

It's worth noting that unlike Java, Python only has a one way requirement; you need an __eq__ if you define a __hash__, but the reverse isn't true. I put a bit more about this in EqualityNotes.

(Technically speaking nothing explodes if you don't have the __eq__, but your __hash__ is useless and possibly counterproductive.)

Written on 03 April 2012.
« The problem of ZFS pool and filesystem version numbers
More on equality in Python (well, mostly Python 2.7) »

Page tools: View Source, View Normal, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Tue Apr 3 00:05:34 2012
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.