The different types of hash collisions
When you're considering using hashes as indexes, people can wind up saying that collisions aren't going to be a problem because the hashes are cryptographically sound. At such times, it's important to remember the distinction between deliberate hash collisions and accidental hash collisions.
The distinction between deliberate hash collisions and accidental ones is that an accidental hash collision is one that just happens, while a deliberate hash collision is one that can be produced by an attacker. All hashes that are shorter than their inputs necessarily have accidental collisions, but a cryptographically sound hash is resistant to deliberate collisions (an attacker should not be able to find one faster than using various exhaustive search techniques).
From this we can see that knowing that a hash is cryptographically sound tells us nothing about the chance for an accidental hash collision. With a cryptographically sound hash, the probability of having an accidental collision is determined by the birthday paradox, based on the size of your data set and the number of bits in the hash's output.
(For example, if you're thinking about indexing URLs using a SHA1 hash, you are probably safe, since an estimate puts the total size of the web at 30 to 40 billion pages and in 160 bits that's pretty much a 0% collision chance.)
Sidebar: types of deliberate hash collisions
There are at least two sorts of deliberate hash collisions that cryptographers worry about:
- an attacker can find two inputs, X and Y, that hash to the same value; this is called a collision attack.
- given an existing hash or an input A, an attacker can find a second input, B, that hashes to the same value; this is called a preimage attack (having just the hash is called '(first) preimage', while having the input A itself is called 'second preimage').
The second sort of deliberate hash collisions are a superset of the first sort; if you can find preimage collisions, you can clearly find collisions where you get to pick A. There are successful collision attacks against MD5 and SHA1, but as of yet there are no practical preimage attacks.
(RFC 4270 has a nice discussion of all of this.)