## Checksums and hashes

April 6, 2012

Every so often I run across something on the Internet that makes me grind my teeth at how wrong it is. Today's example is Jeff Atwood's Speed Hashing, where he significantly mangles what both checksums and hashes are in an otherwise reasonable article. Since it has made me grumpy, I'm going write a summary what they actually are.

(Yes, yes, you may now cue the obligatory XKCD strip.)

A checksum function is designed to always change its value if an N bit or less change happens to the input; what the N is depends on the specific checksum function. Checksum functions are used all over computing, often invisibly, and real checksums can be mathematically proven to have their desired properties. See Wikipedia for a far more thorough overview of ways of doing error detection and correction, where I discovered that a certain amount of things that I think of as checksums are technically other related things (eg ECC).

(In the fully general definition a checksum function changes value if up to N values in the input change. In computing, we generally take the value unit to be 'a bit'; in other applications of checksums they can be things like 'a printed digit'.)

A plain hash function is designed to have an even distribution over the output values for a given set of input values. Good hashing functions can be very specific to the expected input values and the range of output values; the ultimate version of this is a perfect hash function, which efficiently maps a known set of input values to an output range. Hash functions are not guaranteed to change their output value if a bit in the input changes, partly because the hash function may consider the changed bit an 'unimportant' one and not include it at all in the calculation of the hash value.

(There are a lot of subtle issues when designing hash functions that I'm not going to try to cover here.)

A good cryptographic hash function (aka a secure hash) is a hash function that is also collision resistant (see also), non-reversible, unpredictable, and what I'll call tamper-resistant. What these mean in plain language is that it's hard to find two inputs that hash to the same output, it's hard to find an input that hashes to a specific output, that the output value doesn't let you determine properties of the input, and that you can't change the input and have it hash to the same output. Note that using a cryptographic hash does not by itself make you automatically secure against attack; you have to use it right .

(With a conventional hash function you might be able to determine that certain output values generally only occur if the input has certain properties, like being a string in all lower case as opposed to mixed case.)

Whether a given hash is a good cryptographic hash changes over time as people work out how to attack it. For example, MD5 was once a good one but is now very close to completely broken and should no longer be used. Note the implications for burning a specific cryptographic hash into long term things like the specifications for file and communication formats.

Cryptographic hashes can often be used as checksums, although I don't know if any have been mathematically proven to be completely effective checksums (where a change of N bits or less in any arbitrary input is guaranteed to change the output).

People use plain hash functions instead of cryptographic hash functions because plain hash functions can be much faster and have smaller output ranges. A plain hash function can give you a 32-bit integer, whereas secure hashes start at 160 bits (SHA1) for anything you want to use today. Even MD5 is 128 bits.

(Yes, people say that cryptographic hashes are fast. That's a relative measure; they're fast compared to, say, communication links or the time it takes for feasible brute force attacks. They're not fast when compared to plain hash functions. You do not want your kernel to be MD5'ing every filename as part of maintaining a cache mapping filenames to file information.)

Written on 06 April 2012.