## Checksums and hashes

April 6, 2012

Every so often I run across something on the Internet that makes me grind my teeth at how wrong it is. Today's example is Jeff Atwood's Speed Hashing, where he significantly mangles what both checksums and hashes are in an otherwise reasonable article. Since it has made me grumpy, I'm going to write a summary of what they actually are.

(Yes, yes, you may now cue the obligatory XKCD strip.)

A checksum function is designed to always change its value if a change of N bits or fewer happens to the input; what N is depends on the specific checksum function. Checksum functions are used all over computing, often invisibly, and real checksums can be mathematically proven to have their desired properties. See Wikipedia for a far more thorough overview of ways of doing error detection and correction, where I discovered that a certain number of things that I think of as checksums are technically other related things (eg ECC).

(In the fully general definition a checksum function changes value if up to N values in the input change. In computing, we generally take the value unit to be 'a bit'; in other applications of checksums they can be things like 'a printed digit'.)
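You can check this sort of guarantee directly. Treating CRC32 as a stand-in checksum (with the caveat that CRCs are technically a related thing, per the above), a CRC is linear, so flipping any single input bit always changes the output. A minimal sketch with Python's `zlib.crc32`:

```python
import zlib

data = bytearray(b"hello, world")
original = zlib.crc32(data)

# Flip every single bit of the input in turn; a CRC is guaranteed
# to detect any single-bit error, so the value must change each time.
for i in range(len(data) * 8):
    data[i // 8] ^= 1 << (i % 8)
    assert zlib.crc32(data) != original
    data[i // 8] ^= 1 << (i % 8)  # restore the bit before the next flip
print("all", len(data) * 8, "single-bit flips changed the CRC")
```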

A plain hash function is designed to have an even distribution over the output values for a given set of input values. Good hashing functions can be very specific to the expected input values and the range of output values; the ultimate version of this is a perfect hash function, which efficiently maps a known set of input values to an output range. Hash functions are not guaranteed to change their output value if a bit in the input changes, partly because the hash function may consider the changed bit an 'unimportant' one and not include it at all in the calculation of the hash value.

(There are a lot of subtle issues when designing hash functions that I'm not going to try to cover here.)
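As a toy illustration of the 'unimportant bit' point (this is a deliberately bad hash, not something you'd actually use), here is a hash that lowercases its input before mixing, so flipping the ASCII case bit of any letter has no effect on its output:

```python
# A toy string hash that treats ASCII case as 'unimportant': it
# lowercases the input first, so bit 0x20 of any letter (the ASCII
# case bit) never reaches the hash calculation at all.
def toy_hash(s: str, buckets: int = 1024) -> int:
    h = 0
    for ch in s.lower():            # deliberately discards case information
        h = (h * 31 + ord(ch)) % (2 ** 32)
    return h % buckets

# These two inputs differ in several bits but hash identically.
assert toy_hash("Filename.TXT") == toy_hash("filename.txt")
```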

A good cryptographic hash function (aka a secure hash) is a hash function that is also collision resistant (see also), non-reversible, unpredictable, and what I'll call tamper-resistant. What these mean in plain language is that it's hard to find two inputs that hash to the same output, that it's hard to find an input that hashes to a specific output, that the output value doesn't let you determine properties of the input, and that you can't change the input and have it hash to the same output. Note that using a cryptographic hash does not by itself make you automatically secure against attack; you have to use it right.

(With a conventional hash function you might be able to determine that certain output values generally only occur if the input has certain properties, like being a string in all lower case as opposed to mixed case.)
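The contrast with the toy case-ignoring hash above is easy to see in miniature with SHA-256: flipping a single input bit (here the case bit of the first letter) scrambles roughly half of the 256 output bits, and nothing about the input's case survives in the digest. A quick sketch:

```python
import hashlib

# The two inputs differ in exactly one bit (the ASCII case bit of 'p').
a = hashlib.sha256(b"password1").hexdigest()
b = hashlib.sha256(b"Password1").hexdigest()

# Count how many of the 256 output bits differ; for a good secure
# hash this should be roughly half (the 'avalanche' effect).
diff = bin(int(a, 16) ^ int(b, 16)).count("1")
print("differing output bits:", diff, "of 256")
```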

Whether a given hash is a good cryptographic hash changes over time as people work out how to attack it. For example, MD5 was once a good one but is now very close to completely broken and should no longer be used. Note the implications for burning a specific cryptographic hash into long term things like the specifications for file and communication formats.

Cryptographic hashes can often be used as checksums, although I don't know if any have been mathematically proven to be completely effective checksums (where a change of N bits or less in any arbitrary input is guaranteed to change the output).

People use plain hash functions instead of cryptographic hash functions because plain hash functions can be much faster and have smaller output ranges. A plain hash function can give you a 32-bit integer, whereas secure hashes start at 160 bits (SHA1) for anything you want to use today. Even MD5 is 128 bits.

(Yes, people say that cryptographic hashes are fast. That's a relative measure; they're fast compared to, say, communication links or the time it takes for feasible brute force attacks. They're not fast when compared to plain hash functions. You do not want your kernel to be MD5'ing every filename as part of maintaining a cache mapping filenames to file information.)
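A rough comparison of both output size and speed, using Python's standard library (exact timings vary by machine and build, so take the numbers as illustrative only):

```python
import hashlib
import timeit
import zlib

data = b"x" * 65536

# Output sizes: a plain checksum/hash fits in 32 bits; SHA-1 is 160 bits.
print(zlib.crc32(data).bit_length() <= 32)    # True
print(len(hashlib.sha1(data).digest()) * 8)   # 160

# Rough speed comparison over a 64 KB buffer.
crc_time = timeit.timeit(lambda: zlib.crc32(data), number=2000)
sha_time = timeit.timeit(lambda: hashlib.sha1(data).digest(), number=2000)
print(f"crc32: {crc_time:.4f}s  sha1: {sha_time:.4f}s")
```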

By nothings at 2012-04-07 11:53:35:

To get pedantic (which naturally this sort of post invites):

"A checksum function is designed to always change its value if an N bit or less change happens to the input; what the N is depends on the specific checksum function."

I think this is an overly specific definition of checksum. An N-bit CRC is designed to always detect an error consisting of at most N continuous bits. It is presumably possible to toggle two widely separated bits and have no effect on the CRC, and I would bet that there is no finite-sized checksum without that limitation--if you're allowed to output k bits for every N input bits, then it's different, hence 'finite-sized checksum'. (Thus, if my theory is correct, for the definition of checksum you've written, N must always be one.)

However, I suspect that in many communications channels, burst errors are no longer the most likely errors. Many other checksums do not appear to have properties of the 'continuous-bit-error' type (possibly because CRC is already optimal for those).

Adler32, for example, appears to make no such guarantees (that I can find), and is analyzed here: http://tools.ietf.org/html/rfc3309 more as if it were a hash (since a well-distributed hash maximizes chance of arbitrary error detection). Note that adler32 originated in zlib as a checksum of the uncompressed data; a single-bit error on the compressed stream can have an arbitrary effect on the uncompressed stream.

By cks at 2012-04-07 13:17:17:

Realistically you're right; everyone considers CRCs to be a form of checksums, and so I should have said more about this (technically the Wikipedia page claims that CRCs, ECCs, and so on are not literally mathematical checksums). I think that many checksums have N as one but may sometimes detect additional errors depending on the form they take.

(The general area of error detecting codes is both complex once you start looking at the technicalities and something that I don't know enough to write about. Possibly checksums are the same once you start paying close attention.)
