Revisiting checksum functions

April 10, 2012

In his comment on Checksums and hashes, nothings correctly called me on some stuff in my discussion of checksum functions. So let me revisit my bit on checksums and try to do it better.

In a strict technical sense, per Wikipedia, there is a relatively narrow mathematical definition of a checksum function. Or perhaps a somewhat broader one (this is the joy of Wikipedia). In either case, it seems that at least some of what we informally think of as 'checksums' are not checksum functions in this mathematical sense.

But as nothings points out, in common programming use most people have a broader view of checksums; something is a checksum if it computes a value that we use to detect errors (well, changes) in the input data, more or less. The exact mathematics don't matter as much as the intended usage and result. Checksum algorithms vary in their complexity (and thus their speed), in the size of the output value (anywhere from one bit for parity, up to 16 or 32 bits for common fast checksums, to 128 bits or more for cryptographic hashes used as checksums), and in what sort of errors they detect; you get to pick an algorithm for your problem that meets your needs.

(Note that checksums are generally considered to not protect you against deliberate, malicious modification of the message. If you need that protection you need a good cryptographic hash, among other things.)

Using a cryptographic hash is an obvious brute force way to get a solid checksum function. They have relatively large output values (128 bits and up) and run quite slowly compared to something like Addler32, but for many applications neither is a significant drawback. This makes cryptographic hashes the easy checksum choice for many applications; just run your data through the standard library SHA1 interface and you're done.

(My view is that the rules for checksum functions are a lot like the rules for cryptography: never invent your own algorithms and never do your own implementations, because both are too easy to screw up. Screwing up a checksum is less explosively dangerous than screwing up cryptography, but why take chances if you don't need to?)

Sidebar: checksums versus error correcting codes

Checksums blur into the whole area of error correcting codes in general. I'm not sure where the boundaries get drawn, especially in common usage as opposed to strict mathematics. My mostly uneducated understanding is that something is a checksum if it has a constant-sized output regardless of the input size (eg SHA1 is always 160 bits regardless of how large the input data is); it turns into an error correcting code if it has to add X bits per Y bits of input.

However, the boundaries are clearly fuzzy here. For example, parity is considered a checksum in general usage but is usually done in fixed size blocks with one bit of parity per N bits of input. Possibly people just merge error correcting codes into a very broad view of 'checksums'.

(What this really means is that I don't know as much about the whole field of checksums as I should.)

Written on 10 April 2012.
« Understanding hashing in Python
Faking (or not) a ternary if operator with && and || »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Tue Apr 10 01:15:29 2012
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.