Checksums and hashes
Every so often I run across something on the Internet that makes me grind my teeth at how wrong it is. Today's example is Jeff Atwood's Speed Hashing, where he significantly mangles what both checksums and hashes are in an otherwise reasonable article. Since it has made me grumpy, I'm going to write a summary of what they actually are.
(Yes, yes, you may now cue the obligatory XKCD strip.)
A checksum function is designed to always change its value if a change of N bits or less happens to the input; what the N is depends on the specific checksum function. Checksum functions are used all over computing, often invisibly, and real checksums can be mathematically proven to have their desired properties. See Wikipedia for a far more thorough overview of ways of doing error detection and correction, where I discovered that a certain number of things that I think of as checksums are technically other related things (eg ECC).
(In the fully general definition a checksum function changes value if up to N values in the input change. In computing, we generally take the value unit to be 'a bit'; in other applications of checksums they can be things like 'a printed digit'.)
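As a small illustration of the 'always changes for an N bit or less change' property, here is a sketch that uses CRC-32 (via Python's zlib.crc32) as the example checksum and tries every possible single-bit change to a short message. The choice of CRC-32, the sample message, and the flip_bit helper are all mine, and this is an exhaustive check of one input rather than a proof; the actual guarantee comes from the mathematics of the CRC polynomial.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


message = b"checksums and hashes"  # arbitrary sample input
original = zlib.crc32(message)

# Try every possible single-bit change and record any that CRC-32 misses.
undetected = [
    i
    for i in range(len(message) * 8)
    if zlib.crc32(flip_bit(message, i)) == original
]

print(undetected)  # [] : every single-bit change altered the CRC
```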
A plain hash function is designed to have an even distribution over the output values for a given set of input values. Good hashing functions can be very specific to the expected input values and the range of output values; the ultimate version of this is a perfect hash function, which efficiently maps a known set of input values to an output range. Hash functions are not guaranteed to change their output value if a bit in the input changes, partly because the hash function may consider the changed bit an 'unimportant' one and not include it at all in the calculation of the hash value.
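To make the 'unimportant bit' point concrete, here is a sketch of a plain hash for a hypothetical case-insensitive filename table. It uses 32-bit FNV-1a as the underlying hash; the filename_hash function, the bucket count, and the decision to fold case are assumptions made up for this example, not something any particular system does.

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a, a common non-cryptographic hash."""
    h = 0x811C9DC5  # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # FNV prime, truncated to 32 bits
    return h


def filename_hash(name: str, buckets: int = 1024) -> int:
    """Hypothetical bucket index for a case-insensitive filename table.

    Case is folded before hashing, so the bit that distinguishes 'a'
    from 'A' never reaches the hash calculation at all.
    """
    return fnv1a_32(name.lower().encode("utf-8")) % buckets


# The inputs differ, but by design the hash values do not.
print(filename_hash("README.txt") == filename_hash("readme.TXT"))  # True
```

This is perfectly good behaviour for a hash table and useless behaviour for a checksum, which is the distinction being drawn here.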
A good cryptographic hash function (aka a secure hash) is a hash function that is also collision resistant (see also), non-reversible, unpredictable, and what I'll call tamper-resistant. What these mean in plain language is that it's hard to find two inputs that hash to the same output, it's hard to find an input that hashes to a specific output, that the output value doesn't let you determine properties of the input, and that you can't change the input and have it hash to the same output. Note that using a cryptographic hash does not by itself make you automatically secure against attack; you have to use it right.
(With a conventional hash function you might be able to determine that certain output values generally only occur if the input has certain properties, like being a string in all lower case as opposed to mixed case.)
Whether a given hash is a good cryptographic hash changes over time as people work out how to attack it. For example, MD5 was once a good one but is now very close to completely broken and should no longer be used. Note the implications for burning a specific cryptographic hash into long-term things like the specifications for file and communication formats.
Cryptographic hashes can often be used as checksums, although I don't know if any have been mathematically proven to be completely effective checksums (where a change of N bits or less in any arbitrary input is guaranteed to change the output).
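A minimal sketch of using a cryptographic hash as a checksum, assuming SHA-256 from Python's hashlib; the sha256_hex and matches helpers and the sample data are invented for illustration. The mismatch on the corrupted copy is what you will see in practice, not a mathematically proven guarantee of the kind a real checksum gives you.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of data."""
    return hashlib.sha256(data).hexdigest()


def matches(data: bytes, expected_hex: str) -> bool:
    """True if data still hashes to the digest recorded earlier."""
    return sha256_hex(data) == expected_hex


original = b"contents of some file\n"  # stand-in for real file contents
recorded = sha256_hex(original)        # digest stored alongside the data

# Simulate corruption by flipping a single bit in the first byte.
corrupted = bytearray(original)
corrupted[0] ^= 0x01
corrupted = bytes(corrupted)

print(matches(original, recorded))   # True
print(matches(corrupted, recorded))  # False in practice, but not proven
```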
People use plain hash functions instead of cryptographic hash functions because plain hash functions can be much faster and have smaller output ranges. A plain hash function can give you a 32-bit integer, whereas secure hashes start at 160 bits (SHA1) for anything you want to use today. Even MD5 is 128 bits.
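The size difference is easy to see from Python's hashlib, which reports each algorithm's digest size in bytes. SHA-256 is included here only as an extra point of comparison; it is not one of the hashes mentioned above.

```python
import hashlib

# digest_size is in bytes; multiply by 8 to get the output width in bits.
sizes = {
    name: hashlib.new(name).digest_size * 8
    for name in ("md5", "sha1", "sha256")
}

print(sizes)  # {'md5': 128, 'sha1': 160, 'sha256': 256}
```

Compare these with the 32 bits that a typical plain hash function hands back as a single integer.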
(Yes, people say that cryptographic hashes are fast. That's a relative measure; they're fast compared to, say, communication links or the time it takes for feasible brute force attacks. They're not fast when compared to plain hash functions. You do not want your kernel to be MD5'ing every filename as part of maintaining a cache mapping filenames to file information.)
Why we haven't taken to DTrace
Recently I read Barriers to entry for DTrace adoption (via Twitter). As it happens I have an opinion on this, since we use Solaris and I have done a modest amount of things with DTrace. My belief is that DTrace has between two and three problems, depending on how you look at it.
(Part of our non-use of DTrace is that I once had a bad experience where starting to use DTrace on a production fileserver had immediate and significant bad effects. I've seen DTrace work okay since then but the uncertainty lingers, especially for writing my own DTrace scripts. But that's only a relatively modest part of it.)
First is that it's pretty hard to really use DTrace if you're not familiar with Solaris kernel internals. This issue takes some explanation (unless you've tried to use DTrace, in which case you're probably awfully familiar with it). What it boils down to is that there are really two DTraces, one for extracting subsystem information from the kernel and one for debugging the kernel, and the first one is incomplete.
In theory, DTrace lets you tap into all sorts of documented trace points that Solaris has put into the kernel, extracting a wide variety of interesting state from each of them (you can read the coverage of the various providers in the DTrace documentation). In practice, the Solaris kernel developers have never provided enough trace points with enough state information to be really useful by themselves. Instead they leave you to fall back on the 'kernel debugging' side of DTrace, where you can intercept and trace almost any function and extract random information from kernel memory provided that you know what you're looking for and what it means.
There are two problems with this (at least from my perspective). The first is that most of the really interesting uses of DTrace require using the kernel debugging DTrace and using the kernel debugging DTrace requires understanding the internals of the kernel. Ideally you need the code, which has always made things a little bit interesting (even before Solaris went closed source, OpenSolaris source did not exactly match Solaris (cf)). The second is that the DTrace documentation has never tried to address this split, instead throwing everything together in one big pile that (the last time I read it) was probably more oriented towards the person doing a deep dive into the kernel than a sysadmin trying to cleverly extract useful information from what trace points there are.
(One sign of the documentation quality is that there is a plethora of blog entries and web sites that try to explain clever DTrace tricks and how to use it to get interesting results. Personally I would like to see the documentation split into at least two parts, one for sysadmins and one for people debugging the kernel.)
Second (or third, depending on how you view the documentation problem) is that the DTrace scripting language has plenty of annoying awkwardness and pointless artificial limitations. These are situations where DTrace can do what you want but it forces you to jump through all sorts of hoops with no assistance; one example I've already mentioned is pulling information from user space. Many of these issues could be fixed with things like macros and other high level language features (or specific support for various higher level operations), but the DTrace authors seem to have deliberately chosen to keep much of the language at a low level. This is a virtue in a system language but DTrace isn't a system language, it's a way of specifying what information you want to extract from the system and when.
(One unkind way to put this is that the DTrace scripting language is mostly oriented around the needs of the people writing the kernel DTrace components instead of the people who are trying to use DTrace. It's easy to see how this happened but it doesn't make it right.)
These issues don't make DTrace impossible to use, and as a demonstration of that lots of people have written lots of very interesting and useful DTrace scripts. But they do significantly raise the barriers to entry for using DTrace; for most serious and interesting uses, you have to be prepared to learn kernel internals and slog through a certain amount of annoyance and make-work. It should not be any surprise that plenty of people haven't had problems that are sufficiently urgent and intractable to cause them to do this.
(It is not just that this stuff has to be learned. It's also that the learning simply takes time, probably significant time, and many people may not have that much time if they're dealing with a non-urgent problem.)