2007-11-29
Hashes are not complete protectors of privacy
Suppose that you have been charged with building a traffic tracking system for your NAT gateway to be used when the campus NOC calls you up to report that your gateway is doing too much of the wrong sort of traffic and gives you the remote IP addresses involved. At the same time you don't want to log too much information, so that all this juicy traffic data you're gathering can't be used to do something like find out all the websites someone has been visiting.
No problems; you're smart, so you know about hashes and their uses for privacy, so you just hash all the remote IP addresses and store the hashes instead of the actual IP addresses (you have enough storage space that this isn't a problem). Fishing expeditions are prevented, because while someone could get all the hashes that their grad student had visited, they can't reverse them to IP addresses, but when the NOC gives you an IP address you can hash it and then look up the hash in your traffic database to find the originators.
(We will ignore the inconvenient fact that a database of the SHA1s of all 2^32 IP addresses is only 80 GB, and you don't even need that much since a bunch of IP address space is unused, reserved, or not relevant.)
Unfortunately, this hashing scheme is not completely protecting your users' privacy. While it has stopped people from finding out all the websites someone has visited, it is not stopping people from finding out who has visited a specific website. People can't get traffic patterns, but they can find out whether a Chinese grad student is browsing a Falun Gong website (or even which Chinese grad students are doing so).
In other words, using a hash here has created privacy in only one direction; your users have secrecy (their browsing habits can't be determined) but not deniability (it can be shown that they visited something specific). If you need full privacy, with both secrecy and deniability, you need to use more than simple hashes.
(Disclaimer: this is not how we implemented our NAT gateway traffic tracking system.)
2007-11-22
The different types of hash collisions
When you're considering using hashes as indexes, people can wind up saying that collisions aren't going to be a problem because the hashes are cryptographically sound. At such times, it's important to remember the distinction between deliberate hash collisions and accidental hash collisions.
The distinction between deliberate hash collisions and accidental ones is that an accidental hash collision is one that just happens, while a deliberate hash collision is one that can be produced by an attacker. All hashes that are shorter than their inputs necessarily have accidental collisions, but a cryptographically sound hash is resistant to deliberate collisions (an attacker should not be able to find one faster than using various exhaustive search techniques).
From this we can see that knowing that a hash is cryptographically sound tells us nothing about the chance for an accidental hash collision. With a cryptographically sound hash, the probability of having an accidental collision is determined by the birthday paradox, based on the size of your data set and the number of bits in the hash's output.
(For example, if you're thinking about indexing URLs using a SHA1 hash, you are probably safe, since an estimate puts the total size of the web at 30 to 40 billion pages and in 160 bits that's pretty much a 0% collision chance.)
Sidebar: types of deliberate hash collisions
There are at least two sorts of deliberate hash collisions that cryptographers worry about:
- an attacker can find two inputs, X and Y, that hash to the same value; this is called a collision attack.
- given an existing hash or an input A, an attacker can find a second input, B, that hashes to the same value; this is called a preimage attack (having just the hash is called '(first) preimage', while having the input A itself is called 'second preimage').
The second sort of deliberate hash collisions are a superset of the first sort; if you can find preimage collisions, you can clearly find collisions where you get to pick A. There are successful collision attacks against MD5 and SHA1, but as of yet there are no practical preimage attacks.
(RFC 4270 has a nice discussion of all of this.)
2007-11-12
Matching words against a list in the Bourne Shell
One of the interesting challenges in writing shell scripts is figuring out how to do as much as is practical with shell builtins and primitives. Since Bourne shell builtins are such a limited and constrained environment, the results can be somewhat perverse and peculiar and thus neat and amusing.
(If you can find an efficient way of doing things, using builtins is much faster than having to resort to external programs, especially when the system is loaded.)
My challenge today was checking whether a word was in a list of words (words don't have spaces), in a Bourne shell dialect that would work as far back as Solaris 2.5. What I came up with is:
case "$wlist" in
$w|"$w "*|*" $w "*|*" $w") echo yes;;
*) echo no;;
esac
(Where $w is the word and $wlist is the space-separated list of
words to match against. You need the first match condition to deal
with the case where $wlist consists only of $w.)
I sometimes feel I have an advantage in this sort of perversity, because
I started out writing shell scripts in a version of the Bourne shell
that was so old that it didn't even have test (aka [ ... ]) as a
builtin. When you don't have a built in test you get a lot of
experience with abusing case, because it is basically the only builtin
testing operation you have.
(To this day I have a tendency to use case for condition tests where
I really should use if. Something in the back of my mind is still not
really convinced that things like 'if [ "$#" -eq 0 ]' don't require
an external program.)
2007-11-04
Thinking through salts for passwords
No one who's sane stores actual plaintext passwords; instead they hash the password in some way and store the result, throwing away the password itself as fast as possible. There are three ways you can do this:
- you can hash just the password.
- you can hash some global value plus the password.
- you can hash some per-user value (a 'salt') plus the password.
If you hash just the password, anyone who can somehow gain access to your password hashes can immediately run them through a precomputed dictionary of hashes to see what pops out.
If you hash a global value plus the password, the attacker can no longer use their general precomputed dictionary. However, they can still compute a special dictionary just for you and check all of your hashes against it, either by precomputing the dictionary or by just checking every one of your hashes against every hash they generate.
If you use a per-user value, an attacker cannot compute a site-wide dictionary for your hashes; they must attack each password one by one. This is obviously the best option.
(I don't know enough to know if adding a global value too does any good. I think that at the most it insures that the same password plus salt on two sites will have different hashes. However, a good salt should be large enough and random enough that this is relatively rare anyways.)
Note that all of this says nothing about the difficulty of attacking a single password. As Matasano Chargen reminds us, that comes down to the speed of your password hashing scheme, and things like MD5 and SHA1 are embarrassingly and dangerously fast and should be avoided if you have a choice.
(This is partly me thinking this through out loud to fix it in my head.)