Understanding rainbow tables
In yesterday's entry, I casually misused the term 'rainbow table' because I thought I knew what it meant when I actually didn't; what I was actually talking about was a (reverse) lookup table, which is not at all the same thing. I'm sure that I've read a bit about rainbow tables before but evidently it didn't stick and I didn't really get them. As a result of dozzie's comment pointing out my error, I wound up reading the Wikipedia writeup of rainbow tables, and I think I may now understand them. In an attempt to make this understanding stick, I'm going to write down my version of how they work and some remaining questions I have.
The core ingredient of a rainbow table is a set of reduction functions, R1 through Rn. Reduction functions take a hash value and generate a password from it in some magic way. To create one entry in a rainbow table, you start with a password p1, hash the password to generate hash h1, apply the first reduction function R1 to generate a new password p2, hash that password to generate h2, apply R2 to generate password p3, and so on along this hash chain. Eventually you hit hash hn, use your last reduction function Rn, and generate p(n+1). You then store p1, the first password (the starting point), and p(n+1), the endpoint (I think you could store the hash of p(n+1) instead, but that would take up more space).
To see if a given hash of interest is in the rainbow table, you successively pretend that it is every hash h1 through hn and compute the output of each of the hash chains from that point onward. When you're pretending the hash is hn, this is just applying Rn; when it's h(n-1) this is applying R(n-1) to generate pn, hashing it to get hn and then applying Rn, and so on. You then check to see if any of the computed endpoint passwords are in your rainbow table. If any of them are, you recreate that chain starting with your stored p1 starting point up to the point where in theory you should find the hash of interest, call it hx. If the chain's computed hx actually is the hash of interest, password px from just before it is the corresponding password.
Matching an endpoint password doesn't guarantee that you've found a password for the hash; instead it could be that you have two hashes that a given reduction function maps to the same next password. If your passwords of interest are shorter than the output of the password hash function this is guaranteed to happen some of the time (and shorter passwords is the usual case, especially with modern hash functions like SHA256 that have large outputs).
The length of the hash chains in your rainbow table is a tradeoff between storage space and compute time. The longer the hash chains are the less you have to store to cover roughly the same number of passwords, but the longer it will take to check each hash of interest because you will have to compute more versions of chains (and check more endpoint passwords). Also, the longer the hash chain is, the more reduction function variations you have to come up with.
(See the Wikipedia page for an explanation of why you have multiple reduction functions.)
As far as I can see, a rainbow table doesn't normally give you exhaustive coverage of a password space (eg, 'all eight character lower case passwords'). Most of the passwords covered by the rainbow table come from applying your reduction functions to cryptographic hashes; the hashes should have randomly distributed values so normally this will mean that your reduction functions produce passwords that are randomly distributed through your password space. There's no guarantee that these randomly distributed passwords completely exhaust that space. To get a good probability of this, I think you'd need to 'oversample' your rainbow table so that the total number of passwords it theoretically covers is greater than your password space.
(Although I haven't attempted to do any math on this, I suspect that oversampled rainbow tables still take up (much) less space than a full lookup table, especially if you're willing to lengthen your hash chains as part of it. Longer hash chains cover more passwords in the same space, at the cost of more computation.)
The total number of passwords a rainbow table theoretically covers is simply the number of entries times the length of the hash chains. If you have hash chains with 10 passwords (excluding the endpoint password) and you have 10,000 entries in your rainbow table, your table covers at most 100,000 passwords. The number of unique passwords that a rainbow table actually contains is not something that can be determined without recording all of the passwords generated by all of the reduction functions during table generation.
Sidebar: Storing the endpoint password versus the endpoint's hash
Storing the hash of the endpoint password instead of the endpoint password seems superficially attractive and it feels like it should be better, but I've wound up believing that it's usually or always going to be a bad tradeoff. In most situations, your passwords are a lot shorter than your hash values, and often you already have relatively long hash chains. If you have long hash chains, adding one more entry is a marginal gain and you pay a real space penalty for it. Even if you have relatively short chains and a relatively small table, you get basically the same result in less space by adding another reduction function and officially lengthening your chain by one.
(Reduction functions are easily added as far as I can tell;
apparently they're often basically '(common calculation) + i',
i is the index of the reduction function.)