2006-02-13
GNU bc for the birthday paradox
So that I don't have to work this out again if I ever need it, the GNU bc
magic for the Birthday paradox stuff is as follows. First, start GNU bc
as 'bc -l' to load the math library, then:
define p(n, m) {
    /* basic Birthday paradox approximation: 1 - e^(-n(n-1)/(2m)) */
    return 1 - e(-(n * (n-1)) / (2*m))
}
define p32(n, m) {
    /* the same, but with m given as a number of bits */
    return p(n, 2^m)
}
The p() function is the straightforward p(N, M) function from the
main entry. p32() is a version that
expresses M in bits, so you can just write things like p32(10000,
32) to find out how inadequate 32-bit session IDs are.
(Answer: very. There's already a 1.16% collision chance with a mere 10,000 session IDs. At 50,000 it's up to 25%.)
Note that this produces answers as probabilities (0.0 to 1.0), not as percentages. Multiply by 100 to get percentages.
Fiddling with p32() shows a relatively intuitive result for
thinking about how many bits you need: at small probabilities, each
bit added (or subtracted) more or less halves (or doubles)
the probability. For example, p32(10000, 33) is about 0.58%, half the
1.16% chance at 32 bits. (The effect is not as simple as that when the
probability is high.)
GNU bc is programmable enough that we can write a function to do the full probability expansion for p'(N, M, R):
define pp(n, m, r) {
    /* full expansion: 1 - product of (1 - n/(m-i)) for i = 0 .. r-1 */
    auto i, prod
    prod = 1
    for (i = 0; i < r; i++) {
        prod *= 1 - n/(m-i)
    }
    return 1 - prod
}
This is not suitable for large values of N, M, or R, but it will give us effectively exact values in areas where Daniel Martin's approximation (from the comments on the original entry) may not be well behaved.
Using this, it's relatively easy to show experimentally that my proposed approximations for p', either p(N+R,M) - p(N,M) or p(N+R,M) - p(N,M) - p(R,M), are not particularly accurate, even (or especially) for relatively small values of N, M, and R. So much for any mathematical intuition I might claim to possess. We can also use this to cross-check Daniel Martin's approximation, which seems to hold up quite well even in areas that are not really inside its domain.
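Since bc isn't the only way to check this, here's a hedged Python sketch of both functions (my function names; modern Python 3, using math.log1p and math.expm1 so tiny per-guess probabilities don't round away, although pp() is still an O(R) loop). The final prints run the comparison above for the 48-bit, million-probe case:

import math

def p(n, m):
    # Basic Birthday paradox approximation: 1 - exp(-n(n-1) / 2m).
    return 1 - math.exp(-(n * (n - 1)) / (2 * m))

def pp(n, m, r):
    # Full expansion: 1 - prod over i = 0 .. r-1 of (1 - n/(m-i)),
    # accumulated in log space to preserve precision.
    log_miss = 0.0
    for i in range(r):
        log_miss += math.log1p(-n / (m - i))
    return -math.expm1(log_miss)

n, m, r = 3 * 10**6, 2**48, 10**6
print(pp(n, m, r))                       # ~0.01060 (the 48-bit chart row)
print(p(n + r, m) - p(n, m))             # ~0.01216: overshoots noticeably
print(p(n + r, m) - p(n, m) - p(r, m))   # ~0.01039: still a couple percent off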
Using both this and Daniel Martin's approximation, the good news is that an attacker has to make a lot of session ID probes to have a decent chance of success, even with relatively low numbers of bits in your session IDs. Let's do a little chart, assuming that you have three million valid session IDs and the attacker can make a million attempts:
| Bits | Collision chance |
| 32 | ~100% |
| 40 | 93.47% |
| 48 | 1.06% |
| 56 | 0.004% |
| 64 | 0.000016% |
| 72 | 0.00000006% |
(Daniel Martin's approximation is bad only for the first two rows, and
it is certainly a lot faster than the bc version, which is subject to
its own rounding issues.)
Hopefully you will detect an attack well before a million attempts, which significantly improves the numbers. Even with 48-bit session IDs, at a hundred thousand attempts there's only about a 0.1% success chance, and it drops from there.
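(A quick check of that claim, using the simple first-order approximation 1 - exp(-N*R/M), which is accurate while N and R are both tiny compared to M:)

import math

# First-order approximation: 1 - exp(-N*R/M); fine while N, R << M.
n, r, m = 3 * 10**6, 100_000, 2**48
print(-math.expm1(-n * r / m) * 100)   # ~0.107 percent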
PS: Daniel Martin's perl version needs a relatively modern perl (or
the right compilation options, or both). The Fedora Core 2 version
of perl 5.8.3 just reports 'NaN' a lot; I had to go to a Fedora
Core 4 machine with 5.8.6 to get results.
2006-02-10
Session IDs and the Birthday Paradox
If you have a database (or in general some writable store), the best way to do authentication in a web app is with session ID cookies. Every time someone logs in, you pick a large random number to be their session ID, give them a cookie with the session ID, and then store all of the details in your database under the session ID. When they come back, you get the session ID cookie, look it up in the database, make sure it's (still) valid, and go.
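In rough outline (a hypothetical Python sketch, with a dict standing in for the database and my own function names), the whole dance is:

import os

sessions = {}   # stand-in for the sessions table in your real database

def log_in(user):
    session_id = os.urandom(9).hex()        # 72 bits of randomness; see below
    sessions[session_id] = {"user": user}   # store the details under the ID
    return session_id                        # this value goes into the cookie

def check_cookie(session_id):
    # Returns the stored session details, or None if the ID is unknown.
    return sessions.get(session_id)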
What prevents attackers from getting in is the difficulty of guessing
a usable session ID. So, assuming you're using a good random number
generator such as /dev/urandom on Linux, how big do your session IDs
have to be to make them infeasible to guess?
This is a slight variant on the 'Birthday paradox' problem. With N items
and M possible values they can have, the basic Birthday paradox
approximation is (in Python notation):
1 - exp( (-(N*(N-1))) / (2*M))
(call this p(N, M).)
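As actual runnable Python (a sketch; plain math.exp is fine at these magnitudes):

import math

def p(n, m):
    return 1 - math.exp(-(n * (n - 1)) / (2 * m))

print(p(3 * 10**6, 2**48) * 100)   # ~1.59 percent, as in the table below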
To start with, let's say that you expect three million valid session IDs at a time; this might seem extreme, but LiveJournal currently has two million active users. I've made a handy little table of results:
| Bits | Collision chance |
| 32 | 100% |
| 40 | 98.3% |
| 48 | 1.59% |
| 56 | 0.006% |
| 64 | 0.00002% |
| 72 | ~ 0% |
(Disclaimer: I think I, and Python, got the math right. Please feel free to correct me if I didn't. '~ 0%' means 'so small that it prints as 0'; the collision chance is clearly never literally 0.)
So 72 bits (9 bytes) should be pretty durn safe. It even remains safe at 30 million valid session IDs, which is a pretty good sign that people trying to guess valid session IDs will be there for a long, long time. If you expect 300 million valid session IDs, you'll want to go to more bits.
On most modern systems, the easy way to get this much good randomness is
just to read however many bytes you need from /dev/urandom. (Please
don't try to make up your own random number generator. The security
world is littered with the shattered remnants of people who tried that.)
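In Python that is literally a file read (os.urandom() does the same thing portably):

# Nine bytes from the kernel's CSPRNG gives a 72-bit session ID.
with open("/dev/urandom", "rb") as f:
    session_id = int.from_bytes(f.read(9), "big")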
More probability and math
Now, this isn't quite the answer to the question I posed at the start, which I can rephrase as:
Starting with N unique session IDs out of M possible values, what are the odds of finding a duplicate in R additional guesses?
The figures I gave are effectively a handwave at p(N+R, M), which is
more or less the same as p(N, M) if R is (very) small compared to N.
A friend pointed out that this is just the probability of making R
different wrong guesses, so the full expression for the probability
is:
1 - (1 - N/M) * (1 - N/(M-1)) * ... * (1 - N/(M-(R-1)))
However, computing this for large Rs is irksome (unless you have a
good symbolic math package around), and I don't know if there's a good
approximation or summation.
I think p(N+R, M) - p(N, M) might be close, although that sort of overcounts by roughly p(R, M) (since the attacker's values will be non-colliding if the attacker has any brains).
Sidebar: giving everyone on earth a randomly assigned unique ID
Let's assume that we wanted to randomly assign everyone on earth a numeric ID, and be relatively sure that there were no collisions. How large a bit space would we need?
To have a margin for error, let's use a world population figure of ten billion.
| Bits | Collision chance |
| 64 | 93.35% |
| 72 | 1.05% |
| 80 | 0.004% |
| 88 | 0.000016% |
| 96 | 0.000000063% |
| 104 | 0.0000000002% |
| 128 | 0.000000000000000014% |
| 160 | ~ 0% |
(These figures are from bc, because Python ran out of precision. MD5
hashes are 128 bits and SHA1 hashes are 160 bits.)
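(For what it's worth, modern Python can match bc here. The precision loss presumably comes from computing 1 - math.exp(-x) when x is below float epsilon, and math.expm1 sidesteps that. A sketch for the 128-bit row:)

import math

# -expm1(-x) == 1 - exp(-x), but keeps precision when x is tiny.
n, m = 10**10, 2**128
print(-math.expm1(-n * (n - 1) / (2 * m)) * 100)   # ~1.47e-17 percent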