## Session IDs and the Birthday Paradox

February 10, 2006

If you have a database (or in general some writable store), the best way to do authentication in a web app is with session ID cookies. Every time someone logs in, you pick a large random number to be their session ID, give them a cookie with the session ID, and then store all of the details in your database under the session ID. When they come back, you get the session ID cookie, look it up in the database, make sure it's (still) valid, and go.

What prevents attackers from getting in is the difficulty of guessing a usable session ID. So, assuming you're using a good random number generator such as `/dev/urandom` on Linux, how big do your session ID numbers have to be to make one infeasible to guess?

This is a slight variant on the 'Birthday paradox' problem. With `N` items and `M` possible values they can have, the basic Birthday paradox approximation is (in Python notation):

`1 - exp( (-(N*(N-1))) / (2*M))`

(call this p(N, M).)

To start with, let's say that you expect three million valid session IDs at a time; this might seem extreme, but LiveJournal currently has two million active users. I've made a handy little table of results:

 Bits Collision chance 32 100% 40 98.3% 48 1.59% 56 0.006% 64 0.00002% 72 ~ 0%

(Disclaimer: I think I, and Python, got the math right. Please feel free to correct me if I didn't. '~ 0%' means 'so small that it prints as 0'; the collision chance is clearly never literally 0.)

So 72 bits (9 bytes) should be pretty durn safe. It even remains safe at 30 million valid session IDs, which is a pretty good sign that people trying to guess valid session IDs will be there for a long, long time. If you expect 300 million valid session IDs, you'll want to go to more bits.

On most modern systems, the easy way to get this much good randomness is just to read however many bytes you need from `/dev/urandom`. (Please don't try to make up your own random number generator. The security world is littered with the shattered remnants of people who tried that.)

### More probability and math

Now, this isn't quite the answer to the question I posed at the start, which I can rephrase as:

Starting with `N` unique session IDs out of `M` possible values, what are the odds of finding a duplicate in `R` additional guesses?

The figures I gave are somewhat of a handwave about, effectively, p(N+R, M), which is more or less the same as p(N, M) if `R` is (very) small compared to `N`.

A friend pointed out that this is just the probability of making `R` different wrong guesses, so the full expression for the probability is:

1 - (1 - N/M) * (1 - N/(M-1)) * ... * (1 - N/(M-R))

(Update: actually the terms should run only to (M-(R-1)).)

However, computing this for large `R`s is irksome (unless you have a good symbolic math package around), and I don't know if there's a good approximation or summation.

I think p(N+R, M) - p(N, M) might be close, although that sort of overcounts by roughly p(R, M) (since the attacker's values will be non-colliding if the attacker has any brains).

### Sidebar: giving everyone on earth a randomly assigned unique ID

Let's assume that we wanted to randomly assign everyone on earth a numeric ID, and be relatively sure that there were no collisions. How large a bit space would we need?

To have a margin for error, let's use a world population figure of ten billion.

 Bits Collision chance 64 93.35% 72 1.05% 80 0.004% 88 0.000016% 96 0.000000063% 104 0.0000000002% 128 0.000000000000000014% 160 ~ 0%

(These figures are from `bc`, because `python` ran out of precision. MD5 hashes are 128 bits and SHA1 hashes are 160 bits.)

By DanielMartin at 2006-02-11 01:40:49:

Note that as I mentioned elsewhere, your formula in "More Probability and Math" reduces to:

```  1 - ((M-N)!(M-R)!)/(M!(M-N-R)!)
```

Now, if we let `f(M,N,R)` be that messy bit after the `1 -`, then we can use Stirling's approximation (ln(N!) ~=~ N ln N - N) to say:

`ln (F(M,N,R)) ~=~ (M-N) ln (M-N) - (M-N) + (M-R) ln (M-R) - (M-R) - M ln M + M - (M-N-R) ln (M-N-R) + (M-N-R)`

` = (M-N) ln (M-N) + (M-R) ln (M-R) - M ln M - (M-N-R) ln (M-N-R)`

Now, `f(M,N,R)` is "probability that you're safe with M,N even if your attacker gets to make as many as R guesses", which is a useful thing to know. You, as a system designer, want `f(M,N,R)` to be close to 1; that is, you want `ln(f(M,N,R))` to be close to 0. Now, one of the nice things about `ln()` is that when `x` is extremely close to 1, `ln(x)` is approximately `x-1`. This means that you can acheive "9 nines" chance of still being safe by making sure that `ln(F(M,N,R))` is between 0 and `-(10^(-9))`

Hrm. I'm not sure that Stirling's approximation is necessarily accurate enough here for regions we're interested in...

By cks at 2006-02-11 04:23:05:

PS: if the math formulas in Daniel Martin's comment look a little sub-optimally formatted, you can blame me. I used magic site admin powers to edit his comment after he posted it to allow the formulas to wordwrap for an assortment of personal aesthetic reasons.

(In the process I found a lurking bug with comment handling and formatting. It's never the big things that get my programs, it's always the small little bits around the edges.)

By DanielMartin at 2006-02-12 02:48:19:

Actually, continuing that train of thought from the previous comment, here's a perl "one-liner" that calculates the chance that an attacker trying R different guesses can get one of your N session IDs based on using a 64-bit session key. It uses an expanded form of Stirling's approximation, which you can find here, among other places:

``` perl -le 'use bignum;sub xe{\$_[0]*log(\$_[0])+log(atan2(1,1)*8*\$_[0])/2
+ log(1+1/(12*\$_[0]));} my(\$m,\$n,\$r)=map {eval "use bignum;\$_"}
@ARGV;print -(xe(\$m-\$n)+xe(\$m-\$r)-xe(\$m)-xe(\$m-\$n-\$r))'
'2**64' '3*(10**8)' '3*(10**8)'
```

The sub `xe` is just Stirling's approximation for `ln(N!)` without the "`-N`" bits, which all cancel out when we use it anyway. The last line (presumably this won't get reformatted, since aesthetic considerations probably don't apply to perl one-liners) is the parameters M (since of keyspace), N (number of valid session IDs), and R (number of IDs the attacker can try).

With those numbers, the attacker's chancees of success are less than 0.5%. With a change to a 96 bit key, you're into "nine nines" territory. That's with both N and R being 300 million.

Fiddle with it - remember, the assorted approximations used are valid only if the answer returned is very close to 0.

Written on 10 February 2006.