The Key to the Birthday Problem: uniqueValues( )

Posted on May 30, 2012

2


It’s a famous probability problem: given N people in a room, what’s the probability that at least two of them share a birthday? It’s famous because the answer is often surprising: a remarkably small number (23) will give you a probability over 50%.

How do you simulate it in Fathom? The key is going to be in a function you might not be aware of: uniqueValues( ). We’ll make a collection to represent the people in the room, and make a measure that will tell us if the birthdays are all different. Then we’ll collect measures to find the probability.

  1. Let’s try N = 20. Make a new table with 20 cases, and one attribute, birthday.
  2. Give it this formula: randomInteger( 1, 365 ) .
  3. Make a measure called allDifferent, with this formula:
    count( ) = uniqueValues( birthday ).
  4. Don’t believe me? Rerandomize until allDifferent is false, then plot birthday. It won’t take long. You’ll see the duplicate.
  5. Onward: Collect 1000 measures and plot allDifferent. And/or make a summary table. The (empirical) probability will stare you in the face. See the picture below.
  6. Add a case at a time and re-collect (replacing the measures of course). You’ll see that the proportion drops below 0.5 at about N = 23.

What Just Happened?

First of all, randomInteger( 1, 365 ) gives you, well, a random integer between 1 and 365. If I had used randomInteger( 365 ) I could have gotten zeros. No need to use actual dates, by the way; all that matters is that there are 365 different values to choose from. And we are ignoring leap years.

Now let’s look at the measure, allDifferent.
Its formula is count( ) = uniqueValues( birthday ).

What does this do? Well: count( ) is the number of cases in the collection (20). And uniqueValues( birthday ) is the number of distinct values there are in the column. So if there’s a duplicate, the number of unique values will be less than 20. In that case, the expression ( count( ) = uniqueValues( birthday ) ) will be false. But if there are no duplicates, the number of distinct values will be 20 as well, and allDifferent will be true.

Notice that the formula is not a usual numerical computation. It’s a Boolean expression, which means it can have one of two values: true or false. Special Fathom gotcha: if you put these in expressions, they do not get quotes! That’s because true and false are legit values, not strings. So I can write if (allDifferent = false) … without quotes. But when you write count( coin = “heads” ) you do need the quotes. “Heads” is a string.

Extension

Use the techniques described in the random walk post to record the probability as a function of the number of people. This uses an additional measures collection. Sample result:

Probability that there is no common birthday as a function of the number of people.

Special hint, building on the Boolean comment above: one elegant expression you can use in a measure in the measures collection is proportion( allDifferent ).

Advertisements