Using summary tables in simulations

Posted on August 17, 2012


In the apstat community thingy, Rudy Medina posted a problem, and Paul Myers posted a solution. I’ll show another solution, and maybe veer off into philosophy. Here’s the problem (from 16 August 2012):

A bus stop has 7 stops and 4 passengers.  If every passenger is equally likely to get off at any stop, what is the probability that [exactly] 2 will get off at the same stop?

How do you simulate this? First we simulate the bus. We make a collection with four cases (one for each passenger) and give it one attribute, stop. This thing gets a formula such as randomInteger(1,7) or randomPick(1,2,3,4,5,6,7). So we have four random integers representing the stops where the passengers got off.
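If you would rather see this step outside of Fathom, here is a minimal Python sketch of the same idea. The names `simulate_bus`, `NUM_STOPS`, and `NUM_PASSENGERS` are mine, made up for illustration; `random.randint(1, 7)` plays the role of randomInteger(1,7).

```python
import random

NUM_STOPS = 7        # stops on the route
NUM_PASSENGERS = 4   # one "case" per passenger

def simulate_bus(rng=random):
    """Return a list with the stop (1..NUM_STOPS) where each passenger gets off.

    Hypothetical helper: each entry is an independent, equally likely stop,
    like the `stop` attribute with the formula randomInteger(1,7).
    """
    return [rng.randint(1, NUM_STOPS) for _ in range(NUM_PASSENGERS)]
```

Calling `simulate_bus()` gives one bus: a list of four integers between 1 and 7.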

Now we need to figure out whether two got off at the same stop; that is, are (exactly) two of the numbers the same?

There are a number of approaches, but it can be tricky. You might want to use measures and uniqueValues( ) as we did when we did the Birthday Problem. But that will cause trouble. Whenever uniqueValues( stop ) = 3, exactly two people got off at the same stop, as in { 1, 2, 2, 6 }. But {2, 2, 5, 5} and {2, 2, 2, 5} both have uniqueValues( stop ) = 2, and only the former (says Rudy) counts as a “success” in this probability problem; in the latter, three people share a stop. So you can’t just count up the number of times uniqueValues is 2 or 3.
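To see the ambiguity concretely, here is a small Python sketch. The function `unique_values` is a hypothetical stand-in for Fathom's uniqueValues( ); it just counts the distinct stops in one bus.

```python
def unique_values(stops):
    """Count the distinct stops used (a stand-in for uniqueValues(stop))."""
    return len(set(stops))

# The three example buses from the text.
one_pair = [1, 2, 2, 6]    # one stop shared by exactly two passengers
two_pairs = [2, 2, 5, 5]   # two stops, each shared by exactly two passengers
a_triple = [2, 2, 2, 5]    # one stop shared by three passengers

for bus in (one_pair, two_pairs, a_triple):
    print(bus, "->", unique_values(bus))
```

The last two buses report the same number of distinct stops even though only one of them has a stop where exactly two people got off.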

Paul’s solution was to make a longish Boolean expression. It works, but here is another solution that uses a technique that you may like: analyzing the contents of a summary table.

  1. Using the “bus” collection with four cases, drag stop to a summary table, holding down shift to force it to be categorical.
  2. Right-click the table and choose Create Collection from Cells. A new collection appears, Cells from bus Table.
  3. Make a (case) table for that collection so you can see that it contains all the information from the summary table. Weird, huh? The number of people who got off at each stop is in the column S1.
  4. Make a measure in that new collection. Call it anyTwos. Its formula is count(S1=2) > 0.
  5. Collect measures many times, and look at the distribution of anyTwos.
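For comparison, the steps above can be sketched in Python. This is not what Fathom does internally, just the same logic under my own assumptions: a `Counter` plays the role of the summary table (stop, then how many passengers got off there), and the hypothetical `any_twos` mirrors the measure count(S1=2) > 0.

```python
import random
from collections import Counter

def any_twos(stops):
    """True if at least one stop has exactly two passengers getting off."""
    counts = Counter(stops)  # like the summary table: stop -> number of passengers
    # Like count(S1=2) > 0: is there any cell whose count equals 2?
    return any(n == 2 for n in counts.values())

def estimate_probability(trials=10000, rng=random):
    """Collect the measure many times and return the fraction of successes."""
    hits = 0
    for _ in range(trials):
        bus = [rng.randint(1, 7) for _ in range(4)]  # re-randomize the bus
        if any_twos(bus):
            hits += 1
    return hits / trials
```

Something like `estimate_probability(20000)` then corresponds to collecting 20000 measures and looking at the proportion of true values of anyTwos.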

The summary table (left) and then, a case table from the collection “derived” from the table. Notice the new attribute S1.

The measure’s formula was
count(S1=2) > 0

There are several cool things about this. One is that, OMG, you can collect measures made from a table made from some other collection? Yes. And the whole chain knows to rerandomize the original collection when you collect another measure.
Another is that the logic of using the table might be more straightforward to kids. We’re asking, are there any stops where exactly two people got off? And the formula we write is, count(S1=2) > 0. That is, how many stops are there where two people got off? Is that number greater than zero?
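Because there are only 7^4 = 2401 equally likely ways for four passengers to pick stops, the same question can also be checked by brute force. Here is a hedged Python sketch that asks the "any stop with exactly two?" question of every possible bus. The function name is mine, and it assumes Rudy's reading, in which two pairs such as {2, 2, 5, 5} count as a success.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

NUM_STOPS = 7
NUM_PASSENGERS = 4

def exact_probability():
    """Enumerate every possible bus and return P(some stop has exactly two)."""
    total = 0
    successes = 0
    for stops in product(range(1, NUM_STOPS + 1), repeat=NUM_PASSENGERS):
        total += 1
        # Same test as count(S1=2) > 0 on the table of counts per stop.
        if 2 in Counter(stops).values():
            successes += 1
    return Fraction(successes, total)
```

By my count this gives 1386 successes out of 2401, which is 198/343 or about 0.577, so the proportion of true values of anyTwos should settle somewhere near there.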

Still, it’s troubling that this is so hard. How would you know to do such a thing? If I get the energy (maybe after lunch) I will write about it on that other blog.