There are six different Famous Statistician cards, one (randomly) in each box of Chocolate Sugar Bombs. How many boxes do you need to buy to get all six?

This is a good three-collection simulation in Fathom, and uses two little-used features, **sample until** and the **uniqueValues( )** function.

- Make a new collection (we’ll call it **Cards**), give it six cases and one attribute (**card**, say) with six different values, e.g., A, B, C, D, E, and F.
- Sample with replacement from that source collection (**Sample of Cards**).
- In the inspector that appears, set it up to “sample until.” Then give the sampling this formula: until **uniqueValues( card ) = 6**. That is, keep sampling until there are six different values in the sample collection.

- Create a measure in the source or sample collection that counts how many are in the collection. Typically, this is called **N** and has the formula **count( )**.
- Collect measures to find the distribution of counts. Now you can find the mean, or the 95th percentile, or whatever you want depending on how sure you want to be of having a complete set.

Here is a typical result:

So if you buy 26 boxes, you have a really good chance of getting all 6. The average in *this* set of 200 trials is 15-ish.
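For readers who want to check the numbers outside Fathom, here is a minimal Python sketch of the same "sample until" idea. It is not Fathom code; the function names and the seed are made up for illustration. It also computes the exact mean, 6 × (1 + 1/2 + … + 1/6) = 14.7, which agrees with the "15-ish" average above.

```python
import random
from fractions import Fraction

def boxes_until_complete(n_cards=6, rng=random):
    """Buy boxes until every one of n_cards different cards has turned up."""
    seen = set()
    boxes = 0
    while len(seen) < n_cards:           # same stopping rule as uniqueValues( card ) = 6
        seen.add(rng.randrange(n_cards))  # one random card per box
        boxes += 1
    return boxes

def expected_boxes(n_cards=6):
    """Exact mean number of boxes: n * (1 + 1/2 + ... + 1/n)."""
    return n_cards * sum(Fraction(1, k) for k in range(1, n_cards + 1))

rng = random.Random(1)                   # arbitrary seed, only for repeatability
trials = sorted(boxes_until_complete(6, rng) for _ in range(200))

print("mean of 200 trials:", sum(trials) / len(trials))
print("95th percentile   :", trials[int(0.95 * len(trials)) - 1])
print("exact mean        :", float(expected_boxes(6)))
```

Changing `n_cards` corresponds to using a number other than 6.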

(Hint: the measures can go slowly if you have the sample-collection table open. Get rid of it or iconize it so Fathom doesn’t have to redraw it so frequently.)

Of course you don’t have to use 6. And you don’t have to collect all of them. Notice the similarity between this setup and the birthday problem.

A bus route has 7 stops and 4 passengers. If every passenger is equally likely to get off at any stop, what is the probability that [exactly] 2 will get off at the same stop?

How do you simulate this? First we simulate the bus. We make a collection with four cases (one for each passenger) and give it one attribute, **stop**. This thing gets a formula such as **randomInteger(1,7)** or **randomPick(1,2,3,4,5,6,7)**. So we have four random integers representing the stops where the passengers got off.

Now we need to figure out whether two got off at the same spot, that is, are (exactly) two of the numbers the same.

There are a number of approaches, but it can be tricky. You might want to use measures and **uniqueValues( )** as we did when we did the Birthday Problem. But that will cause trouble: whenever **uniqueValues( stop ) = 3**, it means two people got off at the same stop, as in { 1, 2, 2, 6 }. But {2, 2, 5, 5} and {2, 2, 2, 5} both have **uniqueValues( stop ) = 2**, and the former (says Rudy) counts as a “success” in this probability problem. So you can’t just count up the number of times **uniqueValues** is 2 or 3.

Paul’s solution was to make a longish Boolean expression. It works, but here is another solution that uses a technique that you may like: analyzing the contents of a summary table.

- Using the “bus” collection with four cases, drag **stop** to a summary table, holding down shift to force it to be categorical.
- Right-click the table and choose **Create Collection from Cells**. A new collection appears, **Cells from bus Table**.
- Make a (case) table for that collection so you can see that it contains all the information from the summary table. Weird, huh? The number of people who got off at each stop is in the column **S1**.
- Make a measure in that new collection. Call it **anyTwos**. Its formula is **count( S1 = 2 ) > 0**.
- Collect measures many times, and look at the distribution of **anyTwos**.

There are several cool things about this. One is that, OMG, you can collect measures made from a table made from some other collection? Yes. And the whole chain knows to rerandomize the original collection when you collect another measure.

Another is that the logic of using the table might be more straightforward to kids. We’re asking, are there any stops where exactly two people got off? And the formula we write is, **count(S1=2) > 0**. That is, how many stops are there where two people got off? Is that number greater than zero?
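The same "tally the stops, then ask whether any tally is 2" logic can be sketched in Python. This is an illustration, not Fathom code; the function names are hypothetical. Because there are only 7^4 = 2401 equally likely outcomes, the sketch also enumerates them all to get an exact answer to compare with the simulation.

```python
import random
from collections import Counter
from itertools import product

def any_twos(stops):
    """True if at least one stop has exactly two passengers getting off.

    Counter(stops) plays the role of the summary table; the test mirrors
    the measure count( S1 = 2 ) > 0.
    """
    return 2 in Counter(stops).values()

def simulate(trials=10_000, n_stops=7, n_passengers=4, seed=0):
    """Empirical probability from repeated random buses."""
    rng = random.Random(seed)
    hits = sum(
        any_twos([rng.randint(1, n_stops) for _ in range(n_passengers)])
        for _ in range(trials)
    )
    return hits / trials

def exact(n_stops=7, n_passengers=4):
    """Exact probability by listing every equally likely outcome."""
    outcomes = list(product(range(1, n_stops + 1), repeat=n_passengers))
    return sum(any_twos(o) for o in outcomes) / len(outcomes)

print("simulated:", simulate())
print("exact    :", exact())
```

With this definition, { 2, 2, 5, 5 } counts as a success and { 2, 2, 2, 5 } does not, matching the discussion above.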

Still, it’s troubling that this is so hard. How would you know to do such a thing? If I get the energy (maybe after lunch) I will write about it on that other blog.

We can learn more about streakiness through simulation. You know how to make a sequence of fair coin flips in Fathom. But how do you calculate the lengths of the streaks? The key function: **runLength( )**.

- New document, new table, new attribute called **coins**.
- Give **coins** the formula **randomPick( “H”, “T” )**.
- Second attribute. Call it **run**.
- Give **run** the formula **runLength( coins )**.

The results look strange at first. **Run** shows you how many coins have been the same *since the last change*. How shall we use this sequence of numbers?

The first thing we want to know is, *what is the longest streak*? That turns out to be easy: it’s the maximum value of **run**. You can find that in a dot plot or however you like.

But what we really want to know is, if I got a longest streak of 11 in 500 flips (and I just did) is that unusual? For that we need to create a sampling distribution of the maximum run length. And for that we use measures.

- Using the collection inspector (double-click the collection) go to the **Measures** panel and make a **<new>** measure called **longest**.
- Give it this formula: **max(run)**. The maximum length should appear in the box, as shown.
- Collect measures. At least 100. Make a graph of **longest**.

So here (below) are the maximum lengths from 100 simulations of 500 coin flips. As you can see, a streak of 11 is not unusual at all. In fact, having 6 or fewer as the maximum streak length is reason to doubt that the coins are fair and the sequence is random.
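If you want to reproduce this sampling distribution outside Fathom, the following Python sketch does the equivalent. The `run_lengths` helper is my own imitation of what the post says **runLength( )** does (count how many values have been the same since the last change); it is an assumption, not Fathom's implementation.

```python
import random
from collections import Counter

def run_lengths(flips):
    """For each position, how many consecutive equal values end there."""
    runs = []
    for i, flip in enumerate(flips):
        if i > 0 and flip == flips[i - 1]:
            runs.append(runs[-1] + 1)   # streak continues
        else:
            runs.append(1)              # streak restarts
    return runs

def longest_streak(n_flips, rng):
    """The measure 'longest': max(run) for one sequence of fair flips."""
    flips = [rng.choice("HT") for _ in range(n_flips)]
    return max(run_lengths(flips))

rng = random.Random(2)                  # arbitrary seed
longest = [longest_streak(500, rng) for _ in range(100)]   # "collect 100 measures"

# A text version of the dot plot of longest
for length, how_many in sorted(Counter(longest).items()):
    print(f"{length:2d} {'*' * how_many}")
```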

**The Distribution of Streak Lengths in a Single Sequence**

Using **runLength( )** in that way will solve a lot of typical problems for you. But Jackie Pau from Jack Yates High School in Houston asked a more elaborate question. He called tech support at Key, and the question eventually found its way to me.

The question was, can you get the *distribution* of streak lengths? Of course you can! But it’s a little tricky [, grasshopper], so I’m glad you asked.

So put aside the measures and let’s go back to the two columns. We have a column of heads and tails, and a column of run lengths. We’d like to get the lengths of *all* of the streaks. So, looking at the **run** column, how do you know a new streak is starting?

Answer: the number in **run** is the length of the streak if the *next* number is “1”.

So we want a new column (let’s call it **streak**) which has the length of the streak if it’s the last coin in the streak and nothing otherwise. That is, in 500 coins, we’ll have fewer than 500 streaks.

And here is the formula, which should be mostly self-explanatory:

Two comments:

- See how **next( )** works?
- The double-double quotes are an *empty string*. And that means it will be blank.

Now we have a third column, about half of whose cells are empty. You can get the distribution by making a graph or a summary table—just be sure to hold **shift** when you drop **streak** into the summary table, as you want to treat it as categorical:

The illustration shows the distribution of streak lengths. In this case, the longest one was 9, but there was only one of them. The most common streak, by far, is 1—which may help us understand why we often underestimate the length of the longest streak.
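As a cross-check, here is a Python sketch of the same idea: record a streak's length only at its last coin, then tally the lengths. It is an illustration rather than a translation of the Fathom formula. In particular, I assume the final streak in the sequence should be counted even though no coin follows it.

```python
import random
from collections import Counter

def streak_lengths(flips):
    """Length of every streak, recorded at the last coin of that streak."""
    streaks = []
    run = 0
    for i, flip in enumerate(flips):
        # running count since the last change, like the run column
        run = run + 1 if i > 0 and flip == flips[i - 1] else 1
        # last coin of a streak: the next coin differs (its run would be 1),
        # or there is no next coin (assumption: count the final streak too)
        is_last_in_streak = i == len(flips) - 1 or flips[i + 1] != flip
        if is_last_in_streak:
            streaks.append(run)
    return streaks

rng = random.Random(6)                  # arbitrary seed
flips = [rng.choice("HT") for _ in range(500)]
distribution = Counter(streak_lengths(flips))

for length, how_many in sorted(distribution.items()):
    print(length, how_many)
```

The streak lengths always add up to the number of flips, which is a handy sanity check.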

Ok, So I am brand new to fathom and to AP Stats this year and I am getting frustrated trying to figure out how to simulate a problem on fathom. The problem says that a person buys 5 lottery tickets with 6 numbers ranging 1 to 49 on each but is surprised to find out that the winning 6 numbers are not on any of the bought tickets.

My problem is when I use Randominteger (1,49) I get repeats on numbers. I don’t think you pick the same numbers twice on a lottery ticket do you? How can I get the cases to give me 6 random numbers without replacement to carry out this simulation? I can do it on the measures, but not the cases. It’s probably quite easy but I have tried for two days and I am about to just give up! HELP!!

We need to know a little more; how does the person pick his (or her) five tickets?

Do they do a quick pick from the machine for their tickets? If so, their own tickets might have duplicates. (And it will be more likely that there are no winners on any tickets.)

Or do they buy the tickets systematically, without overlaps? Then you know the tickets cover 30 of the 49 numbers.

**Systematic Picks**

Let’s do the latter case, which is easier. First we have to agree that it doesn’t matter which 30 numbers we pick for the tickets as long as they don’t overlap, so we might as well pick convenient ones. Let’s pick 1–30! We need to simulate the lottery company sampling 6 without replacement; we’ll find the probability that all six of the numbers are over 30.

- Make a collection with one attribute, **pick**, and 49 cases. Give **pick** the values 1–49, systematically. There are two great ways to do this: you can type in the numbers; or you can make a formula using the special variable **caseIndex**. This will work fine, but for extra safety, you might also want to clear the formula after you make the numbers. That way, the values are just plain numbers, as if you had typed them.
- Sample from the collection. The sample panel in the new collection’s inspector appears. Change it so you’re sampling 6, and un-check **with replacement**. See the illustration. (But turn **Animation** off.)

Now the sample collection will have 6 cases, and none of the **pick**s will be the same. Now we want to see whether all of the numbers are over 30. The strategy is to count how many are over 30; later we’ll see how many of these samples have this number “6.”

This is a job for measures.

- In that same sample collection’s inspector, go to the **Measures** panel (the second tab, reading **Mea…** in the picture).
- Make a new measure, let’s call it **losers**.
- Give it the formula: **count( pick > 30 )**.
- Now **Collect Measures** from the sample collection. The new collection will be called **Measures from Sample of Collection1** (or whatever the original collection was called).

Here is what I got from 1000 measures:

None of our 1000 trials has six “losers.” That is, it’s *really* surprising that of the six numbers, none of them hit any of the 30 numbers we picked.

Your results will vary; when I expanded it to 10,000 cases, I got 20 sixes, for an empirical probability of 0.0020. This is not too far from the theoretical probability, which is (19/49)(18/48)(17/47)(16/46)(15/45)(14/44), which is about 0.0019.
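Both the simulation and the theoretical value are easy to check in Python. This sketch is not Fathom code and its names are my own. `random.sample` draws without replacement, like un-checking **with replacement**. The exact value is computed two ways: the product above, and the equivalent count C(19,6)/C(49,6).

```python
import random
from math import comb, prod

COVERED = 30   # our five tickets cover the numbers 1-30

def count_losers(rng):
    """The measure 'losers': how many of the 6 winning numbers are over 30."""
    winning = rng.sample(range(1, 50), 6)      # 6 of 1..49, no repeats
    return sum(n > COVERED for n in winning)

def empirical_all_miss(trials=10_000, seed=3):
    """Fraction of simulated drawings in which all six numbers miss us."""
    rng = random.Random(seed)
    return sum(count_losers(rng) == 6 for _ in range(trials)) / trials

# (19/49)(18/48)(17/47)(16/46)(15/45)(14/44)
exact_by_product = prod((19 - k) / (49 - k) for k in range(6))
# same thing: ways to choose 6 of the 19 uncovered numbers, over all draws
exact_by_counting = comb(19, 6) / comb(49, 6)

print("empirical:", empirical_all_miss())
print("exact    :", round(exact_by_product, 5))
```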

**Quick Pick Case**

(I’ll describe this more quickly even though it’s more complicated.)

If the player uses Quick Pick for their five tickets, we have to be sneakier. Instead of sampling without replacement to determine the *winning* lottery numbers, we reverse the process. We assume that the lottery winners are 1–6, and sample 6 numbers without replacement for one of *your* tickets. For that ticket, we count how many winners there are, and that’s a measure (e.g., **nWinners = count( pick < 7 )**). We collect five measures, one for each ticket.

Then we define a measure for the collection of measures (tickets), which might be the sum of the number of winners ( **bigSum = sum(nWinners)** ). Collecting a large number of those, you look to see how many times you got zero for **bigSum**.

The key here is to note that in the first, easier case, we have three collections, a source, a sample, and measures.

In the second case, there are *four*: source, sample, measures, and measures of measures.
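Here is a Python sketch of the Quick Pick chain, with one function per level (ticket, set of five tickets, many sets). It is an illustration only; the function names echo the measures above but are otherwise hypothetical. Since the five tickets are independent, the exact chance of no winners at all is (C(43,6)/C(49,6))^5, which the sketch computes for comparison.

```python
import random
from math import comb

def n_winners(rng):
    """One Quick Pick ticket: how many of its 6 numbers are winners (1-6)."""
    ticket = rng.sample(range(1, 50), 6)       # no repeats within a ticket
    return sum(n < 7 for n in ticket)          # nWinners = count( pick < 7 )

def big_sum(rng, tickets=5):
    """Five independent tickets: total winners across all of them."""
    return sum(n_winners(rng) for _ in range(tickets))

def empirical_no_winners(trials=10_000, seed=7):
    """Measures of measures: how often bigSum comes out zero."""
    rng = random.Random(seed)
    return sum(big_sum(rng) == 0 for _ in range(trials)) / trials

# One ticket has no winners when all 6 numbers come from the 43 non-winners
exact_no_winners = (comb(43, 6) / comb(49, 6)) ** 5

print("empirical:", empirical_no_winners())
print("exact    :", round(exact_no_winners, 4))
```

The exact value is well above the systematic-pick result, consistent with the earlier remark that overlapping tickets make "no winners anywhere" more likely.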

Suppose you have data on which you want to do a paired test. The prototypical paired test is a pre-post test of some kind. Let’s call these scores **pre** and **post**.

The basic idea is this: compute the pre-post change and test whether the mean of this **change** is different from zero.

Assuming you’re testing whether there’s a change, here’s what you do step by step:

- Make a new attribute, perhaps called **change**. (An *attribute* in Fathom is probably known as a *variable* most places. It’s a column in the table.)
- Give it a formula: **post – pre**. (Right-click on the column heading **change** and choose **Edit Formula**. Enter the formula in the formula editor that appears.)
- Create a new test by dragging it from the shelf. Change its menu to **Test Mean**.
- Drag **change** to the space at the top. Done!

Of course, if you presumed ahead of time that the scores should go up, you do a one-tailed test.

- Click the phrase “the mean of change **is not equal to** 0”; from the menu, change it to “the mean of change **is greater than** 0.”

If for some reason you want to test that the mean of change is some number other than zero, just edit the number “0” in the blue text.
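To see the arithmetic behind the test, here is a small Python sketch that computes the usual one-sample t statistic for the mean of **change**. I am assuming this is the statistic the test reports. The scores are made up purely for illustration, and `mu0` is the hypothesized mean (the editable “0”).

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(pre, post, mu0=0):
    """t statistic for testing whether the mean of (post - pre) equals mu0."""
    change = [b - a for a, b in zip(pre, post)]    # the 'change' attribute
    n = len(change)
    standard_error = stdev(change) / sqrt(n)       # sample SD over root n
    return (mean(change) - mu0) / standard_error, n - 1

# Hypothetical pre/post scores for eight students (not real data)
pre = [12, 15, 11, 18, 14, 16, 13, 17]
post = [14, 15, 13, 21, 15, 18, 12, 20]

t, df = paired_t(pre, post)
print("t =", round(t, 3), "with", df, "degrees of freedom")
```

For a one-tailed test the statistic is the same; only the tail used for the p-value changes.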

First, check out this post about Fathom’s test objects in general. The key to specifying the tailed-ness of a Fathom test is in the relevant blue text.

When you read the text, find the phrase that’s about whether it’s one or two tailed and click on it. It’s a menu! Choose the appropriate phrase from the menu and you’re done! The Aunt Belinda test is shown.

If you un-check **Verbose** (in the **Test** menu), it looks like this:

How do you simulate it in Fathom? The key is going to be in a function you might not be aware of: **uniqueValues( )**. We’ll make a collection to represent the people in the room, and make a measure that will tell us if the birthdays are all different. Then we’ll collect measures to find the probability.

- Let’s try *N* = 20. Make a new table with 20 cases, and one attribute, **birthday**.
- Give it this formula: **randomInteger( 1, 365 )**.
- Make a measure called **allDifferent**, with this formula: **count( ) = uniqueValues( birthday )**.
- Don’t believe me? Rerandomize until **allDifferent** is **false**, then plot **birthday**. It won’t take long. You’ll see the duplicate.
- Onward: Collect 1000 measures and plot **allDifferent**. And/or make a summary table. The (empirical) probability will stare you in the face. See the picture below.
- Add a case at a time and re-collect (replacing the measures of course). You’ll see that the proportion drops below 0.5 at about N = 23.

**What Just Happened?**

First of all, **randomInteger( 1, 365 )** gives you, well, a random integer between 1 and 365. If I had used **randomInteger( 365 )** I could have gotten zeros. No need to use actual dates, by the way; all that matters is that there are 365 different values to choose from. And we are ignoring leap years.

Now let’s look at the measure, **allDifferent**.

Its formula is **count( ) = uniqueValues( birthday )**.

What does this do? Well: **count( )** is the number of cases in the collection (20). And **uniqueValues( birthday )** is the number of distinct values there are in the column. So if there’s a duplicate, the number of unique values will be less than 20. In that case, the expression ( **count( ) = uniqueValues( birthday )** ) will be **false**. But if there are no duplicates, the number of distinct values will be 20 as well, and **allDifferent** will be **true**.
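The same comparison is easy to sketch in Python, where `len(set(...))` stands in for **uniqueValues( )**. This is an illustration, not Fathom code, and the function names are my own. The sketch also computes the exact probability, (365/365)(364/365)…, so you can confirm that it crosses 0.5 between 22 and 23 people.

```python
import random

def all_different(n_people, rng):
    """One room: are all the birthdays distinct?"""
    birthdays = [rng.randint(1, 365) for _ in range(n_people)]
    return len(birthdays) == len(set(birthdays))   # count( ) = uniqueValues( birthday )

def proportion_all_different(n_people, trials=1000, seed=4):
    """Collect measures: fraction of rooms with no shared birthday."""
    rng = random.Random(seed)
    return sum(all_different(n_people, rng) for _ in range(trials)) / trials

def exact_all_different(n_people):
    """Exact probability, ignoring leap years as in the post."""
    p = 1.0
    for k in range(n_people):
        p *= (365 - k) / 365
    return p

for n in (20, 22, 23):
    print(n, proportion_all_different(n), round(exact_all_different(n), 4))
```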

Notice that the formula is not a usual numerical computation. It’s a *Boolean* expression, which means it can have one of two values: **true** or **false**. Special Fathom gotcha: if you put these in expressions, they *do not get quotes*! That’s because **true** and **false** are legit values, not strings. So I can write **if (allDifferent = false)** … without quotes. But when you write **count( coin = “heads” )** you *do* need the quotes. “Heads” is a string.

**Extension**

Use the techniques described in the random walk post to record the probability as a function of the number of people. This uses an additional measures collection. Sample result:

Special hint, building on the Boolean comment above: one elegant expression you can use in a measure in the measures collection is **proportion( allDifferent )**.

But sometimes you start with a one- or two-way table of counts, and you want the actual collection of data. Here’s one way to get it. It’s a workaround, but it’s reasonably quick:

1. Figure out how many *different* cases you need. It’s usually the number of cells in the table. For example, suppose you want to reproduce the data in the table shown. There are 12 different cases, just different numbers of each one.
2. Pick the first type of case. You’d make two columns, **Sex** and **Marital**, and one case: **Male**, **MarriedP**.
3. Make a summary table and put both attributes on it. You’ll only see one case, but that will change.
4. Select the single case and Copy it.
5. Paste it, repeatedly, until you have 10 of them.
6. Select all ten. Copy them.
7. Paste until you have 110. Of course, you’re watching the summary table from step 3 in order to tell if you’re done.
8. Make the next case: **Female**, **MarriedP**.
9. Repeat steps 4–7 until you have 114. (At the end, when you have 110, you’ll need to copy and paste 4 more.)
10. Do the same for the other types of case.

I recently used this to make a “population” from which we polled. Every student got an identical population of 10,000 voters, and did sampling to make polls of various numbers of voters. Note that Fathom won’t copy more than 5,000 cases at a time.
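If you are comfortable with a little scripting, another route is to expand the counts into rows outside Fathom and bring the result in as delimited text. The Python sketch below shows the idea. It uses only the two cell counts mentioned in the steps above (110 and 114); the other ten cells of the table are not given in the text, so they are left out rather than guessed. I am also assuming that your copy of Fathom will accept pasted or imported tab-delimited text.

```python
from collections import Counter

# Cell counts from the steps above; the remaining cells would be added
# here in the same way once you read them off the table.
counts = {
    ("Male", "MarriedP"): 110,
    ("Female", "MarriedP"): 114,
}

def expand(counts):
    """Turn a table of counts into one row per case."""
    return [cell for cell, n in counts.items() for _ in range(n)]

cases = expand(counts)

# Tab-delimited text with a header row: one line per case
lines = ["Sex\tMarital"] + [f"{sex}\t{marital}" for sex, marital in cases]
text = "\n".join(lines)

print(len(cases), "cases")
print(text.splitlines()[:3])
```

Tallying the rows again with `Counter(cases)` should give back the original counts, which plays the same role as watching the summary table in step 3.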

If you’re experienced with these things, you know that the (net) distance is proportional to the square root of the number of steps. Let’s do a simulation to show, empirically, that this is true. We’ll see that this requires not only measures, but *measures of measures*. (There’s another way that uses measures of measures of measures, but let’s not go there!)

Our underlying plan is to make a source collection that represents a single walk. It will have one case for each step. The steps will have values of +1 or –1 for forward or back; that way, your position at the end of the entire walk is simply the sum of these steps. We’ll collect measures to get 1000 end-positions for that length of walk, then collect measures from that to get typical distances for each length. This diagram gives you the overview:

The key Fathom-Jedi move you’ll see is figuring out how to get the size of the walk into the final collection. That’s using the measures **N** and **NN**. I have forgotten to do this many times, grasshopper; when you forget you’ll just have to go back and put them in.

- Make a collection with ten cases. (This is for a random walk with ten steps.) Make **step** random, either +1 or –1. One good formula is **randomPick( 1, –1 )**.
- Now make some measures:
  - One of them is **end**; make this the position at the end of the walk. That’s **sum( step )**.
  - Another is **N**, the total number of steps. Jedi step one. You’ll see why we need this. Formula: **count( )**.
- Collect 1000 measures. Plot **end**. Notice that you have two attributes in that measures collection, **N** and **end**. (Note: of course, set animation off. I set the number of measures to 1000 and set it to “Replace existing cases.”)

This is the distribution of where you wind up after ten steps. This is, of course, a binomial distribution. You could use **randomBinomial( )** to shortcut this, but simulating the individual steps helps students understand what’s going on.

So far, straightforward. Now for the new stuff.

- In the measures collection, make a measure, **spread**, which is a measure of spread such as the standard deviation of **end**: a good formula is **s( end )**. In my class, we used MAD, the mean absolute deviation, which is **mean( | end | )**.
- The Jedi move, part 2: Make another measure, **NN**, which should be the same as all those values of **N**. We need it so that the spread we get will “know” what the original number of steps was. A good formula: **mean( N )**.
- Collect measures. (Note: be sure to select **Collect Measures** and not **Collect More Measures**! The latter will just replace the thousand measures you got before.) Now you should have a collection bizarrely named **Measures from Measures from Collection1**.
- Set this new, third collection to collect *one* measure, no animation, do *not* replace existing cases.

Let’s think about this. The **spread**—whether you use standard deviation, MAD, or some other measure of spread such as IQR—is a way of describing how far **end** is from zero, which in turn is a typical value for how far the random walker is from where he or she started.

Now we’re ready to make random walks with more steps, and record how far from the origin we typically get.

- Increase the size of the original collection (the number of steps) to 20 (i.e., add 10).
- “**Collect More Measures**” in the newest, third collection. This records the spread for 1000 runs of a 20-step walk, along with **NN**, which is 20, the number of steps.
- Plot **spread** against **NN**.
- Continue to increase the size of the random walk by adding cases to the original collection, collecting measures, and looking at the graph. You’ll see a relationship between the number of steps (**NN**) and the “spread,” **spread**. Work your way up past N=1000. More if you can!

So the graph shows that the more steps you take (**NN**) the farther you’re likely to wind up from your start. But it’s not linear; there’s a kind of diminishing returns that takes effect. High-school students in precalculus generally recognize this as the “lazy parabola” or, if you’re lucky, the square root function.

Because we used standard deviation in this example, you don’t really need a coefficient. If you use MAD like me, you should use **K * sqrt( NN )** to model the function, where **K** is the name of a slider which you use to adjust that coefficient and make the function fit. The illustration shows how the coefficient from our data is very close to 1.00.
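The whole measures-of-measures chain can be sketched in a few lines of Python, which is a handy way to check the square-root relationship before building it in Fathom. This is an illustration, not Fathom code. The function names are my own, and I use the sample standard deviation, which I assume is what **s( )** computes.

```python
import random
from math import sqrt
from statistics import stdev

def end_position(n_steps, rng):
    """One walk: the measure 'end', the sum of n_steps steps of +1 or -1."""
    return sum(rng.choice((1, -1)) for _ in range(n_steps))

def spread(n_steps, walks=1000, seed=5):
    """Measure of measures: SD of 'end' over many walks of the same length."""
    rng = random.Random(seed)
    ends = [end_position(n_steps, rng) for _ in range(walks)]
    return stdev(ends)

# Each row plays the role of one case in the third collection: NN and spread
for nn in (10, 20, 50, 100, 200):
    print(nn, round(spread(nn), 2), round(sqrt(nn), 2))
```

The last two columns should track each other closely, which is the coefficient-near-1 result described above for the standard deviation.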