We humans see streaks as indications of non-randomness, but streaks do occur in genuine random phenomena—usually more than we think. There are famous activities about this, like the one where you ask students to (secretly) create “realistic” sequences of 100 heads-or-tails coin flips along with genuine sequences from a fair coin. The wise, omniscient teacher then immediately distinguishes the fakes from the reals (or as we used to say, the pukka from the ersatz). How? The fair coin is streakier.
We can learn more about streakiness through simulation. You know how to make a sequence of fair coin flips in Fathom. But how do you calculate the lengths of the streaks? The key function: runLength( ).
- New document, new table, new attribute called coins.
- Give coins the formula randomPick( “H”, “T” )
- Second attribute. Call it run.
- Give run the formula runLength( coins )
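The results look strange at first: the numbers count up (1, 2, 3, ...) and then drop back to 1. For each case, run shows how many coins in a row, including the current one, have been the same since the last change. How shall we use this sequence of numbers?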
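If you'd like to see the same idea outside Fathom, here is a minimal Python sketch. It is not how Fathom implements runLength( ); the function name run_length and the variable names are mine, chosen to mirror the two attributes above.

```python
import random

def run_length(values):
    """For each position, count how many consecutive equal values end there."""
    runs = []
    for i, value in enumerate(values):
        if i > 0 and value == values[i - 1]:
            runs.append(runs[-1] + 1)  # same as the previous coin: the run grows
        else:
            runs.append(1)             # first coin, or a change: the run restarts
    return runs

# 500 fair coin flips, like the coins attribute
coins = [random.choice("HT") for _ in range(500)]
run = run_length(coins)

# The longest streak is simply the largest value in run
longest = max(run)
print(longest)
```

For a tiny example, run_length(list("HHTHHH")) gives [1, 2, 1, 1, 2, 3].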
The first thing we want to know is, what is the longest streak? That turns out to be easy: it’s the maximum value of run. You can find that in a dot plot or however you like.
But what we really want to know is this: if I got a longest streak of 11 in 500 flips (and I just did), is that unusual? For that we need to create a sampling distribution of the maximum run length. And for that we use measures.
- Using the collection inspector (double-click the collection) go to the Measures panel and make a <new> measure called longest.
- Give it this formula: max(run). The maximum length should appear in the box, as shown.
- Collect measures. At least 100. Make a graph of longest.
So here (below) are the maximum lengths from 100 simulations of 500 coin flips. As you can see, a streak of 11 is not unusual at all. In fact, having 6 or fewer as the maximum streak length is reason to doubt that the coins are fair and the sequence is random.
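The collect-measures step can be sketched in Python as well. This is an illustration under my own assumptions, not Fathom's machinery: the helper longest_run and the seed are hypothetical choices, and your counts will differ from the graph described above because the flips are random.

```python
import random
from collections import Counter

def longest_run(n_flips, rng):
    """Flip a fair coin n_flips times and return the longest streak length."""
    longest = 0
    current = 0
    previous = None
    for _ in range(n_flips):
        flip = rng.choice("HT")
        current = current + 1 if flip == previous else 1
        longest = max(longest, current)
        previous = flip
    return longest

rng = random.Random(2024)  # arbitrary seed, only so the run is repeatable

# 100 simulations of 500 flips each, like collecting 100 measures
longest_values = [longest_run(500, rng) for _ in range(100)]

# A crude text dot plot of the sampling distribution
distribution = Counter(longest_values)
for length in sorted(distribution):
    print(f"{length:2d} {'*' * distribution[length]}")
```

Increase the 100 if you want a smoother picture of where a streak of 11 falls.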
The Distribution of Streak Lengths in a Single Sequence
Using runLength( ) in that way will solve a lot of typical problems for you. But Jackie Pau from Jack Yates High School in Houston asked a more elaborate question. He called tech support at Key, and the question eventually found its way to me.
The question was, can you get the distribution of streak lengths? Of course you can! But it’s a little tricky [, grasshopper], so I’m glad you asked.
So put aside the measures and let’s go back to the two columns. We have a column of heads and tails, and a column of run lengths. We’d like to get the lengths of all of the streaks. So, looking at the run column, how do you know a new streak is starting?
Answer: a new streak starts wherever run is 1. So the number in run is the full length of a streak whenever the next number is 1.
So we want a new column (let’s call it streak) which has the length of the streak if it’s the last coin in the streak and nothing otherwise. That is, in 500 coins, we’ll have fewer than 500 streaks.
And here is the formula, which should be mostly self-explanatory. In words: if next( run ) is 1, the value is run; otherwise it is “”.
- See how next( ) works?
- The doubled double quotes are an empty string, which means the cell will be blank.
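Here is the same look-ahead logic as a Python sketch. The function name streak_column is hypothetical. One assumption to flag: I treat the very last case as the end of a streak, since there is no next value to inspect; I have not checked how Fathom's next( ) behaves on the final case.

```python
def streak_column(run):
    """Keep a run length only where its streak ends; leave other cells blank."""
    streak = []
    for i, length in enumerate(run):
        is_last_case = i == len(run) - 1
        # A streak ends here if the next run value restarts at 1.
        # Assumption: the final case also closes its streak.
        if is_last_case or run[i + 1] == 1:
            streak.append(length)
        else:
            streak.append("")  # empty string, so the cell is blank
    return streak

# Run lengths for the flips H H T H H H
run = [1, 2, 1, 1, 2, 3]
print(streak_column(run))
```

The six flips contain three streaks (HH, T, HHH), so only three cells are filled.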
Now we have a third column, about half of whose cells are empty. You can get the distribution by making a graph or a summary table—just be sure to hold shift when you drop streak into the summary table, as you want to treat it as categorical:
The illustration shows the distribution of streak lengths. In this case, the longest one was 9, but there was only one of them. The most common streak, by far, is 1—which may help us understand why we often underestimate the length of the longest streak.
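To tally the streak lengths outside Fathom, a short Python sketch follows. It swaps in itertools.groupby rather than the next( ) approach above, and the seed and variable names are my own choices; the counts it prints will not match the illustration.

```python
import random
from collections import Counter
from itertools import groupby

rng = random.Random(7)  # arbitrary seed, only so the run is repeatable
coins = [rng.choice("HT") for _ in range(500)]

# groupby collects consecutive equal flips; each group is one streak
streak_lengths = [len(list(group)) for _, group in groupby(coins)]

# Treat the length as categorical: count how many streaks have each length
counts = Counter(streak_lengths)
for length in sorted(counts):
    print(length, counts[length])
```

For a fair coin, each streak continues with probability 1/2, so (ignoring the streak cut off by the end of the sequence) about half of all streaks have length 1, a quarter have length 2, and so on.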