Recall that, in playing and analyzing dice games appropriate for middle school students’ study of probability, I was challenging my secondary methods students (to whom I refer as “483 students” after the course number, not because of how many of them there are) to justify that there are 3 ways, not 2, to roll a 10 when rolling two six-sided dice. The idea is that my future teachers know that there are 3 equally likely possibilities: (1) a 6 and a 4, (2) a 4 and a 6, and (3) two 5’s. But lots of seventh grade students do not know this. Instead, many will view the first two as being the same.

I pushed hard on this point. My students suggested making a 6-by-6 chart, which is useful for some seventh graders. They suggested rolling one die at a time, or rolling two dice of different colors, or rolling one die twice. Each of these has the same theoretical probability as rolling two identical dice simultaneously. But not all seventh graders know this. I pushed on.

In particular, I was hoping to challenge my 483 students to wrestle with the complicated relationship between theoretical and experimental probability. Most of the time in middle school classrooms we study both of these but we dismiss discrepancies by waving our hands and saying *We don’t expect these to be exactly equal; we expect them to be close*, and therefore we shouldn’t worry about goofy experimental probabilities.

I was pressing my 483 students to consider whether experimental probabilities can ever provide convincing evidence that our theoretical model is incorrect. A recent article in *Mathematics Teaching in the Middle School* described a lesson in which seventh graders were asked to decide which dice were loaded and which were fair. I recall a lesson in my educational statistics class in which the professor opened a new deck of cards, shuffled several times and drew cards from the top of the deck. She was curious how many red cards in a row we would have to see before we suspected that something was up.

My challenge to my students was in a similar spirit, but I wanted to push them to design statistical tests that would demonstrate that “two ways to roll a 10” is a flawed model. This meant they needed to outline their procedures in full, state the data they would collect and (most importantly) say in advance which results would support their theoretical model. I added two constraints: the test could not take longer than 10 minutes to run, and they needed to be willing to stake their teaching licenses on the outcome. OK, I was flexible on that last constraint, but it helped lend seriousness to their thinking.

So here are two of the tests they devised:

(1) We will roll two dice 100 times. We will count the number of doubles and the number of non-doubles. If there are only two ways to roll 10, then there are 15 non-doubles and 6 doubles, so 15 of the 21 possible outcomes are non-doubles. If there are three ways to roll 10, then there are 30 non-doubles and 6 doubles, so 30 of the 36 possible outcomes are non-doubles. In 100 rolls with our theoretical model (three ways), we expect 83 non-doubles. With the competing model (two ways), we expect 71 non-doubles. We’ll split the difference. If there are 77 or more non-doubles in 100 rolls, then our model is correct.

(2) Keep rolling until you get ten 10’s. If there are only two ways to roll a 10, then 2 of the 21 possible outcomes are 10’s, and we should expect to have to roll 105 times to get ten 10’s. If there are three ways to roll a 10, then 3 of the 36 possible outcomes are 10’s, and we should expect to roll 120 times. Again, we can split the difference; if our test takes more than 112 rolls, this indicates that there are three ways to roll a 10.
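The expected values quoted in the two tests can be checked by listing the outcomes under each model. What follows is a minimal Python sketch, not anything the 483 students wrote; the helper names (`share`, `is_non_double`, `is_ten`) are made up for illustration, and the “order ignored” model simply assumes that all 21 unordered pairs are equally likely, which is the assumption the competing model makes.

```python
from fractions import Fraction
from itertools import combinations_with_replacement, product

faces = range(1, 7)

# "Order matters" model: 36 ordered pairs, all equally likely.
ordered = list(product(faces, repeat=2))
# "Order ignored" model: 21 unordered pairs, ASSUMED equally likely.
unordered = list(combinations_with_replacement(faces, 2))

def share(outcomes, predicate):
    """Fraction of the listed outcomes that satisfy the predicate."""
    hits = sum(1 for outcome in outcomes if predicate(outcome))
    return Fraction(hits, len(outcomes))

def is_non_double(outcome):
    return outcome[0] != outcome[1]

def is_ten(outcome):
    return outcome[0] + outcome[1] == 10

for name, outcomes in [("order matters", ordered), ("order ignored", unordered)]:
    p_non_double = share(outcomes, is_non_double)
    p_ten = share(outcomes, is_ten)
    print(name)
    # Test (1): expected non-doubles in a fixed 100 rolls.
    print("  expected non-doubles in 100 rolls:", float(100 * p_non_double))
    # Test (2): expected rolls needed to see ten 10's.
    print("  expected rolls for ten 10's:", float(10 / p_ten))
```

Rounded to whole numbers, this gives the students’ figures: 83 versus 71 non-doubles for test (1), and 120 versus 105 rolls for test (2).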

BEFORE READING FURTHER, jot down which of these two tests you think is better for demonstrating which model is correct (Hint: one of them is much better than the other).

Notice that test (1) relies on common denominators while test (2) relies on common numerators. That is, test (1) sets the total number of rolls and asks how many non-doubles we got, while test (2) sets the number of 10’s and asks how many total rolls we made.

Each of these tests confirmed the correct model in a single trial in class.

But probability isn’t about one-time outcomes. It is about long-term results. So it’s worth asking whether our results in class were typical. In other words, how likely were these tests to work?

I have lately become curious about the potential for the software Fathom to help students make these connections between experimental and theoretical probability. The software does lots of things well, but what makes it unique is its ability to do probability simulations (see my article in *Mathematics Teacher*).

We ran each test with dice in class a small number of times. In the time it took to run each test once, I set up a Fathom simulation, which can then be run many, many times. For the record, I think electronic simulations only make sense after collecting real-world data; otherwise they are too abstract for many students to learn from.

In 100 Fathom trials, test (2) “worked” only 51 times. That is, it took more than 112 rolls to get ten 10’s in only 51 of the 100 trials, so the test is no better than a coin flip. Increase the number of 10’s required to 20 and the test still succeeds only 64% of the time.
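Readers without Fathom can approximate this experiment in other tools. The Python sketch below is a stand-in, not the Fathom document used in class. It assumes two fair, independent dice, counts a trial as a success when it takes more than 112 rolls to see ten 10’s, and uses an arbitrary seed and trial count; the function names are hypothetical.

```python
import random

def rolls_until_tens(target_tens, rng):
    """Roll two fair dice until a sum of 10 has appeared target_tens times."""
    rolls = 0
    tens = 0
    while tens < target_tens:
        rolls += 1
        if rng.randint(1, 6) + rng.randint(1, 6) == 10:
            tens += 1
    return rolls

def success_rate(target_tens, cutoff, trials, rng):
    """Share of trials in which the roll count exceeds the cutoff."""
    wins = sum(rolls_until_tens(target_tens, rng) > cutoff for _ in range(trials))
    return wins / trials

rng = random.Random(483)  # arbitrary seed so a run can be repeated
rate = success_rate(target_tens=10, cutoff=112, trials=5_000, rng=rng)
print(f"test (2) favors the three-way model in {rate:.0%} of trials")
```

Exact percentages will vary with the seed and the number of trials, but with thousands of trials the rate should settle not far from one half, in line with the coin-flip behavior seen in the 100 Fathom trials.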

Test (1) is much better. In 100 Fathom trials, the test “worked” 95 times.
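The success rate of test (1) can also be estimated without simulation, because the number of non-doubles in 100 rolls follows a binomial distribution. The sketch below is an illustration under the assumption of fair, independent dice (so each roll is a non-double with probability 5/6); the function name `prob_at_least` is made up for this example.

```python
from math import comb

def prob_at_least(k_min, n, p):
    """P(X >= k_min) when X counts successes in n independent trials."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

p_non_double = 5 / 6  # 30 of the 36 ordered outcomes are non-doubles
p_works = prob_at_least(77, 100, p_non_double)
print(f"chance of 77 or more non-doubles in 100 rolls: {p_works:.3f}")
```

The result is a probability in the mid-90s percent range, which is consistent with the test succeeding 95 times in 100 Fathom trials.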

It turns out that devising a good experiment to determine which model is better (order matters vs. order does not matter) is hard. Therefore, we shouldn’t be surprised (1) that middle school students find it challenging to decide which model is correct, (2) that their own models, which are based on their informal observation of experimental probabilities in the world around them, get in the way of analyzing theoretical probabilities, or (3) that teaching probability is hard.