Real data?

Examples are what stay with students. Are the examples memorable?

IN DRAFT

For decades, statisticians and statistics educators have been encouraging instructors to use “real data” in teaching. There are many examples available (see links below), but no clear statement of the good reasons why statistics can be taught more effectively when genuinely real data is the foundation.

For those unfamiliar with professional standards for statistics teaching, a good place to start is the 2016 GAISE report published by the American Statistical Association. The third of its six central recommendations is “Integrate real data with a context and a purpose.”

This recommendation is not new: you’ll find it in the 2005 GAISE report, which in turn refers to a 1992 MAA/CUPM¹ report.

GAISE 2016 tries hard not to be prescriptive about the topics or goals of introductory statistics, leading to enthusiastic but somewhat vague statements:

Using real data in context is crucial in teaching and learning statistics, both to give students experience with analyzing genuine data and to illustrate the usefulness and fascination of our discipline. Statistics can be thought of as the science of learning from data, so the context of the data becomes an integral part of the problem-solving experience. The introduction of a data set should include a context that explains how and why the data were produced or collected. Students should practice formulating good questions and answering them appropriately based on how the data were produced and analyzed.

The reader of the whole report will be rewarded with more specifics and concrete examples, but the above paragraph leaves us wondering what it means to “learn from data” or to “formulate good questions.” An instructor who thinks that calculating a t statistic is a way of “learning from data” will not find reason to challenge his or her practices. Similarly, for some instructors, a “good question” can be about the number of bins to use in a histogram. And who are these students for whom completing a statistics course will not just satisfy a requirement but create a “fascination with our discipline”?

The virtues of made-up data

Teaching mathematics traditionally involves asking students to solve made-up problems that illustrate or exercise a precise concept. Textbook problems involve matrices with integer components and polynomials with integer coefficients; the exercise should be about the method or concept rather than the arithmetic. In the light of this tradition, it’s reasonable to wonder why “real data” can introduce statistical concepts better than an exactly tailored, concise, easy-to-calculate constructed example.

Andrew Gelman and Deborah Nolan describe² a wonderful unreal-data classroom activity for teaching about confidence intervals:

We ask the class how they might estimate the proportion of the earth covered by water. After several responses, we bring out an inflatable globe. … We explain that the globe will be tossed around the class, and we instruct the students to hit the globe with the tip of their index finger when it comes to them. When they do, they are to shout “water!” if their finger touches water, or “land!” if their finger touches land. After the class starts to tire of volleyball, we can use the results to calculate a confidence interval for the proportion.

What are the good statistical questions here? I submit that they are not any of the following:

[bad question] What proportion of Earth is covered by water? Students can easily give that question to Google and get the US Geological Survey’s answer: about 71%.
[bad question] Why is random sampling a good way to get an answer? It’s hard to imagine the USGS scientists having done this with the real Earth.
[bad question] What’s the confidence interval on our estimate? The USGS answer isn’t stated in terms of a confidence interval and, for any reasonable duration of class interest, the interval your class will get is \(\pm 20\) percentage points. This sends the message to students, “Google is easy and precise, statistics is hard and imprecise.”

The good statistical question is this:

How many times do we need to toss the globe to get a satisfactory answer? This question leads to others: What’s satisfactory? What’s the use for which we want to know the answer? How do we know the globe is accurate?
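To make the sample-size question concrete, here is a minimal sketch that assumes the usual normal-approximation (Wald) interval for a proportion at 95% confidence. The function names, the 20-toss session, and the target margins are illustrative assumptions, not part of the original activity.

```python
import math

Z_95 = 1.96  # normal quantile for a 95% interval (assumed confidence level)

def margin_of_error(p, n, z=Z_95):
    """Half-width of the normal-approximation (Wald) interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

def tosses_needed(margin, p=0.5, z=Z_95):
    """Smallest n whose Wald margin is at most `margin`.

    p = 0.5 is the worst case, used when the proportion is not known in advance.
    """
    return math.ceil(z**2 * p * (1 - p) / margin**2)

# Hypothetical class session: about 20 tosses, proportion near 0.71.
print(round(margin_of_error(0.71, 20), 3))

# Tosses required for a few hypothetical notions of "satisfactory".
for target in (0.20, 0.10, 0.05):
    print(target, tosses_needed(target))
```

Under these assumptions, roughly 25 tosses give the \(\pm 20\) percentage points mentioned above, while \(\pm 5\) points takes about 385. Which margin counts as satisfactory depends on what the answer will be used for, and that is exactly the conversation the question is meant to start.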

After a few semesters of using the Gelman/Nolan globe activity in my classes, I observed that students were consistently engaged (good!), got the idea that a small sample doesn’t give a very good answer (good!) and that it’s absolutely critical not just to have an answer – “Four out of ten tosses are water.” – but to know the precision of that answer (great!). Unfortunately they also got the impression that generating a random sample is easy (wrong!) and that the way you find out whether a statistical method is working is to compare the statistical results to the “true” results.

Once more … with real data

I switched the activity to one using “real data.” In particular, the question I posed was, “What fraction of the US is within one mile of a road?”³ What makes this about real data? First, nobody knows the answer until you collect and examine the data. Second, generating a random sample isn’t trivial. You can’t answer the question by poking your finger randomly at a map of the US.⁴

In this problem, the statistics comes in authentically with the design of data collection. And the end result is much more satisfactory: 10 minutes of classroom time tells you something you didn’t know before. I won’t tell you how we did it: figuring it out is part of the problem. (You can read about it here. The ASA site has many other examples that can inspire you.)

Real data and math problems

Sometimes finding real data requires a reconsideration of how you think about a problem. For instance, anyone who has taught calculus knows the sort of problem that asks for the volume of a lake (one that just happens to be bounded by a surface of revolution!). The problem isn’t really about lakes; it’s about integration.
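As a reminder of what the textbook version looks like, here is a sketch with a hypothetical lake whose bottom is a paraboloid: depth \(d(r) = D\,(1 - r^2/R^2)\) at distance \(r\) from the center, with maximum depth \(D\) and shoreline radius \(R\). These symbols and this shape are assumptions for illustration only. Integrating over cylindrical shells gives

\[
V = \int_0^R 2\pi r\, d(r)\, dr = 2\pi D \int_0^R \left( r - \frac{r^3}{R^2} \right) dr = 2\pi D \left( \frac{R^2}{2} - \frac{R^2}{4} \right) = \frac{\pi R^2 D}{2}.
\]

The answer is tidy because the lake was invented to make the integral work out.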

Reconsider the lake problem using real data. In particular, here’s the data I was given by an ecologist who wanted to know the volume in order to model fluctuations in the stream originating at the lake. (You can open the map in full-page format here.)