Examples are what stay with students. Are the examples memorable?
IN DRAFT
For decades, statisticians and statistics educators have been encouraging instructors to use “real data” in teaching. There are many examples available (see links below), but no clear statement of the good reasons why statistics can be taught more effectively when genuinely real data is the foundation.
For those unfamiliar with professional standards for statistics teaching, a good place to start is the 2016 GAISE report published by the American Statistical Association. The third of its six central recommendations is “Integrate real data with a context and a purpose.”
This recommendation is not new: you’ll find it in the 2005 GAISE report, which in turn refers to a 1992 report by the MAA/CUPM.1
GAISE 2016 tries hard not to be prescriptive about the topics or goals of introductory statistics, leading to enthusiastic but somewhat vague statements:
Using real data in context is crucial in teaching and learning statistics, both to give students experience with analyzing genuine data and to illustrate the usefulness and fascination of our discipline. Statistics can be thought of as the science of learning from data, so the context of the data becomes an integral part of the problem-solving experience. The introduction of a data set should include a context that explains how and why the data were produced or collected. Students should practice formulating good questions and answering them appropriately based on how the data were produced and analyzed.
The reader of the whole report will be rewarded with more specifics and concrete examples, but the above paragraph leaves us wondering what it means to “learn from data” or to “formulate good questions.” An instructor who thinks that calculating a t statistic is a way of “learning from data” will not find reason to challenge his or her practices. Similarly, for some instructors, a “good question” can be about the number of bins to use in a histogram. And who are these students for whom completing a statistics course will not just satisfy a requirement but create a “fascination with our discipline”?
The virtues of made-up data
Teaching mathematics traditionally involves asking students to solve made-up problems that illustrate or exercise a precise concept. Textbook problems involve matrices with integer components and polynomials with integer coefficients; the exercise should be about the method or concept rather than the arithmetic. In the light of this tradition, it’s reasonable to wonder why “real data” can introduce statistical concepts better than an exactly tailored, concise, easy-to-calculate constructed example.
Andrew Gelman and Deborah Nolan describe2 a wonderful unreal-data classroom activity for teaching about confidence intervals:
We ask the class how they might estimate the proportion of the earth covered by water. After several responses, we bring out an inflatable globe. … We explain that the globe will be tossed around the class, and we instruct the students to hit the globe with the tip of their index finger when it comes to them. When they do, they are to shout “water!” if their finger touches water, or “land!” if their finger touches land. After the class starts to tire of volleyball, we can use the results to calculate a confidence interval for the proportion.
What are the good statistical questions here? I submit that they are not any of the following:
- [bad question] What proportion of Earth is covered by water? Students can easily give that question to Google and get the US Geological Survey’s answer: about 71%.
- [bad question] Why is random sampling a good way to get an answer? It’s hard to imagine the USGS scientists having done this with the real Earth.
- [bad question] What’s the confidence interval on our estimate? The USGS answer isn’t stated in terms of a confidence interval and, for any reasonable duration of class interest, the interval your class will get is \(\pm 20\) percentage points. This sends the message to students, “Google is easy and precise, statistics is hard and imprecise.”
The good statistical question is this:
- How many times do we need to toss the globe to get a satisfactory answer? This question leads to others: What’s satisfactory? What’s the use for which we want to know the answer? How do we know the globe is accurate?
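To make the “how many tosses” question concrete, here is a minimal Python sketch based on the usual normal-approximation interval for a proportion. The function names and the 95% level are assumptions chosen for illustration; they are not part of the Gelman and Nolan activity. The 71% figure is the USGS value quoted above.

```python
import math

Z_95 = 1.96  # normal quantile for a 95% interval (assumed confidence level)


def margin_of_error(p_hat, n, z=Z_95):
    """Half-width of the normal-approximation interval for a proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


def tosses_needed(margin, p_guess=0.5, z=Z_95):
    """Smallest number of tosses whose margin of error is at most `margin`.

    p_guess=0.5 is the conservative (worst-case) guess for the proportion.
    """
    return math.ceil(z**2 * p_guess * (1 - p_guess) / margin**2)


if __name__ == "__main__":
    # Margin of error for a few class-sized (and not so class-sized) samples.
    for n in (10, 20, 100, 1000):
        print(n, round(margin_of_error(0.71, n), 3))
    # Tosses required for a margin of 5 percentage points.
    print(tosses_needed(0.05))
```

With 20 tosses and a proportion near 71%, the margin comes out close to 0.20, in line with the 20 percentage points mentioned above. A margin of 5 percentage points needs 385 tosses under the conservative guess of 0.5, which is one way to start the conversation about what counts as “satisfactory.”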
After a few semesters of using the Gelman/Nolan globe activity in my classes, I observed that students were consistently engaged (good!), got the idea that a small sample doesn’t give a very good answer (good!), and understood that it’s absolutely critical not just to have an answer (“Four out of ten tosses are water.”) but to know the precision of that answer (great!). Unfortunately, they also got the impression that generating a random sample is easy (wrong!) and that the way you find out whether a statistical method is working is to compare the statistical results to the “true” results (also wrong!).
Once more … with real data
I switched the activity to one using “real data.” In particular, the question I posed was, “What fraction of the US is within one mile of a road?”3 What makes this about real data? First, nobody knows the answer until you collect and examine the data. Second, generating a random sample isn’t trivial. You can’t answer the question by poking your finger randomly at a map of the US.4
In this problem, the statistics comes in authentically with the design of data collection. And the end result is much more satisfactory: 10 minutes of classroom time tells you something you didn’t know before. I won’t tell you how we did it: figuring it out is part of the problem. (You can read about it here. The ASA site has many other examples that can inspire you.)
Real data and math problems
Sometimes finding real data requires a reconsideration of how you think about a problem. For instance, anyone who has taught calculus knows the sort of problem that asks for the volume of a lake (one that just happens to be bounded by a surface of revolution!). The problem isn’t really about lakes; it’s about integration.
Reconsider the lake problem using real data. In particular, here’s the data I was given by an ecologist who wanted to know the volume in order to model fluctuations in the stream originating at the lake. (You can open the map in full-page format here.)
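If the map is read as a set of depth contours (an assumption, since the map is not reproduced in this text), the integral becomes a numerical one: the volume is the integral of A(z) over depth z, where A(z) is the area enclosed by the contour at depth z. A minimal Python sketch using the trapezoidal rule might look like the following. The depths and areas are made-up placeholders, not the ecologist's data; in practice they would be measured from the map.

```python
def lake_volume(depths_m, areas_m2):
    """Trapezoidal estimate of the integral of A(z) dz.

    depths_m: contour depths in metres, in increasing order (0 = surface).
    areas_m2: area enclosed by each contour, in square metres.
    """
    if len(depths_m) != len(areas_m2) or len(depths_m) < 2:
        raise ValueError("need matching lists with at least two contours")
    volume = 0.0
    for i in range(1, len(depths_m)):
        dz = depths_m[i] - depths_m[i - 1]
        # Average the areas of adjacent contours over the slab between them.
        volume += 0.5 * (areas_m2[i] + areas_m2[i - 1]) * dz
    return volume


if __name__ == "__main__":
    # Hypothetical placeholder values, for illustration only.
    depths = [0.0, 2.0, 4.0, 6.0]
    areas = [50_000.0, 30_000.0, 12_000.0, 0.0]
    print(lake_volume(depths, areas))  # volume in cubic metres
```

The interesting work with the real map is everything this sketch skips: measuring the contour areas, deciding how to treat the deepest point, and judging how much the answer depends on the spacing of the contours.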
Ask interesting questions
One good way to define “real data” involves the kind of questions the data can address. Those questions should be interesting and the answers should be unknown.
“Interest” is subjective. To paraphrase Lincoln, “You can interest some of the people all of the time, and all of the people some of the time, but not all of the people all of the time.” Finding interesting data involves a balance between what’s available and your students’ actual interests. Fortunately, there are lots of case studies in data science being published in the form of websites and blog entries, and more in data-science textbooks. Much of the data has already been curated and is easy to access.
As you get started working with real data, you’ll quickly discover a need for real skills for working with data.
Toward a definition of “real data”
The matter actually goes back much further
One can usefully follow the reasoning back to John Tukey’s famous 1962 paper, “The future of data analysis.”5
Links
2016 GAISE report: http://www.amstat.org/asa/files/pdfs/GAISE/GaiseCollege_Full.pdf
2005 GAISE report: http://www.amstat.org/asa/files/pdfs/GAISE/2005GaiseCollege_Full.pdf
1992 CUPM report: https://www.maa.org/sites/default/files/pdf/CUPM/pdf/CUPM_Report_1992.pdf
Principles
- instance before theory
- discover things about data: e.g. single-humped,
- students should see data in formats where it really appears. OpenData initiatives.
- instructors should approach the data set with their own questions.
- there should be the potential for students to suggest something compelling that would be new to the instructor.
1. Mathematical Association of America, Committee on the Undergraduate Program in Mathematics, link to report
2. Andrew Gelman and Deborah Nolan, Teaching Statistics: A Bag of Tricks, Oxford Univ. Press, 2002
3. Better, “I claim that only 25% of the US is within one mile of a road. Am I right?”
4. A map of sufficiently fine scale so that a finger covers less than a mile would be about 100 ft wide and very tall.