What Should We Do with the Auburn Kindergarten iPad Findings?
Mike Muir and I are having a productive, respectful back-and-forth, specifically about his research concerning iPads in kindergarten classrooms and more broadly about how practitioners should deal with educational research that uses statistical methods.
I’m going to start with this reminder (which Mike has echoed in his own way): I think it’s completely awesome that Mike and colleagues are running randomized control trials of iPads in kindergarten classrooms. We know almost nothing about how these tools might work with young kids, the teacher experiments going on are promising, and we need this kind of research to be done. Kudos to Mike and his colleagues for doing it.
In the end, I think Mike and I agree about how school districts should act upon Mike’s early findings (if they hold up; more on that in a minute). I think Mike and I come to similar conclusions through somewhat different routes. Mike argues that researchers and practitioners should analyze research findings differently. I disagree. They should analyze findings the same way, and they should act on them differently.
The story in a nutshell: Auburn, Maine has a multi-year early literacy initiative (Bravo! Nothing can be more important in schools). As part of that initiative, they introduced iPads, staggering the rollout so that half of the 16 kindergarten classrooms got them in September and half didn’t get them until December. Mike and colleagues measured students with ten tests of early literacy, and then compared the intervention group (the kids in iPad classrooms) with the control group (the kids who didn’t get the iPads early). Last week, they released a press release and a two-page research report with the following findings: In all 10 tests, the average scores of the students in the intervention classrooms exceeded the scores of the control classrooms. In 9 of those 10 tests, the differences between the two groups were not statistically significant. In 1 test, the difference was statistically significant.
So how should people read these findings? Here are my suggestions:
First, researchers, journalists, and practitioners should analyze the statistics in exactly the same way. Mike argues that “It is a major fallacy to think everyone should be a researcher, or think and analyze like one.” I won’t comment about what “everyone” should do, but I will say that anyone making judgments based on statistical findings should approach those findings using the same analytic strategies. Journalists and practitioners whose job involves interpreting statistical findings should learn the basics of how to make sense of tests of statistical significance, or should build partnerships with people who can help them. [Edit: Please read the comments, where Mike rephrases his point.]
That’s a major purpose of this blog: to help people learn how to read research findings in education technology. We simply need more literacy among educators about how to evaluate claims of “research-based” ideas. I’m not writing this blog from “a researcher’s perspective.” I’m writing from a perspective of someone whose job it is to help educators make real-world decisions about research (you can go here for a sense of the consulting that I do).
Whoever you are and whatever your background, when you read the research report, your conclusions should be pretty similar: In 1 of 10 tests, the iPads modestly but significantly improved student learning as measured by a particular test. (Go back to the original post for more discussion of significance.) In 9 of 10 tests, we have little confidence that the iPads improved students’ literacy scores, and even if you believe they did, the improvements are very, very small.
In regard to these nine, Mike says, “It is accurate to say we are unsure of the role chance played on those results.” But that’s not the way I would put it. The more important issue is: “It is accurate to say we are unsure of the role that the intervention (the iPads) played on those results.” Again, it doesn’t matter if you are a researcher or a practitioner: anyone analyzing those numbers should have little confidence that the iPad had an impact on those 9 measures. If you decide to set aside the results of the statistical testing and believe that the iPads had a positive impact on all 10 measures, then you should at least acknowledge that the effect sizes of the iPad intervention on those 9 measures were very, very small.
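To make the significance-versus-effect-size distinction concrete, here is a minimal sketch in Python. The scores are invented for illustration (they are not Auburn’s data); the point is that Cohen’s d quantifies how big a difference is, separately from whether a significance test says it is reliable.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Standardized mean difference (Cohen's d) between two groups."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    # Pool the two sample variances, weighted by degrees of freedom
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Invented scores on one hypothetical early-literacy measure
ipad_scores    = [14, 16, 15, 17, 13, 15, 16, 14]
control_scores = [14, 16, 14, 17, 13, 15, 15, 14]

d = cohens_d(ipad_scores, control_scores)
print(f"Cohen's d = {d:.2f}")  # about 0.19: small by conventional benchmarks
```

A common rule of thumb treats d around 0.2 as small, 0.5 as medium, and 0.8 as large. Note that a difference can fail a significance test and still have a meaningful effect size, or pass the test and be tiny; that is why a reader should look at both numbers.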
Now, what people do with those findings absolutely should be different. Educators constantly have to make decisions with imperfect information. Researchers get to throw up their hands and say “who knows what it means! Let’s go back to our lairs and devise more experiments.” Educators have to say “we may not be totally sure what the research means, but we have to make decisions regardless.” (Mike and I appear to be in complete agreement here.)
So you are a superintendent interested in iPads for your district… what should you actually do with these findings?
First, no school district outside of Auburn should make any decisions based on these findings until the full report is released and the data is offered to other researchers. I think it would be great for Mike and colleagues to release this report as a white paper before a peer-reviewed article: the need is great, and the information is much needed given the scale of iPad investment happening in schools. But until we have a robust report of the findings, people simply shouldn’t give them serious credibility. That is not at all a knock on Mike and colleagues, but as the recent fiasco at CERN demonstrates, initial findings can be as fleeting as an atomic particle. (Researchers reported the possibility of a faster-than-light particle, and then found that the discovery was due to loose cables rather than a fundamental misunderstanding of the laws of the universe.)
There are all kinds of potential “loose cables” in a research report like this; here are two I would be looking for:
- Failed randomization: not that they did anything wrong, but by chance, the control and intervention classrooms might have differed at baseline. If the baseline scores of the control group were lower than the intervention group’s, then the results might be better chalked up to catching up than to the iPads.
- Outliers (or, more precisely, high-leverage cases): the intervention-group differences in 9 of the 10 tests are so modest that one really successful classroom, or a few kids with big score gains, could be entirely responsible for the positive, non-significant findings.
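The high-leverage concern is easy to see with a small sketch. The classroom-level gains below are invented (again, not Auburn’s data); the point is that a single unusually successful classroom can account for an entire group-level advantage.

```python
# Hypothetical average score gains per classroom; the last iPad classroom
# is an invented high-leverage case.
ipad_gains    = [1.0, 0.5, 1.5, 0.8, 1.2, 0.6, 1.0, 9.0]
control_gains = [1.1, 0.9, 1.3, 0.7, 1.0, 0.8, 1.2, 1.0]

def mean(xs):
    return sum(xs) / len(xs)

diff_all = mean(ipad_gains) - mean(control_gains)
# Recompute with the one extreme classroom removed
diff_trimmed = mean(ipad_gains[:-1]) - mean(control_gains)

print(f"iPad advantage, all classrooms:  {diff_all:+.2f}")   # +0.95
print(f"iPad advantage, outlier removed: {diff_trimmed:+.2f}")  # -0.06
```

With the extreme classroom included, the iPad group looks nearly a point ahead; with it removed, the advantage disappears. That is exactly why a full report should include classroom-level results or leverage diagnostics, so readers can check whether the findings rest on one or two cases.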
In my estimation, that’s an appropriately cautious suggestion for education leaders. There are traces of good stuff here; we need a few pioneering districts, like Auburn, to keep following the trail.