# Are iPads making a significant difference? Findings from Auburn Maine.

Audrey Watters has an interesting article on early results from an assessment of iPads deployed in kindergartens in Auburn, ME. It’s a perfect place for me to get to one of the core purposes of this blog: to look at educational research results and critique them from the perspective of a fellow researcher. The goal is to help readers be more savvy consumers of educational research. My take is pretty different from Audrey’s (who I think is a brilliant ed tech journalist). I also want to start the post by applauding the team of researchers for tackling this important study, even though I disagree with their interpretation of the data.

### The Study

In the study, the Auburn district is in the midst of a multi-year, literacy intervention. Teachers are getting all kinds of training on early literacy stuff–helping kids learn to read. Then, this September, 8 of 16 classes are randomly assigned iPads (the intervention group), and the other 8 get them in December (the control group). Kids are tested in September and December (before the control group gets iPads), and the study measures average difference in score *gains* between the control and intervention groups.

### The Findings

First, Audrey starts with this encouraging sentence:

The school year is now almost halfway over, and the early results are in from the district’s implementation. The results are good too: iPads increased the kindergarteners’ literacy scores.

She derives that sentence from this figure and the resulting conclusion (from the press release for the study found here):

Comparing the OSELA gains from the iPad and comparison settings, gain scores were again consistently greater for the iPad students than were observed in the comparison settings. Most notably, students in the iPad setting exhibited a substantial increase in their performance on the Hearing and Recording Sounds in Words (HRSIW) subtest, which measures a child’s level of phonemic awareness and ability to represent sounds with letters. Subsequent statistical analyses showed that, after controlling for students’ incoming Fall 2011 scores, the impact of being in an iPad classroom had a statistically significant relationship with students post-HRSIW scores (t= 2.36, p.<.05). After controlling for other variables, the nine week impact was equivalent to a 2.1 point increase on the HRSIW, on average, for the iPad cohort.

I read these data quite differently. The results show that in 9 of 10 categories, **there is no difference between the two groups**. “But Justin,” you say, “The Red numbers are always larger than the Blue numbers.” That is true. But the point of statistical testing is to determine whether differences between two numbers are because of a relationship between two variables or because of random chance. When statistical testing shows that the difference between two numbers is not statistically significant, we should treat those two numbers as *not different*. In other words, it is reasonably likely that for 9 of the 10 variables measured, any differences that we observe are due to chance rather than due to the fact that the iPads made a difference. The standard reading, from a statistical perspective, is that iPads had no impact on 9 of the 10 variables measured.

(Could you make a special pleading for the iPads because all of the numbers were higher? Maybe. But if it is making a difference, it’s making a tiny, tiny difference. If I’m the superintendent in Auburn, it’s a reason to continue the pilot, not a reason to buy iPads for all the kids.)

Next, the researchers argue that the one variable with a statistically significant difference, Hearing and Recording Sounds in Words, shows a “substantial increase.” It’s very important to remember that “statistically significant” does NOT necessarily mean substantively significant. We should treat the iPads as having increased the HRSIW; we should believe that those two numbers are different (caveat to follow). But how much is a 2 point gain over a semester? The researchers give us no information that we can use to interpret that difference. How much do students typically gain over a year? What is the standard deviation of the HRSIW in comparable populations of kindergartners? What is the range of the scale? Without that information, we cannot independently assess whether a 2 point gain is stellar, or a relatively modest gain from an intervention that cost many thousands of dollars.

Now if you are a researcher, you can figure out some of these things. I Googled HRSIW and found this document. I haven’t really vetted it, but it’s on the site from an area support agency in the UK, so let’s pretend it’s true. It shows that the standard deviation for the HRSIW in 5 year olds is about 10 units on the test. So to get the effect size of the intervention, we can divide the difference in scores, about 2, by the standard deviation, about 10, to get an effect size of .2. Generically, we’d consider that effect size to be small, based on Cohen’s guidelines. A better choice would be to compare this intervention to other similar interventions on HRSIW, to see how this compares to things like lowering class size, specific kinds of training for teachers, better breakfasts in the morning, and so forth. Since the researchers don’t provide these, we’ll go with the generic guidelines, and I would interpret the gains as “modest” rather than “substantial.”
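For readers who want to see the arithmetic, here is a minimal sketch of that effect size calculation. It assumes the 2.1 point gain from the press release and the roughly 10 point standard deviation from the unvetted UK document; the function names are mine, and the “negligible” label for values under 0.2 is a common convention that I am adding, not part of Cohen’s original guidelines. A proper Cohen’s d would use the pooled standard deviation from the study itself, which the press release does not report.

```python
def standardized_difference(mean_difference: float, standard_deviation: float) -> float:
    """Effect size: the difference in means expressed in standard deviation units."""
    return mean_difference / standard_deviation


def cohen_label(effect_size: float) -> str:
    """Generic label using Cohen's rough cutoffs (0.2 small, 0.5 medium, 0.8 large)."""
    magnitude = abs(effect_size)
    if magnitude < 0.2:
        return "negligible"  # assumed label; Cohen's guidelines start at "small"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"


# Assumed inputs: 2.1 point gain (press release), SD of about 10 (external document).
effect = standardized_difference(2.1, 10.0)
print(f"effect size = {effect:.2f} ({cohen_label(effect)})")
```

If the true standard deviation in Auburn’s kindergartners is smaller than 10, the effect size would be larger, which is one more reason to want the full report.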

Moreover, we have to start considering a second problem. In t-testing, we assume that if there is less than a 1/20 chance that our results could have come from a population where there is no relationship between predictor and outcome (in this case, between having an iPad and improving a measure), then our results are “statistically significant.” This means that if we run lots of analyses, then we should expect that occasionally, maybe 1 in 20 times, we find a statistically significant result when there is no difference between two variables (a false positive). In other words, if you run 10 t-tests, and one shows up as significant (p=.02), then you have to wonder a bit if chance has thrown you a statistically significant result from a population with no difference between the control and intervention. (This problem is known as Type I error.)
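To put a rough number on that worry, here is a small sketch of the chance of at least one false positive across 10 tests. It assumes the 10 tests are independent, which is almost certainly not true for subtests given to the same children, so treat the result as an illustration rather than a property of this study. The Bonferroni threshold shown is one common, conservative correction; I have no evidence the researchers applied it or any other correction.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests when no effect exists."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold that keeps the family-wise rate at or below alpha."""
    return alpha / num_tests


fwer = familywise_error_rate(10)
threshold = bonferroni_threshold(10)

print(f"chance of at least one false positive in 10 tests: {fwer:.2f}")
print(f"Bonferroni per-test threshold: {threshold:.3f}")
print(f"would p = .02 survive the correction? {0.02 < threshold}")
```

Under these assumptions, a single result near p = .02 out of 10 tests is not very surprising even if the iPads did nothing.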

So my headline isn’t “iPads increased scores substantially.” It’s **“iPads modestly increased Kindergarten literacy scores in 1 of 10 measures tested.”**

### Other Articles

So I wrote this post only reading Audrey’s article, but now I’ve found two other articles from the Twitter feed of Mike Muir. The articles draw two different conclusions, and I think one gets it right. An article in THE Journal has the headline: “Kindergarten iPad Initiative Reveals Modest Literacy Gains” and reports:

The performance gains were admittedly modest, but 129 of the iPad students showed improvement on the Hearing and Recording Sounds in Words (HRSIW) assessment, which measures a child’s level of phonemic awareness and ability to represent sounds with letters.

That sentence is mostly right, but it’s not that 129 students showed improvement, it’s that as a group, the average scores of those 129 students showed improvement. Still, that’s basically the right take.

The Loop, an online journal I have not heard of, has the headline: “iPad improves Kindergartners literacy scores”

According to the literacy test results classes using the iPads outperformed the non-iPad students in every literacy measure they were tested on.

This isn’t a sound interpretation. The classes using the iPads had scores not significantly different from the non-iPad students. If you do claim that there are differences, then you have to note that most differences are tiny.

I also found the blog of Mike Muir, one of the authors of the study. His interpretation is that the results confirm that “iPads Extend a Teacher’s Impact on Kindergarten Literacy.” I don’t think that I would use the word confirm. I think I’d say that they are suggestive that iPads may have a modest impact. I don’t think one small experiment, even a randomized control trial, should rise to the level of “confirmation,” especially when results are so modest. I definitely applaud Mike for doing the work he is doing and for using the most robust design he can, but I don’t think the data support his interpretation.

### The Design

Let me also raise a second quibble with the analysis by Audrey and the researchers.

Audrey, and the researchers she interviews, take care to note that it’s not just the devices that are responsible for any resultant score gains. The iPads are part of a larger package of reforms. Quoting Audrey:

Was this a result of the device? The apps? The instruction? The professional development? Muir, Bebell, and fellow researcher Sue Dorris (a principal at one of the elementary schools) wouldn’t say.

I will say. It’s true that the whole reform package is responsible for score gains in the 16 classrooms. But, if the study was designed correctly, any score gain *differences* between the control and intervention group are **entirely the results of the iPads**. It sounds from the article that in the randomized trial, the only intervention was the timing of the arrival of the device. Therefore, if the randomization was done correctly (and we have no evidence one way or the other from the press release from the researchers) any score difference should be attributed **entirely** to the iPads. That’s the whole point of a randomized control trial: keep everything as identical as possible between the control and intervention conditions, except the intervention itself.

I totally agree with the overall point that it’s important to remember that iPads don’t just appear in classrooms. But the point of random trials is to test specific differences. The specific difference here is the timing of the iPads’ arrival, and we should be able to credit any differences not to the years of reform beforehand (which both groups enjoyed) but to the one thing that makes the control and intervention groups different: the timing of the arrival of the iPad.

### My Conclusion

I applaud the Auburn researchers for tackling this study, and I applaud Audrey for trying to tease out the meaning of these results. I think we have very much to learn from the statistical testing tools that methodologists have developed over the last century, and I think applying these tools can be very powerful in testing the efficacy of new technologies. But reading statistical output is tricky business, and everyone who reads the reports of researchers should take care to evaluate how well a researcher’s numbers support their interpretations. In this case, I think the press release put out by the researchers overstates the impacts of iPads on these particular measures in this particular intervention. My take is that iPads didn’t make much difference here, but I’ll look forward to reading the entire research report when it is released.

## 22 Comments

## John norton

February 18, 2012

I am curious and a little concerned about (1) teacher learning curve in becoming skillful in application of iPads, and (2) whether what’s being measured is a good indicator of increased learning. Looking for sig. change after 4 months (prob. closer to 3) using a new teaching device AND method? I wouldn’t put much stock in this…

## Justin Reich

February 18, 2012

Actually, these are not necessarily concerns that I share for this particular study, though I think in general these are good questions to raise about a study: what exactly is the intervention being measured? What is the quality of the outcome measures? Is the study long enough to measure meaningful change?

From my very brief review (I’m not an early literacy expert) I think these measures are legitimate measures of student achievement in important elements of learning for 5 year olds.

I also agree that teacher learning with iPads is important, but that’s not really what this study is designed to measure. Both control and intervention groups have received early literacy training, and one group had a set of iPads as well. This study tests whether those iPads made a difference, and it is designed well to make that evaluation.

Is the study too short– sure, but it’s almost impossible to convince districts to do these kinds of studies that require the inequitable distribution of resources. Getting 9 weeks of intervention is too short from a research perspective, but it’s pretty good given the political realities of schools.

If anything, I think some of the points you bring up lend credence to the authors’ claim that this study suggests that iPads support early literacy learning. If they can get some results with only 9 weeks and with teachers inexperienced with iPads, then maybe with more time and experience we’d see even better results. But, we shouldn’t assume that, we should study it.

## Charles Robertson

February 18, 2012

Nicely done critique, Justin. Lots of similarities between the problems we face in teaching and medicine.

## Justin Reich

February 18, 2012

Yep. Humans are not widgets. And medicine is a good place to find parallels. We definitely need experiments like this one in education, just like we need drug trials in medicine.

## Ellen Smyth

February 18, 2012

This is a well written article, and I do appreciate your insight. However, I disagree that no statistical significance equates to meaning that there is “no difference.” Rather, I would say that no statistical significance means there is not enough information in the data collected to tell if there is indeed a real difference in the overall population of interest. I wish I could have their numbers to compute a p-value for the overall results.

## Justin Reich

February 18, 2012

Thanks for your comment. One thing we agree on is that it would be great to have access to the data and the report from this research. In fact, without that, I’m not sure that we can responsibly draw anything but the most tentative of conclusions from the findings. I’d be interested to learn more about why the authors did a press release without a white paper.

We do disagree about how to interpret a t-test. If researchers are choosing to use a t-test, then they choose to test the null hypothesis that there is no relationship between two variables. When we say a relationship is not statistically significant, we are saying that we have failed to reject the null hypothesis, and therefore we conclude that there is no relationship between the two variables. In this case, we should conclude (in the absence of special pleading) that for nine of the ten variables studied, the intervention had no impact (made “no difference”) on the outcome.

Consider this thought experiment: what if they had studied a trillion students (presumably by researching all children in both the DC and Marvel multiverses)? If they reported non-significant findings, would you still argue that they lack data? Reporting the results of a significance test cannot tell you whether a study had sufficient statistical power.

It may be that future studies, with more samples, and more data, might change our understanding of these relationships, but examining the results of this study, we should assume that the differences in gain score means for 9 of the 10 variables could have reasonably been sampled from a population where there is no relationship between the intervention and the outcome measure.

(If you want to make a special pleading for the fact that in all ten cases, the scores were higher for the intervention group, I’m willing to listen to that, but the first thing I’m going to ask is for your effect sizes, which are going to be really small.)

There will be no p-value for the overall results, only ten separate p-values for 10 bivariate analyses. And we know roughly what they all are– 9 of the 10 variables have p greater than .05 (some will be pretty close to 1), and 1 will be around .02.
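As a rough check on that “around .02” figure, here is a sketch that converts the one reported test statistic (t = 2.36) into a two-sided p-value. The press release does not report degrees of freedom, so this uses the normal approximation, which is close to the t distribution only when the degrees of freedom are large. That is an assumption on my part; if the analysis was effectively done at the level of 16 classrooms, the true p-value would be somewhat larger.

```python
import math


def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    # For a standard normal Z, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Reported statistic from the press release, treated as approximately normal.
p_value = two_sided_p_from_z(2.36)
print(f"approximate two-sided p-value: {p_value:.3f}")
```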

## Ellen Smyth

February 19, 2012

Thanks for your detailed reply, Justin. I truly respect your credentials, your wit, and your expertise as an educational technologist. Let me briefly explain my credentials, which, while not as illustrious as yours, are pertinent to the topic at hand.

I have a bachelor’s degree from a reasonably respected engineering school in engineering and a master’s degree from that same school in mathematics. I only have three courses in statistics – one as an undergrad engineer and two graduate level experimental design courses. I have, however, been teaching undergraduate elementary statistics for a few years, and I have relied on the authority of the books to fill in gaps where I lack knowledge and expertise, asking my statistician colleagues for advice whenever something doesn’t quite make sense to me.

Both textbooks I have taught from, Neil Weiss’s and Agresti and Franklin’s, have said that we never accept the null hypothesis. While we can reject the null and prove that the alternative hypothesis is true, we can never really believe that the null is true unless we have all the data.

Here is why. It is often easy to prove that a value is greater than some value, less than some value, or not equal to some value. For example, if we randomly collected even just 30 people from our population and computed their average age to be 72.86 years with a standard deviation of 3.7 years, it would seem obvious to me that we could be reasonably certain (at least 95% or 99%) that the mean age of this population is more than 5 years old. However, we could conclude with no certainty whatsoever that the mean of this entire population was exactly 72.86 because if we collected even just one more random person, he or she would probably not be exactly 72.86 years old, thus throwing off our average.

Even if our sample size was 300 of a population of 600, the natural variability in our data values would tell us that we still cannot conclude the average will stay at exactly 72.86 when we survey the other half of the data. So, because the null hypothesis is always that the mean (or proportion or whatever we are measuring) is exactly equal to a particular value, we can never really prove it to be true without all the data – unlike what we can do for the alternative because we need only prove that the mean is greater than, less than, or not equal to some data value. The null hypothesis in question here is that the difference between the test scores from two populations is zero. While it may seem close and probably is close to zero, we cannot prove that it is exactly zero, which is what the null hypothesis says.

That is the explanation I have, but please don’t take my word for it. Here is one better resource, http://www.minitab.com/uploadedFiles/Shared_Resources/Documents/Articles/not_accepting_null_hypothesis.pdf. And then, if you just Google these three words, accept null hypothesis, you’ll hopefully find a consensus from respectable sites that agree we just don’t accept the null to be true.
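The age example in this comment can be sketched numerically. The snippet below computes a 95% confidence interval for the mean from the figures given above (n = 30, mean 72.86, standard deviation 3.7). The critical value 2.045 is the standard two-sided 95% t value for 29 degrees of freedom, hard-coded here so the sketch needs no external library; the function name is illustrative only.

```python
import math


def mean_confidence_interval(sample_mean, sample_sd, n, t_critical):
    """Confidence interval for a population mean, given a t critical value."""
    standard_error = sample_sd / math.sqrt(n)
    margin = t_critical * standard_error
    return sample_mean - margin, sample_mean + margin


# Two-sided 95% critical value of the t distribution with 29 degrees of freedom.
T_CRITICAL_95_DF29 = 2.045

low, high = mean_confidence_interval(72.86, 3.7, 30, T_CRITICAL_95_DF29)
print(f"95% CI for the mean age: {low:.2f} to {high:.2f}")
```

The interval sits far above 5, which is why “the mean is more than 5” is easy to support, while the interval’s width is why “the mean is exactly 72.86” is not.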

## Justin Reich

February 19, 2012

Hi Ellen,

This is a good conversation for this blog, as one of my goals is to bridge the worlds of researchers and practitioners, and to help translate between the two. I’m very familiar with the literature that you cite here, and I’ve thought a lot about how practitioners should move ahead in these circumstances.

You are correct that statisticians never truly accept the null to be true (note that in the original post I was careful about this–I said we should treat the pairs of gains scores as not different; I didn’t say they were truly not different or that we had proven anything), but you are wrong about the conclusions that we should draw from that.

There are nearly no circumstances in the social sciences (and most sciences) where we have all the data. The population of interest here is “Kindergarteners in Western education systems.” There are a lot of them this year, and there will be more in the years ahead. We never get to have all of the data, so practitioners have to work in a world where this is the case.

Actually the article you link to provides good guidance of what to do when encountering a non-significant finding, so I’ll quote from them: “[If the finding is non-significant] Fail to reject the null hypothesis (p-value > α) and conclude that there is not enough evidence to state that the alternative is true at the pre-determined confidence level of X%.”

If you conclude that you cannot support the alternative hypothesis–the alternative hypothesis being that the intervention had an effect– then what should a superintendent, a policymaker, a teacher, a person with real kids in front of them do? They should act as if the intervention had no effect– you should treat non-significant differences between treatment and control as 0. Are they truly zero? We don’t know. Could more data help? Maybe, but if the intervention doesn’t work, you can collect reams of data and it won’t change your findings.

There are exceptions and nuance to all of this. But in general, when encountering non-significant differences, practitioners should not parse the meaning of tiny differences, they should treat them as not different.

## Ellen Smyth

February 20, 2012

I would love to explore bridging these gaps between researchers and practitioners. That is a great goal. And I certainly agree that having all the data is impractical and never happens in these situations. It is still important to think about how the entire population could be so very different from our sample or experimental data, though, to keep in check the conclusions we draw.

I am in 100% agreement with your quote from the Minitab article, though we may be interpreting that quote differently. Just because there is not enough evidence to prove the alternative is true doesn’t mean we are any closer to believing the null is true. Rather, the null is often no more supported than it was before the study. If our p-value was 0.1 or 0.2, for instance, we could and probably would fail to reject, but these p-values actually do more to contradict the null than support it. They just don’t contradict it enough to reject the null because the burden of proof is on the alternative hypothesis.

Also, not having a statistically significant difference is not the same as not having a practically significant difference. In fact, in this diagram we see a 6.9% increase in test scores after using the iPad, and I would consider increasing test scores by 6.9% to be of practical significance, though perhaps not offsetting the cost of the devices. Of course, this difference could be zero for the whole population, but, more to my point, the difference could be even much larger. In truth, the gain for using iPads could be 10% to 20%, depending on the unknown deviation of the data.

We may be saying the same thing in the end. What I am saying is this: the study does little in the way to prove an increase, showing an increase in one part of the test but not proving an overall increase in the entire test or in other parts of the test. As such, we shouldn’t rely on this one study to encourage or discourage our thoughts on how iPads might influence how students learn.

My personal hypothesis is that gains or lack thereof will be much more about how the instructor uses the device than just the device itself. I am hoping for a grant for iPads (feel free to roll your eyes), and I doubt that I will have significant results in the first year. I will need time to explore, to readjust my curriculum, to build materials so I can effectively invert the classroom, to build clicker questions and engagement activities for these devices, to readjust my assignments so they incorporate the iPads, to find apps that will do awesomely cool statistical simulations, and to build the apps I need if I cannot find them. All of this stuff takes time to develop, time to adjust, and time to perfect. I hope that by the end of my second year I will really start to see increases, but two years may be overly optimistic. And the naysayers could be correct: I might never see a difference. Yet I am still willing to roll those dice.

## Justin Reich

February 20, 2012

Ellen, this has been a great back and forth, and I’m really grateful for your contributions. It is through these kinds of dialogues that we can get to the difficult questions of how to deal with research findings in a practical way.

I am, as blog owner, going to take the prerogative of having the last word. I respectfully, but strongly, disagree with most of your final comment. In a few places I’m going to use all CAPS, because I can’t use bold. I’m not shouting, I’m emphasizing. Let me refute a few points:

“I am in 100% agreement with your quote from the Minitab article, though we may be interpreting that quote differently. Just because there is not enough evidence to prove the alternative is true doesn’t mean we are any closer to believing the null is true.”

This is quite wrong. Every study with non-stat sig findings brings us closer to believing the null is true. Are you asserting that every time we run a drug test, and the drug has no effect on a human, we are no closer to knowing that the drug doesn’t work? That is not tenable. When we get non-stat sig findings on an intervention study, we should be led to believe that the intervention–as implemented–did not work and the null hypothesis is true. We should stay open minded to new studies, but any well done study, even one with non-sig findings, should bring us closer to conclusions about the efficacy of an intervention.

“If our p-value was 0.1 or 0.2, for instance, we could and probably would fail to reject, but these p-values actually do more to contradict the null than support it.”

I’m not sure what you are talking about here. There are no p-values reported anywhere in the press release. However, it’s obvious from looking at the differences that some of the p-values here, like for List and READ, are going to be well above .1. I just want to make clear to readers that neither of us has any access to actual p-values, except for one reported stat sig relationship (where we have a t-stat that we can convert).

“Also, not having a statistically significant difference is not the same as not having a practically significant difference.”

In almost all cases, that is exactly what this means. If there is no stat sig relationship between an intervention and an outcome THEN WE REJECT THE HYPOTHESIS THAT THE INTERVENTION WORKED. That is the fundamental premise of hypothesis testing and frequentist statistics. There are certain cases where special pleading may be relevant, but in general, what non-stat sig means is that, in the real world, in the population of interest, we have reasonable evidence that the intervention had no effect.

“In fact, in this diagram we see a 6.9% increase in test scores after using the iPad, and I would consider increasing test scores by 6.9% to be of practical significance, though perhaps not offsetting the cost of the devices.”

I’m not sure where 6.9% came from. But the whole point of t-testing is that the differences in outcome and predictor ARE REASONABLY LIKELY TO HAVE COME FROM A DISTRIBUTION WITH NO RELATIONSHIP BETWEEN OUTCOME AND PREDICTOR. Beyond that, we can be almost completely assured in this case that the very modest gains, which assuredly have tiny effect sizes, are very unlikely to be cost-effective.

“In truth, the gain for using iPads could be 10% to 20%, depending on the unknown deviation of the data.”

Sure. Future studies could give us more insight, which could go in either direction. But from this study, we should lean towards believing that the iPads, in the time frame with the given conditions, had very, very modest impact on the classrooms.

“We may be saying the same thing in the end. What I am saying is this: the study does little in the way to prove an increase, showing an increase in one part of the test but not proving an overall increase in the entire test or in other parts of the test.”

We agree.

“As such, we shouldn’t rely on this one study to encourage or discourage our thoughts on how iPads might influence how students learn.”

We do not agree. We have exactly one study done on iPads in K. If the full report suggests the study was well done, then we should rely on this study for guidance. And the findings suggest that iPads modestly improve the cognitive skills measured by HRSIW, and they do not improve the cognitive skills as measured by the other nine variables. We should also calculate the costs of the iPads, the effect sizes of the benefits, and we should see how this intervention compares to other interventions (which I predict won’t look great for the iPads in this study). If the study is well done, it absolutely should shape our understanding of K-2 or K-3 iPad use.

“I am hoping for a grant for iPads (feel free to roll your eyes),”

I won’t roll my eyes at all. It’s great that you are applying for iPads. I hope that when you use them, you can study their impact in robust ways. The findings from this study have absolutely no bearing on your situation. There is nothing at all that a public school, K, literacy intervention has to say about undergraduate statistics education. It’s like testing a drug on rats and then making claims about how a drug will affect gila monsters or velociraptors.

I whole-heartedly support your decision to test out iPads in a completely different circumstance, and I’d encourage you to partner with researchers and produce some studies that can be broadly useful.

If I have a horse in this race, it’s definitely for the iPads. I run a company, edtechteacher.org, and we provide some fantastic PD about iPads. Believe me, if this Auburn study was a slam dunk, I’d sing it to high heaven and put it in marketing materials. But it’s not. What we may have learned (when we read the study) is that early evidence suggests that iPads have a very modest impact on HRSIW, and only very, very tenuous evidence that they have a weak impact on other variables. There is certainly more to be learned, but in this study, it’s hard to support the conclusion that iPads broadly increase K literacy. There are some encouraging pieces in these findings, but anyone who runs a K-2 literacy program should think very carefully about these findings before investing in iPads–if your goal is to boost the cognitive skills measured by the outcomes in this study.

And we should keep trying new ways of integrating iPads, and we should keep exploring their potential.

## Mike

February 23, 2012

Just wanted to say thanks for the detailed critique of the Auburn study. Like you, I am hopeful that digital media may one day be effectively used in early childhood classrooms to support core learning areas, but I think we should be extra-diligent about what the research actually says. There are already a lot of people who see little-to-no-value, or even harm, in the use of new technologies with young children, and I think we should be careful of over-promoting research that gives an inaccurate view of how effective technology is for early literacy.

## Justin Reich

February 23, 2012

I should probably say that I critiqued the press release and a few subsequent press articles. At this time, to my knowledge, they have not actually released a study…

## Responding to a Critique of Auburn’s iPad Research Claims | Multiple Pathways

February 23, 2012

[…] last week, Audrey Watters was one of the first to cover it. Shortly thereafter, Justin Reich wrote a very thoughtful review of our research and response to Audrey’s blog post at his EdTechResearcher blog. Others, through comments made in post comments, blogs, emails, and […]

## Mike Muir

February 23, 2012

Justin, thanks for writing such a thoughtful critique of our research study (and I clearly need to catch up on reading the comment thread here!). Where your critique of our work has been thoughtful, I’m not sure that is the case with everyone’s reaction. This has prompted me to write another blog post. It does offer my reaction to some of what you say here (respectfully, as part of the professional dialog you have nurtured here), as well as my response/rant to some of what others have suggested.

http://multiplepathways.wordpress.com/2012/02/23/responding-to-a-critique-of-auburns-ipad-research-claims/

## Glen Gilchrist

February 27, 2012

Hi

Excellent article showcasing the power of blogs to inform and fuel debate. Thanks.

As a teacher who ultimately needs to act upon research as detailed in this article can I please make a plea to all journalists who report educational interventions – please please treat us as intelligent professionals. Daily we are hit with statements such as “boys underperformance compared to girls”, “the link between poverty and attainment” and the source of this article “iPads and attainment” — but nowhere in the literature do we see real statistics being reported – just analysis of means.

So, we are faced with implementing policy based on superficial analysis of data – for me, in my school – being left or right handed is statistically significant in determining attainment, but gender is not. BUT as the means of boys/girls are different we are compelled to “do something about it”.

Come on researchers, journalists and implementers – please publish the stats so we can all join the debate – don’t reduce important decisions to means of data sets.

Cheers

Glen Gilchrist

## Education & politics in Maine: is it the iPad? | Technology with Intention

February 27, 2012

[…] For the other 9 assessments completed in this 9-week study, there was no statistical significance reported. My understanding of “statistical significance” is that any results recorded are different enough as to suggest they weren’t caused by chance. Justin Reich reviews the study results and offers his perspective on the conclusions in his article, Are iPads making a significant difference? Findings from Auburn Maine. […]

## EdTechResearcher » Are iPads making a significant difference? Findings from Auburn Maine. « Bibliolearn

March 1, 2012

[…] tackling this important study, even though I disagree with their interpretation of the data.” Via http://www.edtechresearcher.com […]

## EdTechResearcher » What Should We Do with the Auburn Kindergarten iPad Findings?

March 2, 2012

[…] but significantly improved student learning as measured by a particular test. (Go back to the original post for more discussion of significant). In 9 of 10 tests, we have little confidence that the iPads […]

## M. Shane Tutwiler

June 26, 2012

All,

Thanks for this lively and thoughtful debate. I’m afraid I missed out on it out here in sunny Taipei, but I would like to add one final spin.

Justin did a great job of framing the critique in terms of Type I error. But I think it’s also important to view the study through the lens of Type II error. Specifically, because the team used t-tests (two-tailed, I’m assuming), assumed a Type I error threshold (alpha) of <.05, and the effect size was approximately 0.2 (on the one significant finding, as estimated by Justin), the study had very low statistical power.

In other words, there is a very high probability that they would have failed to detect any population-level relationship, no matter how carefully designed the study or assignment of students into treatment or control group. The primary reason for this is that the sample size is just too low. In this case, the statistical power is 0.37, assuming they used two-tailed t tests and the max effect size for any test is 0.2. That is, there is a 63% chance (1 − 0.37) that they will fail to detect significant differences from this sample, if they do exist. Had the researchers used a more sophisticated data-analytic method (e.g., multiple regression modeling), or had they been able to increase their sample size under the current analytic paradigm, these differences may have been significant. At this point, it is impossible to say. Does that mean that lack of significant findings can be interpreted as "we just don't know"? Not at all. We must believe our indications and fail to reject the null hypotheses in 9 out of 10 cases, in this case that there is no difference between the two groups in the population (again, assuming a two-tailed t-test).
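
For anyone who wants to sanity-check a power figure like this, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the function name `two_sample_power` is made up for this example, the per-group sample sizes in the loop are hypothetical (the study's actual sample size is not given in this thread), and it uses a normal approximation rather than the exact noncentral t distribution. It also treats students as independently assigned, whereas the study randomized classrooms, which would typically lower power further.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-tailed, two-sample comparison of means.

    Normal approximation: assumes equal group sizes and a standardized
    effect size (Cohen's d). Not the exact noncentral-t calculation.
    """
    z = NormalDist()  # standard normal
    z_crit = z.inv_cdf(1 - alpha / 2)  # two-tailed critical value
    shift = effect_size * sqrt(n_per_group / 2)  # expected test statistic
    # Probability the statistic falls in either rejection region.
    upper = 1 - z.cdf(z_crit - shift)
    lower = z.cdf(-z_crit - shift)
    return upper + lower


# Hypothetical per-group sample sizes, for illustration only.
for n in (50, 100, 200, 400):
    power = two_sample_power(0.2, n)
    print(f"n per group = {n:3d}  power = {power:.2f}  miss rate = {1 - power:.2f}")
```

The "miss rate" column is just one minus the power: the chance of failing to detect an effect of that size when it really exists. With an effect as small as 0.2, the sketch suggests that power climbs slowly as the groups grow, which is the point being made above about small samples.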

In all, it's great that the research is being done. But, like much education-technology based research, it suffers from a lack of statistical power.

For further reading on conducting power analyses pre and post hoc, see Methods Matter: Improving Causal Inference in Educational and Social Science Research by Murnane & Willett (2010).

## Education in the Developing World: “There’s an app for that” - The Dewey Digest

November 29, 2012

[…] said Mike Muir, who helped direct the program. While the results of the program showed that iPads only slightly increased learning, the conclusion of the study was that “iPads increased the kindergartners’ literacy scores.” […]

## Alarm over the school iPad bandwagon | MetaSoup

October 1, 2013

[…] [5] http://www.edtechresearcher.com/2012/02/are-ipads-making-a-significant-difference-findings-from-aubu… […]