
      Estimating the reproducibility of psychological science

      Open Science Collaboration
      Science
      American Association for the Advancement of Science (AAAS)

          Empirically analyzing empirical evidence

          One of the central goals in any scientific endeavor is to understand causality. Experiments that seek to demonstrate a cause/effect relation most often manipulate the postulated causal factor. Aarts et al. describe the replication of 100 experiments reported in papers published in 2008 in three high-ranking psychology journals. Assessing whether the replication and the original experiment yielded the same result according to several criteria, they find that about one-third to one-half of the original findings were also observed in the replication study.

          Science, this issue, 10.1126/science.aac4716

          Abstract

          A large-scale assessment suggests that experimental reproducibility in psychology leaves a lot to be desired.

          INTRODUCTION

          Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. Scientific claims should not gain credence because of the status or authority of their originator but by the replicability of their supporting evidence. Even research of exemplary quality may have irreproducible empirical findings because of random or systematic error.

          RATIONALE

          There is concern about the rate and predictors of reproducibility, but limited evidence. Potentially problematic practices include selective reporting, selective analysis, and insufficient specification of the conditions necessary or sufficient to obtain the results. Direct replication is the attempt to recreate the conditions believed sufficient for obtaining a previously observed finding and is the means of establishing reproducibility of a finding with new data. We conducted a large-scale, collaborative effort to obtain an initial estimate of the reproducibility of psychological science.

          RESULTS

          We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. There is no single standard for evaluating replication success. Here, we evaluated reproducibility using significance and P values, effect sizes, subjective assessments of replication teams, and meta-analysis of effect sizes. The mean effect size (r) of the replication effects (Mr = 0.197, SD = 0.257) was half the magnitude of the mean effect size of the original effects (Mr = 0.403, SD = 0.188), representing a substantial decline. Ninety-seven percent of original studies had significant results (P < .05). Thirty-six percent of replications had significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
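One of the criteria above, whether the original effect size falls inside the 95% confidence interval of the replication effect, can be made concrete for correlation coefficients. The sketch below is illustrative only: the r values and sample size are hypothetical (chosen to roughly mirror the reported averages), and it uses the standard Fisher z approximation for the interval, not the project's own analysis code.

```python
import math

def replication_ci_covers_original(r_orig, r_rep, n_rep, z_crit=1.959964):
    """One replication criterion: does the 95% CI around the replication
    correlation contain the original correlation?  Uses the Fisher z
    transform with standard error 1 / sqrt(n - 3)."""
    z_rep = math.atanh(r_rep)                  # Fisher z of replication effect
    se = 1.0 / math.sqrt(n_rep - 3)
    lo, hi = z_rep - z_crit * se, z_rep + z_crit * se
    # Transform the interval back to the correlation scale and test coverage.
    return math.tanh(lo) <= r_orig <= math.tanh(hi)

# Hypothetical values, roughly mirroring the reported averages
# (original r near 0.40, replication r near 0.20):
print(replication_ci_covers_original(r_orig=0.40, r_rep=0.20, n_rep=120))
```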

          CONCLUSION

          No single indicator sufficiently describes replication success, and the five indicators examined here are not the only ways to evaluate reproducibility. Nonetheless, collectively these results offer a clear conclusion: A large portion of replications produced weaker evidence for the original findings despite using materials provided by the original authors, review in advance for methodological fidelity, and high statistical power to detect the original effect sizes. Moreover, correlational evidence is consistent with the conclusion that variation in the strength of initial evidence (such as original P value) was more predictive of replication success than variation in the characteristics of the teams conducting the research (such as experience and expertise). The latter factors certainly can influence replication success, but they did not appear to do so here.

          Reproducibility is not well understood because the incentives for individual scientists prioritize novelty over replication. Innovation is the engine of discovery and is vital for a productive, effective scientific enterprise. However, innovative ideas become old news fast. Journal reviewers and editors may dismiss a new test of a published idea as unoriginal. The claim that “we already know this” belies the uncertainty of scientific evidence. Innovation points out paths that are possible; replication points out paths that are likely; progress relies on both. Replication can increase certainty when findings are reproduced and promote innovation when they are not. This project provides accumulating evidence for many findings in psychological research and suggests that there is still more work to do to verify whether we know what we think we know.

          Abstract

          Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. Replication effects were half the magnitude of original effects, representing a substantial decline. Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
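The "combining original and replication results" criterion can be illustrated with a simple fixed-effect pooling of two correlations on the Fisher z scale. The sketch below is generic and uses made-up (r, n) pairs; it is not the project's meta-analytic code.

```python
import math
from statistics import NormalDist

def pooled_correlation(results):
    """Fixed-effect combination of (r, n) pairs on the Fisher z scale.
    Returns the pooled correlation and a two-sided p value against r = 0."""
    weights = [n - 3 for _, n in results]            # inverse-variance weights
    zs = [math.atanh(r) for r, _ in results]
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    se = 1.0 / math.sqrt(sum(weights))
    p = 2 * (1 - NormalDist().cdf(abs(z_bar / se)))
    return math.tanh(z_bar), p

# Hypothetical original and replication results as (correlation, sample size):
r_pooled, p = pooled_correlation([(0.40, 80), (0.20, 120)])
print(f"pooled r = {r_pooled:.3f}, p = {p:.2g}")
```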


          Most cited references (38)


          Conducting Meta-Analyses in R with the metafor Package


            Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs

              Effect sizes are the most important outcome of empirical studies. Most articles on effect sizes highlight their importance for communicating the practical significance of results. For scientists themselves, effect sizes are most useful because they facilitate cumulative science. Effect sizes can be used to determine the sample size for follow-up studies, or to examine effects across studies. This article aims to provide a practical primer on how to calculate and report effect sizes for t-tests and ANOVAs such that effect sizes can be used in a priori power analyses and meta-analyses. Whereas many articles about effect sizes focus on between-subjects designs and address within-subjects designs only briefly, I provide a detailed overview of the similarities and differences between within- and between-subjects designs. I suggest that some research questions in experimental psychology examine inherently intra-individual effects, which makes effect sizes that incorporate the correlation between measures the best summary of the results. Finally, a supplementary spreadsheet is provided to make it as easy as possible for researchers to incorporate effect size calculations into their workflow.
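As a concrete illustration of the kind of calculation such a primer covers, the snippet below converts an independent-samples t statistic into Cohen's d and into a correlation-type effect size r. The numbers are made up, and the formulas are the standard textbook conversions rather than code from the article or its supplementary spreadsheet.

```python
import math

def effect_sizes_from_t(t, n1, n2):
    """Convert an independent-samples t statistic into Cohen's d and r,
    using the standard conversions d = t * sqrt(1/n1 + 1/n2) and
    r = t / sqrt(t**2 + df) with df = n1 + n2 - 2."""
    df = n1 + n2 - 2
    d = t * math.sqrt(1 / n1 + 1 / n2)
    r = t / math.sqrt(t ** 2 + df)
    return d, r

# Made-up example: t(58) = 2.10 from two independent groups of 30 each.
d, r = effect_sizes_from_t(t=2.10, n1=30, n2=30)
print(f"Cohen's d = {d:.2f}, r = {r:.2f}")
```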

              Power failure: why small sample size undermines the reliability of neuroscience.

              A study with low statistical power has a reduced chance of detecting a true effect, but it is less well appreciated that low power also reduces the likelihood that a statistically significant result reflects a true effect. Here, we show that the average statistical power of studies in the neurosciences is very low. The consequences of this include overestimates of effect size and low reproducibility of results. There are also ethical dimensions to this problem, as unreliable research is inefficient and wasteful. Improving reproducibility in neuroscience is a key priority and requires attention to well-established but often ignored methodological principles.
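The two consequences described here, missed true effects and inflated effect sizes among the findings that do reach significance, can be checked with a small simulation. The sketch below uses hypothetical parameters (true standardized effect d = 0.3, 20 participants per group) and plain NumPy/SciPy; it estimates the power of an independent-samples t-test and the average observed effect size conditional on p < .05.

```python
import numpy as np
from scipy import stats

def power_and_inflation(true_d=0.3, n_per_group=20, alpha=0.05,
                        n_sims=10_000, seed=0):
    """Estimate the power of an independent-samples t-test by simulation,
    and the mean observed Cohen's d among only the significant runs."""
    rng = np.random.default_rng(seed)
    significant_ds = []
    for _ in range(n_sims):
        a = rng.normal(true_d, 1.0, n_per_group)   # group with a true effect
        b = rng.normal(0.0, 1.0, n_per_group)      # control group
        t, p = stats.ttest_ind(a, b)
        if p < alpha:
            pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
            significant_ds.append((a.mean() - b.mean()) / pooled_sd)
    power = len(significant_ds) / n_sims
    return power, float(np.mean(significant_ds))

power, d_sig = power_and_inflation()
print(f"power = {power:.2f}; mean significant d = {d_sig:.2f} (true d = 0.3)")
```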

                Author and article information

                Journal: Science
                Publisher: American Association for the Advancement of Science (AAAS)
                ISSN: 0036-8075 (print); 1095-9203 (online)
                Publication date: August 28, 2015
                Volume: 349, Issue: 6251
                DOI: 10.1126/science.aac4716
                PMID: 26315443
                Copyright: © 2015
                License: http://www.sciencemag.org/about/science-licenses-journal-article-reuse
