In 1998, the biotechnology company Amgen began an internal project whose results would not be published for fourteen years. The project's purpose was to replicate the preclinical findings -- the foundational laboratory discoveries made in academic and industry labs -- that had formed the basis of cancer drug development programs. Before investing hundreds of millions of dollars in clinical trials for new cancer drugs, the company wanted to know whether the science those trials were built on was solid.
In 2012, C. Glenn Begley, then Amgen's head of global cancer research, and Lee Ellis published the results in Nature. Of 53 landmark cancer biology studies they had attempted to replicate -- studies that had been published in the highest-impact journals, had generated hundreds of subsequent citations, and had directly influenced drug development decisions -- only 6 reproduced successfully.
Eleven percent.
The finding caused significant controversy in the scientific community, partly because Begley and Ellis did not name the 47 studies that failed to replicate. Scientists criticized this opacity. But the basic finding -- that the majority of even the most prominent preclinical cancer biology studies could not be reproduced -- was consistent with an accumulating body of evidence from across biomedical research.
That evidence now points to a conclusion that is both well-established and deeply uncomfortable: a substantial fraction of what is published in peer-reviewed scientific journals is wrong.
Key Definitions
P-value: The probability of observing a test statistic at least as extreme as the one obtained, given that the null hypothesis is true. Conventionally, p<0.05 is treated as the threshold for statistical significance.
Statistical significance: A result is statistically significant if the p-value falls below a pre-specified threshold, typically 0.05. Statistical significance does not indicate practical importance, truth, or reproducibility.
Statistical power: The probability that a study will detect a true effect of a given size, given that the effect exists. Low-powered studies frequently miss real effects (Type II error) and, when they do detect effects, systematically overestimate their size.
Positive Predictive Value (PPV): In research methodology, the probability that a statistically significant finding is actually true (i.e., corresponds to a real effect). PPV depends on prior probability, statistical power, and false positive rate.
Publication bias: The systematic tendency for studies with statistically significant or positive results to be more likely to be submitted and accepted for publication than studies with null or negative results.
File drawer problem: The phenomenon in which null results are never published and sit in researchers' "file drawers," creating a distorted literature in which only positive findings are visible.
P-hacking: Selectively reporting analyses, stopping or extending data collection based on interim results, or altering the research design after seeing preliminary data, in ways that increase the probability of obtaining p<0.05.
Researcher degrees of freedom: The multiple decision points in data collection and analysis at which researchers have legitimate choices (which variables to control for, which participants to exclude, when to stop collecting data) that individually seem reasonable but collectively inflate false positive rates.
Winner's curse: The tendency for initial studies to overestimate effect sizes because, with small samples, only unusually large observed effects reach statistical significance. Subsequent larger studies almost always find smaller effects.
Pre-registration: The practice of publicly registering study hypotheses and analysis plans before data collection, preventing post-hoc rationalization of observed results as confirmatory tests of predicted hypotheses.
Ioannidis 2005: The Mathematical Argument
John Ioannidis, then at Tufts University and now at Stanford, published "Why Most Published Research Findings Are False" in PLOS Medicine in 2005. The paper is the most downloaded article in the history of that journal and one of the most cited papers in all of medical research.
The argument is mathematical rather than empirical. Ioannidis used the framework of positive predictive value (PPV) to calculate, under different assumptions, what fraction of statistically significant findings in a scientific field would be expected to correspond to true effects.
The PPV of a research finding depends on three quantities:
R (the pre-study odds that the tested relationship is true): How plausible was the hypothesis before the study? In exploratory genetic association studies where thousands of variants are tested simultaneously, R is very low. In confirmatory clinical trials of drugs with strong preclinical rationale, R is higher.
(1 - beta): Statistical power. The probability of detecting a real effect if it exists. Many studies in psychology and biomedicine are powered to detect only large effects, and real effects are often moderate or small.
Alpha: The false positive rate. The conventional 0.05 threshold means that even with zero true effects, 5% of tests will return false positives.
Ioannidis showed that when the pre-study odds are low and statistical power is limited, the positive predictive value of a significant finding drops sharply: under assumptions realistic for exploratory research, most statistically significant results in a field are false positives.
The formula:
PPV = (R x power) / (R x power + alpha)
For a plausible hypothesis (R = 1:1), 80% power, and alpha = 0.05: PPV = (1 x 0.8) / (1 x 0.8 + 0.05) = 0.94
For an exploratory hypothesis (R = 1:10), 40% power, and alpha = 0.05: PPV = (0.1 x 0.4) / (0.1 x 0.4 + 0.05) = 0.44
For a highly exploratory hypothesis (R = 1:100), 20% power, and alpha = 0.05: PPV = (0.01 x 0.2) / (0.01 x 0.2 + 0.05) = 0.04
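These calculations are simple enough to check directly. The short Python sketch below transcribes the formula above, with the bias term omitted; the function name and scenario labels are illustrative, not code from the Ioannidis paper.

```python
# Direct transcription of the PPV formula above (bias term omitted).
def ppv(pre_study_odds: float, power: float, alpha: float = 0.05) -> float:
    """Probability that a statistically significant finding reflects a true effect."""
    return (pre_study_odds * power) / (pre_study_odds * power + alpha)

print(f"plausible, well powered (R=1:1, 80% power):    {ppv(1.0, 0.8):.2f}")   # 0.94
print(f"exploratory, underpowered (R=1:10, 40% power): {ppv(0.1, 0.4):.2f}")   # 0.44
print(f"highly exploratory (R=1:100, 20% power):       {ppv(0.01, 0.2):.2f}")  # 0.04
```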
Ioannidis then introduced a bias factor (u), representing the degree to which research practices (including publication bias, p-hacking, and conflicts of interest) inflate the apparent positive rate. When u is high, PPV decreases further.
"For many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias." -- John Ioannidis, PLOS Medicine, 2005
The argument is not that individual researchers are dishonest. It is that the structure of scientific incentives -- publish or perish, reward novelty, punish null results, require p<0.05 -- systematically produces a literature in which false positives are overrepresented.
P-Hacking: How Researcher Degrees of Freedom Inflate False Positives
Joseph Simmons, Leif Nelson, and Uri Simonsohn at the University of Pennsylvania published a 2011 paper in Psychological Science that made the problem of researcher degrees of freedom viscerally concrete. Their title was: "False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant."
To demonstrate the problem, they published a study -- and reported it honestly as a demonstration -- showing that listening to the Beatles song "When I'm Sixty-Four" made people statistically significantly younger in chronological age (computed from the birth dates participants reported), compared to a control group that listened to a different song. The effect was significant at p = 0.040, and it appeared in a peer-reviewed journal.
The finding was, of course, impossible. The researchers achieved it by exercising researcher degrees of freedom: including or excluding participants based on post-hoc criteria, choosing which variables to control for after seeing the data, and running the analysis multiple times at interim data points, stopping when significance was achieved.
Each individual decision was defensible. Excluding a participant who misunderstood instructions is reasonable. Controlling for relevant demographic variables is standard practice. Running a quick interim analysis is understandable when resource constraints are tight. But the cumulative effect of these decisions, when exercised after seeing the data and in the direction of achieving significance, is to transform an underpowered study testing a false hypothesis into a study reporting a "significant" finding.
Simmons, Nelson, and Simonsohn demonstrated by simulation that a researcher who uses just four common researcher degrees of freedom -- choosing between two related dependent variables, collecting additional participants when the initial result is not significant, controlling for a covariate (or its interaction with treatment), and reporting only a subset of experimental conditions -- inflates the Type I error rate from the nominal 5% to over 60%.
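The inflation is easy to reproduce by simulation. The sketch below is a minimal illustration, not the authors' code, and it exercises only two degrees of freedom (optional stopping and optionally "controlling for" an irrelevant covariate) under a true effect of exactly zero; the inflation it shows is therefore smaller than the 60% figure for all four, but still well above the nominal 5%.

```python
# Minimal simulation of two researcher degrees of freedom under a true null effect:
# (1) optionally "controlling for" an irrelevant covariate, and (2) optional
# stopping -- collecting more participants and retesting when the first look fails.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def significant(x, y, cx, cy, alpha=0.05):
    # Plain two-sample t-test ...
    if stats.ttest_ind(x, y).pvalue < alpha:
        return True
    # ... or the same comparison after crudely "controlling for" the covariate
    # by residualizing each group against it.
    rx = x - np.polyval(np.polyfit(cx, x, 1), cx)
    ry = y - np.polyval(np.polyfit(cy, y, 1), cy)
    return stats.ttest_ind(rx, ry).pvalue < alpha

def one_flexible_study(n_initial=20, n_extra=10):
    a, b = rng.normal(size=n_initial), rng.normal(size=n_initial)    # true effect = 0
    ca, cb = rng.normal(size=n_initial), rng.normal(size=n_initial)  # irrelevant covariate
    if significant(a, b, ca, cb):
        return True
    # Not significant yet: add participants and try again (optional stopping).
    a, b = np.append(a, rng.normal(size=n_extra)), np.append(b, rng.normal(size=n_extra))
    ca, cb = np.append(ca, rng.normal(size=n_extra)), np.append(cb, rng.normal(size=n_extra))
    return significant(a, b, ca, cb)

n_sims = 10_000
rate = sum(one_flexible_study() for _ in range(n_sims)) / n_sims
print(f"False positive rate with flexibility: {rate:.1%}")  # well above the nominal 5%
```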
In most published work, no fraud is committed. The researcher did not fabricate data. They made reasonable-seeming decisions at each decision point, and those decisions happened to accumulate in a direction that produced significance. The literature records the significant result and discards, in the researcher's file drawer, the twenty analyses that did not reach significance.
Publication Bias: The Invisible Graveyard of Null Results
The problem does not begin with researcher degrees of freedom. It begins with the structure of what gets published.
Robert Rosenthal described the "file drawer problem" in 1979 in Psychological Bulletin. The argument was simple. For every statistically significant published result, there may exist many unpublished null results sitting in researchers' file drawers. If the true effect is zero and 20 independent labs each test it with alpha = 0.05, we expect approximately one of them to observe p < 0.05 and publish. The other 19 find nothing, can't publish, and file their results away. The literature contains exactly one study on this effect, and it shows the effect "exists."
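The arithmetic behind that example is worth making explicit: with 20 independent tests of a true-null effect at alpha = 0.05, one false positive is expected, and the chance that at least one lab obtains a publishable p < 0.05 is roughly 64%.

```python
# Quick arithmetic behind the 20-lab file drawer example (true effect is zero).
alpha, n_labs = 0.05, 20
expected_false_positives = n_labs * alpha          # 1.0 significant result expected
p_at_least_one = 1 - (1 - alpha) ** n_labs         # ~0.64
print(expected_false_positives, round(p_at_least_one, 2))
```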
The funnel plot, a meta-analytic diagnostic tool, attempts to detect publication bias by plotting effect sizes against sample sizes. If publication is unbiased, small and large studies should scatter symmetrically around the true effect size. If small studies with null results are unpublished, the funnel plot shows asymmetry: small studies cluster at large, significant effect sizes, while larger studies cluster near zero. Funnel plot asymmetry has been documented across nutrition research, social psychology, and clinical medicine.
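A small simulation shows how the asymmetry arises. The sketch below is a toy model with arbitrary parameter choices: it simulates studies of a true-null effect at varying sample sizes, "publishes" every large study but only the small studies that are significant in the hypothesized direction, and then checks whether published effect sizes track their standard errors, which is the signature an Egger-style asymmetry test looks for.

```python
# Toy model of funnel-plot asymmetry produced by selective publication.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
published_d, published_se = [], []

for _ in range(2000):
    n = int(rng.integers(10, 200))                   # per-group sample size
    a, b = rng.normal(size=n), rng.normal(size=n)    # true effect is zero
    d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    p = stats.ttest_ind(b, a).pvalue
    # Publication filter: large studies always appear; small studies appear only
    # when significant in the expected (positive) direction.
    if n >= 100 or (p < 0.05 and d > 0):
        published_d.append(d)
        published_se.append(np.sqrt(2 / n))          # approximate SE of d

slope = stats.linregress(published_se, published_d).slope
print(f"{len(published_d)} published studies; SE-vs-effect slope: {slope:.2f}")
# A clearly positive slope is the asymmetry signature; unbiased publication gives ~0.
```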
Daniele Fanelli analyzed the growth of positive results in the scientific literature between 1990 and 2007. Across all fields, the proportion of papers reporting positive results increased from approximately 70% to 86%. An increase that large is implausible if the explanation is that the world simply became better at studying true effects; it is entirely consistent with increasing publication bias and selective reporting.
The Replication Crisis: Empirical Evidence
The mathematical argument of Ioannidis was theoretical, and the demonstration by Simmons and colleagues was a deliberately engineered example. The definitive empirical evidence came from large-scale systematic replication projects.
Open Science Collaboration 2015
The Open Science Collaboration, a consortium of 270 researchers coordinated by Brian Nosek at the University of Virginia, published results in Science in 2015 of an ambitious project: systematically replicating 100 published psychology studies. Studies were selected from three prominent psychology journals (Journal of Experimental Psychology: General, Journal of Personality and Social Psychology, Psychological Science), covering the calendar year 2008.
Results:
| Metric | Outcome |
|---|---|
| Original studies significant (p<0.05) | 97% |
| Replications significant (p<0.05) | 36% |
| Average replication effect size vs. original | ~50% of original |
| Subjective assessments: "closely replicated" | 39% |
The 36% replication rate was widely reported and remains the most-cited single statistic from the replication crisis. But the effect size finding is in some ways more informative: the average replication produced an effect roughly half the size of the original. This is precisely the pattern predicted by the winner's curse.
Cancer Biology Replication
The Begley/Ellis 2012 finding was followed by the more systematic Reproducibility Project: Cancer Biology, led by Timothy Errington at the Center for Open Science. This project selected 50 high-impact cancer biology studies published between 2010 and 2012 and attempted systematic replication in consultation with the original authors.
Results published in 2021 in eLife found that only 50% of the 112 individual experiments they attempted to replicate showed results in the same direction as the original. For experiments with quantitative effect size comparisons, replication effects were on average 85% smaller than original effects.
The finding was more nuanced than Begley and Ellis's earlier informal survey, but the basic message was consistent: canonical cancer biology findings replicated at rates far below what the field's confidence in its literature implied.
The PREDIMED Retraction
Sometimes the failure of research quality is not merely a matter of statistical practice but of fundamental methodological flaws. The PREDIMED trial (Prevención con Dieta Mediterránea) was published in the New England Journal of Medicine in 2013 and generated global headlines. The trial claimed to show that a Mediterranean diet supplemented with olive oil or nuts reduced major cardiovascular events by approximately 30% compared to a low-fat diet, in a primary prevention population.
The study was widely influential, cited in dietary guidelines and clinical practice recommendations.
In 2018, the original paper was retracted and a corrected version republished. The corrections revealed that randomization -- the core methodological protection that makes randomized controlled trials credible -- had been violated for a substantial number of participants: at some sites, entire households had been assigned to the same arm rather than randomized individually. In the corrected analyses, which excluded or adjusted for the improperly randomized participants, the headline effect estimates changed little, but the trial could no longer claim the full evidentiary strength of individual-level randomization.
The PREDIMED case illustrates a specific failure mode: a paper published in a top-tier journal, influencing clinical practice globally, built on a methodological flaw that peer review failed to detect for five years.
The Winner's Curse: Why First Studies Overestimate Effects
Andrew Gelman, a statistician at Columbia University, and John Carlin published a paper in Perspectives on Psychological Science in 2014 analyzing exaggerated effect estimates -- what they called Type M (magnitude) errors, and what is widely known as the winner's curse in scientific research.
The mechanism is straightforward. Suppose a true effect exists in a population but is small -- say, a standardized effect size of d = 0.2. A researcher conducts a study with 50 participants per group. This study has low statistical power: approximately 20% chance of detecting d = 0.2 at p<0.05. This means 80% of such studies will not find significance and will not be published.
The 20% of studies that do find significance are the ones where sampling variation happened to produce an effect estimate substantially larger than the true d = 0.2 -- perhaps d = 0.5 or d = 0.6. Only these inflated estimates cross the significance threshold. These are the studies that get published.
The first published result will therefore typically overestimate the true effect, often by a factor of two to four. This initial overestimate becomes the benchmark for subsequent research. When larger, better-powered studies are conducted and find smaller effects, they are interpreted as partial failures to replicate -- even though they are actually more accurate estimates of the true effect.
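The mechanism can be made concrete with a short simulation. The sketch below is an illustrative toy model, not code from Gelman and Carlin: it draws many studies with a true standardized effect of d = 0.2 and 50 participants per group, keeps only those that reach p < 0.05, and reports the average effect estimate among the "winners."

```python
# Toy simulation of the winner's curse: true effect d = 0.2, 50 per group. Only
# samples whose observed effect happens to be inflated clear the p < 0.05 bar, so
# the significant (publishable) studies systematically overestimate the effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n_per_group, alpha, n_sims = 0.2, 50, 0.05, 20_000

winning_estimates = []
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(true_d, 1.0, n_per_group)
    if stats.ttest_ind(treatment, control).pvalue < alpha:
        pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
        winning_estimates.append(abs(treatment.mean() - control.mean()) / pooled_sd)

print(f"Power: {len(winning_estimates) / n_sims:.0%}")                 # roughly 15-20%
print(f"Mean |d| among significant studies: {np.mean(winning_estimates):.2f}")
# The mean lands near 0.5 -- roughly two and a half times the true effect of 0.20.
```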
"The statistical significance filter... means that if you do manage to get a significant result with a small sample, you've almost certainly badly overestimated the effect size." -- Andrew Gelman, Statistical Modeling, Causal Inference, and Social Science, 2014
The winner's curse has direct implications for meta-analyses. A meta-analysis that combines published studies without accounting for publication bias will compute an average effect size inflated by the winner's curse. The celebrated findings of social psychology, many showing effect sizes of d = 0.5 to d = 0.8, are now believed to reflect winner's curse inflation. Meta-analyses correcting for publication bias using funnel plot methods often produce effect size estimates 50-80% smaller than the uncorrected literature average.
Which Fields Are Most Affected
The replication crisis is not uniformly distributed across science. Some fields are substantially more affected than others, for reasons that track the predictors Ioannidis identified: prior probability of hypotheses, sample sizes, degree of researcher flexibility, and strength of publication bias.
| Field | Replication Concern | Key Factors |
|---|---|---|
| Social psychology | High | Small samples, many researcher degrees of freedom, low prior odds that tested hypotheses are true |
| Nutrition epidemiology | High | Observational design, many possible confounders, high media interest driving bias |
| Preclinical cancer biology | High | Small animal samples, high mechanistic flexibility, commercial pressure |
| Clinical medicine (RCTs) | Moderate | Pre-registration improving; early non-replications still documented |
| Genomics/GWAS | Low | Stringent genome-wide significance thresholds, large samples, mandatory independent replication |
| Physics | Very low | Hard constraints, precise prediction, blind analysis standard |
| Chemistry | Low | Reproducibility of synthesis relatively verifiable; measurement precise |
Prasad and Cifu's "Ending Medical Reversal" (2015) analyzed established medical practices subsequently overturned when properly tested. They found that approximately 40% of established medical interventions they reviewed -- practices that had been in clinical use based on prior evidence -- were overturned or substantially revised when subjected to more rigorous testing. This included practices in cardiology, oncology, and preventive medicine.
What Good Science Looks Like
The replication crisis has produced a reform movement in science. The structural interventions with the most evidence behind them:
Pre-registration: Registering hypotheses, primary outcomes, sample sizes, and analysis plans in a public repository before data collection begins. Pre-registration prevents hypotheses from being revised after results are seen (HARKing: Hypothesizing After Results are Known). Hundreds of journals, beginning with Cortex, now offer registered report formats in which peer review occurs before data collection and publication is guaranteed regardless of results.
Open data: Making raw data publicly available allows independent verification and enables detection of errors. A 2017 audit of 260 published psychology papers with available data found that roughly half contained at least one reporting error, and approximately 13% contained an error large enough to potentially affect the conclusion.
Large, adequately powered studies: Underpowered studies are the single most tractable source of the winner's curse. Studies powered to detect the smallest effect of scientific or practical interest, with samples large enough to constrain effect size estimates, produce a more reliable literature (a rough power-calculation sketch follows these points).
Independent replication before clinical or policy application: The medical community has begun treating single trials, however impressive, as an insufficient basis for clinical recommendations. Systematic reviews and meta-analyses that combine multiple independent studies under pre-specified protocols have become standard in evidence-based medicine.
Registered reports: A publication format in which acceptance is based on the quality of the question, design, and analysis plan, before results are seen. The journal commits to publication regardless of outcome, directly addressing publication bias at its source.
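As an illustration of the "adequately powered" point above, the sketch below uses statsmodels' power routines to ask how many participants per group a two-sample comparison needs for 80% power to detect a small effect (d = 0.2) at alpha = 0.05, and how much power the 50-per-group design from the winner's curse example actually has. The specific effect size is illustrative, not a recommendation.

```python
# Rough a priori power calculation for a two-sample comparison (illustrative numbers).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_needed = analysis.solve_power(effect_size=0.2, power=0.8, alpha=0.05)
power_at_50 = analysis.power(effect_size=0.2, nobs1=50, alpha=0.05)

print(f"n per group for 80% power at d = 0.2: {n_needed:.0f}")     # roughly 394
print(f"power with 50 per group at d = 0.2:   {power_at_50:.0%}")  # roughly 17%
```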
Conclusion
The evidence that a substantial fraction of published research is false is now extensive and multi-source. Ioannidis's mathematical argument showed it was predictable. Simmons and colleagues demonstrated it was achievable by accident. The Open Science Collaboration measured it empirically at 36% replication in psychology. Begley and Ellis documented it at 11% in landmark cancer biology. Fanelli documented the suspicious increase in positive results across all fields. And Prasad and Cifu found that 40% of established medical practices reversed when properly tested.
The message is not that science is broken. It is that science's self-correction mechanism -- replication and scrutiny over time -- is real but slow, and that the interim literature is less reliable than its authoritative presentation typically indicates.
The reforms already underway are substantive. Pre-registration rates in clinical trials increased dramatically after the International Committee of Medical Journal Editors made registration a condition of publication in 2004. Genome-wide association studies now face stringent genome-wide significance thresholds and routine requirements for independent replication, and have produced a much more reliable genetics literature as a result. Registered reports are being adopted in psychology, nutrition science, and neuroscience.
The consumer of scientific information -- including journalists, policymakers, clinicians, and the educated public -- needs a calibrated skepticism about single studies. A single study published in a prestigious journal showing that a dietary supplement prevents dementia, or that a brief intervention produces lasting behavioral change, is weak evidence on its own. Replication, particularly pre-registered replication in independent labs with samples large enough to detect modest effects, is the standard the literature needs to meet before confident claims are warranted.
Science as a collective, self-correcting process over decades is among humanity's most reliable instruments for distinguishing true from false. Individual studies, published today, subject to all the incentive pressures Ioannidis documented, are not.
Understanding the difference between these two things is one of the most important elements of scientific literacy.
References
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366. https://doi.org/10.1177/0956797611417632
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
Begley, C. G., & Ellis, L. M. (2012). Drug development: Raise standards for preclinical cancer research. Nature, 483(7391), 531-533. https://doi.org/10.1038/483531a
Errington, T. M., Mathur, M., Soderberg, C. K., Denis, A., Perfito, N., Iorns, E., & Nosek, B. A. (2021). Investigating the replicability of preclinical cancer biology. eLife, 10, e71601. https://doi.org/10.7554/eLife.71601
Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors. Perspectives on Psychological Science, 9(6), 641-651. https://doi.org/10.1177/1745691614551642
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638-641. https://doi.org/10.1037/0033-2909.86.3.638
Fanelli, D. (2010). "Positive" results increase down the hierarchy of the sciences. PLOS ONE, 5(4), e10068. https://doi.org/10.1371/journal.pone.0010068
Prasad, V., & Cifu, A. (2015). Ending medical reversal: Improving outcomes, saving lives. Johns Hopkins University Press.
Estruch, R., Ros, E., Salas-Salvado, J., Covas, M. I., Corella, D., Aros, F., Gomez-Gracia, E., Ruiz-Gutierrez, V., Fiol, M., Lapetra, J., Lamuela-Raventos, R. M., Serra-Majem, L., Pinto, X., Basora, J., Munoz, M. A., Sorli, J. V., Martinez, J. A., Fito, M., Gea, A., ... Martinez-Gonzalez, M. A. (2018). Primary prevention of cardiovascular disease with a Mediterranean diet supplemented with extra-virgin olive oil or nuts. New England Journal of Medicine, 378(25), e34. https://doi.org/10.1056/NEJMoa1800389
Frequently Asked Questions
What did Ioannidis argue in his 2005 paper?
John Ioannidis showed mathematically that when prior probability of a hypothesis is low, statistical power is limited, and researcher bias exists, the majority of statistically significant findings in a literature will be false positives. The positive predictive value of a significant result depends critically on how plausible the hypothesis was before testing.
What is p-hacking?
P-hacking refers to selectively reporting analyses that yield p<0.05 while omitting analyses that do not. Simmons, Nelson, and Simonsohn demonstrated in 2011 that common researcher degrees of freedom -- stopping data collection once p<0.05 is reached, adding covariates, excluding outliers selectively -- can produce a significant result for the absurd claim that listening to a Beatles song makes people chronologically younger.
What did the Open Science Collaboration find?
The 2015 Open Science Collaboration replicated 100 published psychology studies and found that only 36% of the replications produced statistically significant results in the same direction as the original. Average effect sizes were roughly half the original reported sizes, consistent with the winner's curse.
How bad was the replication problem in cancer biology?
C. Glenn Begley and Lee Ellis reported in 2012 that Amgen scientists attempted to replicate 53 landmark cancer biology studies and found only 6 reproduced successfully. The 89% failure rate for studies that had formed the basis of drug development programs represented enormous wasted investment and delayed treatments.
What is the winner's curse in research?
Andrew Gelman and John Carlin identified the winner's curse: studies with small samples require large effect sizes to reach statistical significance. The first studies that cross the significance threshold therefore overestimate true effect sizes, sometimes by a factor of two or more. Subsequent larger studies almost always find smaller effects.
Which scientific fields are most and least affected by the replication problem?
Nutrition science, social psychology, and preclinical cancer biology have shown the highest rates of non-replication. Genomics has improved substantially through stringent genome-wide significance thresholds, large samples, and mandatory independent replication of GWAS findings. Physics, chemistry, and clinical trials with large samples and pre-registration are generally more replicable.
What does good science look like structurally?
Pre-registration of hypotheses and analysis plans before data collection, large and adequately powered samples, open data sharing, independent replication, and registered reports (where journals commit to publication before seeing results) are the structural features that reliably improve the validity of scientific findings.