Science claims to be self-correcting. When a finding is wrong, the methods of science — replication, open reporting, expert evaluation — are supposed to catch and correct the error. The institution that embodies this self-correction most visibly is peer review: the process by which expert scientists evaluate each other's work before it is published and enters the scientific record.
For most of the twentieth century, peer review was treated as the gold standard of scientific quality control, so embedded in scientific culture that published results were often treated as synonymous with verified results. Then, from about 2011 onward, a series of high-profile failures — massive replication failures in psychology, fraud cases in medicine, systematic problems in nutrition science — forced a reckoning. The institution that was supposed to ensure scientific quality had significant, systematic flaws.
Understanding peer review today means understanding both what it does well and where it fails — and what the scientific community is doing about it.
How Peer Review Works
When a researcher submits a manuscript to a scientific journal, the process typically proceeds as follows:
Editorial triage: The journal's editor reads the submission and decides whether it is within scope and of sufficient quality to warrant external review. Many submissions are rejected at this stage without review.
Reviewer selection: For submissions that pass triage, the editor identifies two to four experts in the relevant field and invites them to review the manuscript. Reviewers are typically chosen based on expertise, availability, and absence of conflicts of interest. Most review is unpaid volunteer work.
Blinding model: In single-blind review, reviewers know the authors' identities but authors do not know the reviewers'. In double-blind review, neither party knows the other's identity. Some journals use open review, in which identities are known to both parties and reviews may be published alongside the article.
Review and recommendation: Reviewers read the manuscript and typically provide several pages of written feedback. They recommend one of: accept as is, minor revision, major revision, or reject. They assess the quality of methods, the soundness of statistical analysis, the appropriate interpretation of results, and the significance of the findings.
Decision: The editor makes a final decision based on reviewer recommendations. For revisions, authors respond to comments and resubmit; the process may repeat multiple rounds.
Publication: Accepted manuscripts are copy-edited, typeset, and published, entering the permanent scientific record.
The entire process typically takes three months to two years. High-impact journals such as Nature, Science, and The Lancet have rejection rates of 80–95%. The resulting "peer-reviewed literature" is the primary medium through which scientific knowledge is communicated, stored, and built upon. PubMed, the primary index for biomedical literature, indexes over 35 million records, the vast majority of which are peer-reviewed.
The History of Peer Review
The history of formal peer review is shorter than most people assume. The Philosophical Transactions of the Royal Society, founded in 1665 and often cited as the first scientific journal, was selective about what it printed, but the selection rested on editorial judgment rather than external peer review. Henry Oldenburg, the secretary of the Royal Society who founded the journal, largely decided what to publish based on his own assessment and the reputation of the submitting member.
The modern peer review system — external expert evaluation before publication — became standard practice only in the mid-twentieth century. The journal Nature did not introduce formal peer review until 1973. The rapid spread of peer review as a norm happened without much evidence that it was superior to alternatives; it was adopted largely because it was institutionally convenient as science expanded and the volume of submissions grew too large for editorial judgment alone to manage.
| Year | Development |
|---|---|
| 1665 | Philosophical Transactions of the Royal Society founded (editorial review begins) |
| 1752 | Royal Society establishes a Committee on Papers (early systematic review) |
| 1936 | Annals of Mathematics begins using peer reviewers systematically |
| 1973 | Nature introduces formal peer review |
| 1970s–80s | Most major scientific journals adopt peer review as standard |
| 1991 | arXiv preprint server launched (physics); begins challenge to peer review monopoly on scientific communication |
| 2006 | PLOS ONE launches with impact-neutral review (accept if methodology is sound, regardless of perceived significance) |
| 2020 | COVID-19 pandemic accelerates preprint adoption massively, with both benefits and risks |
What Peer Review Actually Catches
A common misconception is that peer review verifies the results of a study. It does not — and cannot. Reviewers see only the manuscript, not the raw data, the lab notebooks, or the actual experimental process. They cannot determine whether the data was collected as described, whether all analyses were reported, or whether the findings were cherry-picked from a larger set of unreported results.
What peer review can and does catch:
- Obvious methodological errors: Incorrect statistical tests, inappropriate experimental designs, obvious confounds
- Overclaiming relative to evidence: Conclusions that go beyond what the data actually shows
- Missing citations and context: Important prior work the authors failed to engage with
- Presentation problems: Unclear writing, missing details that would prevent replication
- Some forms of inconsistency: Numbers in tables that do not add up, results that contradict each other
What peer review typically cannot catch:
- Fraud and fabrication: Reviewers have no access to raw data in most systems
- p-hacking and selective reporting: Reporting only the analyses that gave significant results
- HARKing: Presenting post-hoc hypotheses as pre-specified
- Questionable research practices: Flexible analysis choices that inflate significance
- False positives due to underpowered studies: Even a perfectly honest study produces false positives at the nominal rate, and when statistical power is low, a significant result is disproportionately likely to be one of them
Error-insertion experiments illustrate the limits of review directly. Richard Smith, former editor of the British Medical Journal, describes a BMJ study in which eight deliberate errors were inserted into a short paper sent to hundreds of reviewers: the median number of errors spotted was two, and no reviewer found more than five. Similar experiments have produced similar results, including one by Fiona Godlee and colleagues (1998), who introduced eight deliberate weaknesses into a manuscript and found that reviewers identified on average only two. Peer reviewers, even expert ones, are not reliable error-detection systems.
"Peer review is a flawed process, full of easily identified defects with little evidence that it works. Yet it is likely to remain central to science and journals because there is no obvious alternative, and scientists and editors have a continuing faith in peer review." — Richard Smith, British Medical Journal (2006)
The Replication Crisis: When the System Failed
From roughly 2011 onward, a series of large-scale replication attempts revealed that substantial proportions of published findings in several fields could not be reproduced.
The landmark paper was Simmons, Nelson, and Simonsohn's "False-Positive Psychology" (2011), published in Psychological Science, which demonstrated how standard flexible analytical practices could generate statistically significant results for manifestly false hypotheses — including the claim that listening to a Beatles song about aging made experimental participants literally younger. This opened systematic inquiry into how widespread these practices were.
The Open Science Collaboration's 2015 project "Estimating the Reproducibility of Psychological Science," published in Science, attempted to replicate 100 studies published in top psychology journals. Only 36–39% (depending on the metric) produced results consistent with the original, and the effect sizes in replications were on average about half the size of the originals. In this and subsequent multi-lab projects, many findings that had attracted enormous popular attention — ego depletion (the idea that self-control is a limited resource that depletes with use), the facial feedback hypothesis (that holding a pencil in your teeth makes cartoons funnier), social priming effects on behavior — either failed to replicate entirely or replicated with substantially smaller effects.
Other domains showed similar problems:
- Cancer biology: The Reproducibility Project: Cancer Biology (Errington et al., 2021, eLife) set out to replicate 193 experiments from 53 high-profile papers but was able to complete only 50 experiments from 23 papers, in large part because original methods, materials, and data were unavailable. Of the effects it did measure, roughly half failed to replicate, and replication effect sizes were dramatically smaller than the original claims (the median was about 85% smaller).
- Nutrition science: Studies consistently show that dietary research, much of it based on unreliable self-reported food frequency questionnaires, produces findings that reverse across decades. The history of dietary fat recommendations — which shifted from low-fat to low-carbohydrate and back with each successive wave of observational studies — is a case study in how nutritional epidemiology's structural weaknesses produce misleading findings.
- Preclinical medical research: John Ioannidis and colleagues estimated in a widely cited paper (PLOS Medicine, 2005) that most published research findings are false when accounting for the testing landscape and publication bias.
Why the System Produced These Failures
The replication crisis did not happen because scientists are dishonest. It happened because the incentive structure of science encouraged practices that systematically inflate false-positive rates.
Publication bias: Journals preferentially publish significant results. Null results ("we tested this and found nothing") are rarely published, even when the non-finding is important. This creates a literature that overstates effect sizes, because the file drawer full of null results never appears. Sterling (1959) first documented publication bias in psychology, finding that 97% of articles in four major psychology journals reported statistically significant results, a proportion far too high to be consistent with the statistical power of typical studies and explicable only if null and negative results were going unpublished.
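The mechanics are easy to demonstrate. Below is a minimal simulation of the file drawer, a sketch with assumed numbers rather than an analysis of real data: a true effect of d = 0.3, 30 participants per group, and a rule that only significant results see print. Under these assumptions, the published literature reports an average effect roughly double the true one.

```python
# A minimal file-drawer simulation. Assumed numbers: true effect d = 0.3,
# 30 participants per group, and only studies with p < .05 are "published".
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d, n = 0.3, 30
published = []

for _ in range(10_000):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(true_d, 1.0, n)
    result = stats.ttest_ind(treated, control)
    # Observed standardized effect size (Cohen's d) for this study:
    d_obs = (treated.mean() - control.mean()) / np.sqrt(
        (treated.var(ddof=1) + control.var(ddof=1)) / 2)
    if result.pvalue < 0.05:          # null results go into the file drawer
        published.append(d_obs)

print(f"true effect:            d = {true_d}")
print(f"mean published effect:  d = {np.mean(published):.2f}")       # ~0.6, about double
print(f"share of studies published: {len(published) / 10_000:.0%}")  # ~20%
```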
Small sample sizes: Many fields ran standard experiments with 20–30 participants per group. At these sample sizes, a single study has very low power to detect real effects, which inflates the effect sizes of the findings that do reach significance and raises the share of false positives among published significant results. False-positive results, by definition, replicate poorly. A study with 30 participants per group testing a medium effect has statistical power of around 40–50%, meaning it will detect the effect less than half the time even when it exists.
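The power figures above can be checked directly. Here is a short sketch using statsmodels, assuming a two-sided, two-sample t-test with a medium effect of d = 0.5 and α = .05:

```python
# Verifying the power claim for a two-sample t-test with d = 0.5, alpha = .05.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power with 30 participants per group:
power = analysis.power(effect_size=0.5, nobs1=30, alpha=0.05)
print(f"power at n = 30 per group: {power:.2f}")         # ~0.47

# Sample size per group needed for the conventional 80% power:
n_needed = analysis.solve_power(effect_size=0.5, power=0.80, alpha=0.05)
print(f"n per group for 80% power: {n_needed:.0f}")      # ~64
```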
P-hacking: With flexible analytical choices (which outliers to exclude, which covariates to include, when to stop collecting data), researchers can run multiple analyses and report only those that reach p < .05. This is multiple-comparison inflation disguised as a single, pre-planned test. Wicherts et al. (2016) catalogued 34 researcher degrees of freedom available in a typical psychology study, estimating that the combined flexibility could allow virtually any hypothesis to be "supported" with p < .05.
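A small simulation makes the inflation concrete. The sketch below assumes a true null effect and only modest flexibility: three ways of scoring the outcome and one round of optional stopping. Even this is enough to push the false-positive rate well past the nominal 5%.

```python
# A minimal p-hacking simulation under a true null effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def hacked_study(n_first=20, n_max=40):
    """Return True if any analysis path reaches p < .05 despite no true effect."""
    dv1 = rng.normal(size=(2, n_max))                    # row 0: control, row 1: treatment
    dv2 = 0.5 * dv1 + 0.5 * rng.normal(size=(2, n_max))  # a second, correlated outcome
    for n in (n_first, n_max):                           # peek at n=20, then add subjects
        for dv in (dv1, dv2, (dv1 + dv2) / 2):           # three ways to score the outcome
            if stats.ttest_ind(dv[1, :n], dv[0, :n]).pvalue < 0.05:
                return True                              # report the "significant" path
    return False

rate = np.mean([hacked_study() for _ in range(5000)])
print(f"false-positive rate with flexible analysis: {rate:.1%}")  # well above 5%
```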
Incentive misalignment: Careers are built on publications in high-impact journals. Journals reward novelty and significance. Replications are rarely published. Null results are harder to publish. This creates systematic pressure toward positive results regardless of researcher intentions. Smaldino and McElreath (2016) modeled this incentive structure formally and showed that natural selection operating on research groups would systematically favor those that adopted publication-maximizing practices over those that prioritized rigor.
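A toy simulation in the spirit of that result is sketched below. It is not Smaldino and McElreath's actual model; the parameters and functional forms are invented for illustration. Labs that cut corners run more studies and produce more publishable positives, and because successful labs are imitated, rigor erodes over generations even if no individual lab intends it.

```python
# A toy selection-on-labs simulation (invented parameters, not the
# Smaldino-McElreath model itself). "Effort" of 1.0 means maximal rigor.
import numpy as np

rng = np.random.default_rng(3)
n_labs, generations = 100, 200
effort = rng.uniform(0.1, 1.0, n_labs)

for _ in range(generations):
    # Lower effort -> more studies run, and a more inflated false-positive rate.
    studies = (10 * (1.1 - effort)).astype(int) + 1
    alpha = 0.05 + 0.3 * (1 - effort)            # questionable practices inflate alpha
    # Publishable positive results per lab: true hits plus false positives.
    pubs = rng.binomial(studies, 0.2 + alpha)
    # Selection: labs are imitated in proportion to publication count,
    # with small random drift in effort.
    parents = rng.choice(n_labs, n_labs, p=pubs / pubs.sum())
    effort = np.clip(effort[parents] + rng.normal(0, 0.02, n_labs), 0.1, 1.0)

print(f"mean effort after selection: {effort.mean():.2f}")  # drifts toward the minimum
```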
Reforms: What Science Is Doing About It
The scientific community has responded to the replication crisis with a substantial set of methodological reforms, not all of which have been equally adopted.
Pre-Registration
Pre-registration requires researchers to publicly register their hypotheses, sample sizes, and analysis plans before collecting data. This makes it far harder to present post-hoc analyses as confirmatory and deters HARKing.
The Center for Open Science's Open Science Framework hosts pre-registrations at no cost. Registered Reports, offered by an increasing number of journals (over 300 as of 2023), accept or reject papers based on their methods before results are known — removing the publication bias toward significant results. Studies accepted as Registered Reports are published regardless of whether the results are significant, positive, or null. Early evidence suggests that Registered Reports produce much lower rates of significant results than standard submissions, consistent with the hypothesis that pre-registration reduces p-hacking.
Open Data and Materials
Journals and funding agencies increasingly require researchers to share raw data and analysis code. The FAIR data principles (Findable, Accessible, Interoperable, Reusable), formalized by Wilkinson et al. (2016) in Scientific Data, provide a framework for data sharing that enables verification, meta-analysis, and detection of errors. Repositories including Zenodo, the Open Science Framework, and Dryad provide infrastructure for open data sharing.
This allows other researchers to verify analyses, detect errors, and conduct meta-analyses with more complete datasets. Several major fraud cases, including the Diederik Stapel case in social psychology and the Hwang Woo-suk fraud in stem cell research, were eventually uncovered partly through statistical and image anomalies in published work that independent analysts identified.
Larger Sample Sizes and Power Analysis
Many fields have moved toward requiring prospective power calculations — estimating necessary sample sizes before running a study — and toward much larger samples than historically typical. This reduces both false positives and the inflation of effect sizes. The introduction of Many Labs replication projects, in which the same study is run simultaneously in dozens of independent laboratories worldwide, provides especially high-powered estimates of effect sizes and replicability.
Registered Replication Reports
Several journals now publish large-scale, pre-registered replication attempts. The Many Labs series, the Psychological Science Accelerator, and the StudySwap initiative coordinate multi-site replications. These studies, in which the same protocol is run in multiple independent laboratories, provide much stronger evidence about whether an effect is real than any single replication study.
Preprints
Preprint servers — arXiv for physics and mathematics, bioRxiv for biology, medRxiv for medicine, PsyArXiv for psychology — allow researchers to post manuscripts publicly before peer review. This enables immediate access to findings, community feedback, and reduces the delay between discovery and communication.
The COVID-19 pandemic demonstrated both the power and the risk of preprints. Early research on viral transmission, vaccines, and treatments was shared within days of completion, enabling rapid scientific progress. The mRNA vaccine platform, for instance, benefited from immediate sharing of the SARS-CoV-2 spike protein sequence. But unreviewed, sometimes flawed or incorrect papers also reached policymakers and the public, occasionally causing harm — most notably early preprints suggesting hydroxychloroquine was effective against COVID-19, which influenced treatment decisions before proper review.
Post-Publication Review
PubPeer is a website that allows researchers to comment on published papers, providing a form of post-publication peer review. Several major retraction cases began with PubPeer comments, including scrutiny of work by cancer researcher Carlo Croce and concerns about image manipulation in dozens of papers across biology and medicine. This is a significant improvement over the old model, in which errors in published papers often persisted indefinitely.
Predatory Publishing: A New Threat
The open-access movement — making scientific literature freely available rather than behind paywalls — has produced enormous benefits for science communication, but it has also created a new threat: predatory journals.
Predatory journals charge authors publication fees (sometimes substantial) while providing little or no genuine peer review. They operate primarily as revenue-generating enterprises, accepting almost any submission for a fee and providing the veneer of peer-reviewed publication without the substance. Jeffrey Beall, a librarian at the University of Colorado Denver, maintained a list of suspected predatory publishers from 2010 to 2017 that eventually catalogued over 1,000 journals and publishers.
The scale of the problem is substantial. Shen and Björk (2015, BMC Medicine) estimated that predatory journals published approximately 420,000 articles in 2014 alone. A sting operation by the journalist John Bohannon in 2013 submitted a deliberately flawed, fabricated paper to 304 open-access journals; 157 accepted it without meaningful review.
Distinguishing legitimate open-access journals from predatory ones requires attention to whether a journal is indexed in established databases (PubMed, Web of Science), whether it is listed in the Directory of Open Access Journals (DOAJ), and whether its editorial board consists of real, identifiable experts.
The Future of Peer Review
The traditional model of peer review — unpaid, pre-publication, closed, with binary accept/reject outcomes — is under pressure from multiple directions.
Overlay journals publish only papers already posted as preprints, with peer review conducted on the already-public manuscript. This separates review from dissemination.
Open review publishes reviewer reports alongside articles, creating accountability for reviewer quality and allowing readers to evaluate the review process. eLife has adopted a version of this model, publishing reviewer assessments alongside the papers it evaluates.
Cascading review allows manuscripts rejected from one journal to transfer their reviews to a lower-tier journal, reducing duplicate review work and reviewer burden. Some major publishers have implemented cascading systems across their journal portfolios.
Automated and AI-assisted review is beginning to be used to check statistical reporting, detect inconsistencies, and flag potential image manipulation — automating some of the error-detection work that human reviewers perform poorly. Tools like statcheck (Epskamp and Nuijten, 2016) automatically extract the test statistics reported in psychology papers and recompute the p values they imply, flagging inconsistencies. When statcheck was applied to a large corpus of psychology publications, it found statistical inconsistencies in about half of the papers that reported such tests, with roughly one in eight containing an inconsistency serious enough to potentially change the paper's conclusion.
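statcheck itself is an R package, but the core consistency check it performs is simple to sketch. The Python fragment below is an illustration of the idea, not statcheck's actual implementation: it parses reported results such as "t(28) = 2.20, p = .04", recomputes the p value implied by the test statistic and degrees of freedom, and flags mismatches.

```python
# A minimal statcheck-style consistency check (illustrative sketch only).
import re
from scipy import stats

def check_t_report(text, tol=0.01):
    """Flag reported t-test results whose p value does not match the
    reported t statistic and degrees of freedom."""
    pattern = r"t\((\d+)\)\s*=\s*(-?\d+\.?\d*),\s*p\s*=\s*(\.\d+)"
    for df, t, p in re.findall(pattern, text):
        recomputed = 2 * stats.t.sf(abs(float(t)), int(df))   # two-sided p
        if abs(recomputed - float(p)) > tol:
            print(f"inconsistent: t({df}) = {t} implies p = {recomputed:.3f}, "
                  f"but p = {p} was reported")

check_t_report("t(28) = 2.20, p = .04")   # consistent: implied p = .036
check_t_report("t(28) = 1.70, p = .04")   # flagged: implied p = .100
```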
Diamond open access, in which journals are funded by institutions or grants rather than by author fees or subscriptions, aims to remove the fee-driven incentives that fuel predatory publishing and distort conventional publishing.
None of these reforms will make science infallible. The fundamental challenge is that science is a human enterprise conducted under conditions of uncertainty, career pressure, and limited resources. Peer review is not a truth machine; it is an error-reduction system that operates probabilistically. The question is not how to make it perfect but how to make it better — to reduce the systematic biases and incentive distortions that allowed the replication crisis to develop undetected for decades.
What This Means for Readers of Science
For the general reader trying to evaluate scientific claims, several practical takeaways emerge:
"Peer-reviewed" is not a quality guarantee. It means a manuscript passed editorial and reviewer scrutiny, which catches many but not all errors. Peer review is necessary but not sufficient for a finding to be trustworthy.
Effect size and replication matter more than novelty. Large, dramatic, surprising findings published in a single study should be held tentatively. Findings replicated across multiple independent studies, especially pre-registered ones, carry much stronger evidential weight.
Meta-analyses are generally more reliable than individual studies, but only when based on a reasonably complete literature. Meta-analyses that use only published studies will overestimate effects due to publication bias. Funnel plot asymmetry — where smaller studies show larger effects than larger studies — is a common statistical indicator of publication bias in a body of research.
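A funnel plot needs only each study's effect size and standard error. The sketch below uses simulated studies under assumed conditions (a true effect of exactly zero, with results significant in the predicted direction far more likely to be published) to show the characteristic asymmetry a biased literature produces.

```python
# Simulated funnel plot under publication bias. Assumptions: the true effect
# is zero, and a study is published if it is significant in the predicted
# (positive) direction, or with 10% probability otherwise.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
effects, ses = [], []

for _ in range(2000):
    n = rng.integers(10, 200)            # per-group sample size varies by study
    se = np.sqrt(2 / n)                  # approximate standard error of Cohen's d
    d = rng.normal(0.0, se)              # observed effect when the true effect is 0
    if d / se > 1.96 or rng.random() < 0.1:
        effects.append(d)
        ses.append(se)

plt.scatter(effects, ses, s=8, alpha=0.4)
plt.gca().invert_yaxis()                 # convention: precise studies at the top
plt.xlabel("observed effect size (d)")
plt.ylabel("standard error")
plt.title("Asymmetric funnel: imprecise studies show inflated effects")
plt.show()
```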
Open data and pre-registration are positive signals. Papers that pre-registered their analysis plan, share their data openly, and include power calculations are, on average, more trustworthy than papers that do not.
Consensus among researchers matters. A single surprising paper should not overturn your beliefs much. A converging body of evidence, replicated across methodologies, populations, and research groups, is what justifies strong confidence. The GRADE framework (Grading of Recommendations, Assessment, Development, and Evaluations), used by the World Health Organization and Cochrane Collaboration, provides a structured approach to evaluating the certainty of evidence across a body of literature.
Journal prestige is not equivalent to study quality. High-impact journals preferentially publish surprising, novel findings, which are exactly the findings most likely to reflect p-hacking, publication bias, and the winner's curse. The replication crisis hit hardest in studies published in the most prestigious journals.
Conclusion
Peer review is the best quality control mechanism science has yet devised, and it is imperfect in ways that matter enormously. The replication crisis forced an honest reckoning with those imperfections and has produced a wave of methodological reform that is genuinely improving the quality of published research.
The reforms are incomplete and unevenly adopted. Incentive structures — journal prestige, tenure and promotion criteria, funding metrics — still reward novelty over replication. But the scientific community's response to the crisis has been more constructive than defensive, which itself reflects the best ideals of science: being willing to revise beliefs in the face of evidence.
Peer review's purpose is not to certify truth. It is to reduce the probability of error, bias, and fraud entering the scientific literature. That is a more modest but still crucial function — and understanding both what it achieves and where it falls short is essential for anyone who wants to think clearly about scientific evidence.
References
- Smith, R. (2006). Peer review: A flawed process at the heart of science and journals. Journal of the Royal Society of Medicine, 99(4), 178–182.
- Godlee, F., Gale, C. R., & Martyn, C. N. (1998). Effect on the quality of peer review of blinding reviewers and asking them to sign their reports. JAMA, 280(3), 237–240.
- Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
- Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology. Psychological Science, 22(11), 1359–1366.
- Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2(8), e124.
- Errington, T. M., et al. (2021). Investigating the replicability of preclinical cancer biology. eLife, 10, e71601.
- Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance. Journal of the American Statistical Association, 54, 30–34.
- Wicherts, J. M., et al. (2016). Degrees of freedom in planning, running, analyzing, and reporting psychological studies. Frontiers in Psychology, 7, 1832.
- Smaldino, P. E., & McElreath, R. (2016). The natural selection of bad science. Royal Society Open Science, 3(9), 160384.
- Wilkinson, M. D., et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018.
- Epskamp, S., & Nuijten, M. B. (2016). statcheck: Extract statistics from articles and recompute p values. R package version 1.2.2.
- Bohannon, J. (2013). Who's afraid of peer review? Science, 342(6154), 60–65.
- Shen, C., & Björk, B.-C. (2015). 'Predatory' open access: A longitudinal study of article volumes and market characteristics. BMC Medicine, 13, 230.
Frequently Asked Questions
What is peer review in science?
Peer review is the process by which a scientific manuscript submitted to a journal is evaluated by independent experts (peers) in the relevant field before being accepted for publication. Reviewers assess the quality, validity, and significance of the research, and may recommend acceptance, revision, or rejection. It is science's primary quality control mechanism.
When did peer review begin?
Formal peer review as we know it is surprisingly recent. The Philosophical Transactions of the Royal Society began some form of editorial review in 1665, but systematic external peer review became standard practice only in the mid-20th century. Nature introduced formal peer review in 1973, and most major journals adopted it through the 1970s and 80s.
What is publication bias and how does it affect science?
Publication bias is the tendency of journals to preferentially publish studies with positive, statistically significant results over studies with null or negative results. This creates a distorted literature: when a treatment works, many studies get published; when it does not work, those studies languish in file drawers. Meta-analyses built on this biased literature overestimate treatment effects.
What are preprints and how do they change scientific communication?
Preprints are scientific manuscripts posted publicly before peer review, typically to servers like arXiv (physics and math), bioRxiv (biology), or medRxiv (medicine). They allow immediate access to research findings before the peer review process, which can take months to years. The COVID-19 pandemic demonstrated both the value of preprints (rapid information sharing) and their risks (unvetted claims reaching the public).
What is the replication crisis and what caused it?
The replication crisis refers to the discovery, from roughly 2011 onward, that many findings in psychology, medicine, nutrition science, and other fields could not be reproduced in independent experiments. Causes include publication bias, small sample sizes, p-hacking (analyzing data in multiple ways until a significant result appears), HARKing (presenting hypotheses as having been pre-specified when they were generated after seeing the data), and insufficient rigor in pre-registration.