In 1847, the Hungarian physician Ignaz Semmelweis noticed something disturbing about the hospital where he worked in Vienna. The First Maternity Division, staffed by medical students, had a death rate from childbed fever (puerperal fever) of 10-35% in the years between 1841 and 1846. The Second Division, staffed by midwives, had a rate of 1-2%. Women in labor begged to be admitted to the midwives' ward. Some delivered in the street rather than be taken to the doctors. Semmelweis investigated and found that the medical students went directly from performing autopsies on cadavers to examining laboring women without washing their hands. He hypothesized that "cadaverous particles" from the corpses were being transmitted to patients. When he introduced mandatory handwashing with chlorinated lime solution in May 1847, the death rate in his ward dropped within months to 1-2%.
His colleagues rejected his findings. The mechanism he proposed — invisible particles transmitted from the dead to the living — was implausible to the dominant medical theory of the day, which attributed childbed fever to miasma (bad air) and constitutional factors. His superior was dismissive. The leading obstetricians of Europe ignored or openly mocked him. He grew desperate and erratic, writing increasingly angry open letters to prominent physicians denouncing them as murderers. In 1865, he was committed to a mental institution. He died there within two weeks, at 47, possibly from the same kind of infection he had spent his career trying to prevent. He was vindicated posthumously when Louis Pasteur's germ theory, developed in the 1860s, provided the mechanistic explanation that Semmelweis had lacked.
The Semmelweis story is sometimes told as an illustration of medical conservatism — of how entrenched interests and wounded pride can suppress life-saving discoveries. But it also illustrates something more fundamental about how scientific knowledge is produced and accepted: the relationship between observation, mechanism, and theory. Semmelweis had the right intervention and the wrong explanation. His evidence was compelling but his mechanism was wrong. The scientific community that rejected him was not simply being obstinate; it was applying, imperfectly, the demand that explanations fit within a coherent theoretical framework. Understanding why that process works — and why it so often fails — is what the philosophy of science is about.
"Science is not a collection of facts. It is a method for separating the things that seem to be true from the things that are true." — Richard Feynman, The Meaning of It All (1998)
Key Definitions
Falsificationism (Popper): The philosophical position that a hypothesis is scientific only if it makes predictions that could, in principle, be shown to be wrong. Science advances not by confirming theories but by testing and surviving attempts to falsify them.
Paradigm shift (Kuhn): Thomas Kuhn's term for the revolutionary replacement of one scientific framework (paradigm) by another — not a gradual accumulation of evidence but a relatively rapid reconceptualization of an entire field.
Research program (Lakatos): Imre Lakatos's concept of a series of theories sharing a "hard core" of basic assumptions, protected by a "protective belt" of auxiliary hypotheses. A research program is progressive if it generates novel predictions; degenerative if it merely explains anomalies after the fact.
Scientific realism: The philosophical position that successful scientific theories are approximately true descriptions of a mind-independent reality, including unobservable entities like electrons and quarks.
Induction problem (Hume): David Hume's argument that no number of confirming observations can logically justify a universal generalization — the observation of 1,000 white swans does not prove all swans are white.
Hypothesis: A specific, testable prediction derived from a theory, framed in a way that allows empirical investigation.
Null hypothesis: In null hypothesis significance testing, the hypothesis of no effect or no relationship — the baseline assumption against which evidence is weighed.
P-value: The probability of observing data at least as extreme as the actual data, given that the null hypothesis is true. Widely misunderstood as the probability that the null hypothesis itself is true.
Statistical significance: The conventional threshold (p less than 0.05) below which a result is deemed unlikely to have occurred by chance under the null hypothesis. Not equivalent to practical significance or effect size.
Effect size: The magnitude of an effect, independent of sample size. A statistically significant effect can be practically trivial; a practically important effect may not reach statistical significance in a small sample (see the simulation sketch after these definitions).
Pre-registration: Publicly committing to hypotheses, methods, and analysis plans before data collection begins, as a safeguard against p-hacking and HARKing (Hypothesizing After Results are Known).
Replication: The independent repetition of a study, ideally by different researchers using different samples, as the gold-standard test of whether a finding is real.
Peer review: The evaluation of a manuscript by independent experts in the relevant field before publication in a scientific journal.
Theory vs law: In science, a theory is an explanatory framework supported by substantial evidence (e.g., germ theory, evolutionary theory); a law is a description of a pattern (e.g., Newton's laws of motion). "Just a theory" is a common misunderstanding — scientific theories are the best available explanations, not mere speculation.
Demarcation problem: The philosophical problem of drawing a principled line between science and non-science or pseudoscience.
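To make the p-value, significance, and effect-size definitions above concrete, here is a minimal simulation sketch in Python (the sample size and effect magnitude are illustrative assumptions, not values from any study): with a large enough sample, a trivially small true effect comes out "statistically significant" even though its effect size shows it is practically negligible.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two groups differing by a tiny true effect (0.02 SD) -- illustrative numbers.
n = 500_000
control = rng.normal(loc=0.00, scale=1.0, size=n)
treated = rng.normal(loc=0.02, scale=1.0, size=n)

# Null hypothesis: the two group means are equal.
t_stat, p_value = stats.ttest_ind(treated, control)

# Effect size (Cohen's d): mean difference in pooled-standard-deviation units.
pooled_sd = np.sqrt((control.var(ddof=1) + treated.var(ddof=1)) / 2)
cohens_d = (treated.mean() - control.mean()) / pooled_sd

print(f"p = {p_value:.1e}")   # far below 0.05: "statistically significant"
print(f"d = {cohens_d:.3f}")  # ~0.02: far too small to matter in practice
```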
Major Philosophies of Science Compared
| Philosopher | Core Framework | How Science Progresses | Key Strength | Key Limitation |
|---|---|---|---|---|
| Francis Bacon (1620) | Empiricism / induction | Accumulate observations, then generalize | Grounded in observable reality | Hume's problem: induction is not logically valid |
| Karl Popper (1934) | Falsificationism | Theories survive by resisting falsification attempts | Explains asymmetry between confirmation and refutation | Few real theories are ever truly falsified; scientists protect core theories |
| Thomas Kuhn (1962) | Paradigm shifts | Normal science accumulates anomalies; revolution replaces paradigm | Matches the actual history of science | Underplays the rationality of theory choice; relativism concern |
| Imre Lakatos (1970) | Research programs | Progressive programs generate novel predictions; degenerative ones only explain anomalies | Handles theory change better than Popper's naive falsificationism | Difficult to judge in real time whether a program is progressive |
| Paul Feyerabend (1975) | "Anything goes" | No universal scientific method; rule-breaking drives progress | Historically insightful; challenges methodological dogma | Can be used to justify pseudoscience |
| Bas van Fraassen (1980) | Constructive empiricism | Accept theories as empirically adequate, not literally true | Avoids metaphysical commitments about unobservables | Difficulties distinguishing "observable" from "unobservable" |
The Problem of Induction
David Hume, writing in A Treatise of Human Nature in 1739, identified a logical problem that has haunted the philosophy of science ever since. All empirical knowledge claims are based on induction — reasoning from observed instances to general conclusions. We have observed the sun rising every morning for all of recorded history; we conclude it will rise tomorrow. We have observed thousands of white swans; we conclude all swans are white.
But this inference is not logically valid. No finite number of confirming observations can logically entail a universal generalization. The observation of a single black swan (as European naturalists discovered when they reached Australia) immediately falsifies "all swans are white." The problem is not merely theoretical: it means that science cannot, in principle, prove any general claim through accumulation of evidence alone.
This is not a fatal problem for science — it does not mean science cannot produce reliable knowledge. It means that the nature of scientific knowledge is probabilistic rather than certain, and that the appropriate attitude toward scientific claims is one of provisional confidence rather than absolute certainty. The induction problem shapes what science can and cannot do: it can produce well-supported theories that have survived extensive testing; it cannot produce proofs.
The practical consequence for science consumers is significant: when scientists say a claim is "well established," they mean it has survived many attempts to falsify it, not that it has been proven true in the way a mathematical theorem is proven. The possibility of revision always remains — which is a feature, not a bug.
Popper's Falsificationism
Karl Popper, an Austrian-British philosopher who had been disturbed in the 1920s by the apparent unfalsifiability of psychoanalysis and Adlerian psychology, proposed falsificationism as his solution to both the induction problem and the demarcation problem. The argument, published in German in 1934 as Logik der Forschung and in English in 1959 as The Logic of Scientific Discovery, was elegantly simple.
Since science cannot prove theories through induction, it should not try. Instead, science should attempt to falsify theories. A hypothesis is scientific if and only if it makes specific predictions that could, in principle, be shown to be wrong. If no possible observation could contradict a theory, the theory has no empirical content — it tells us nothing about the world.
The Asymmetry of Falsification
Popper identified a logical asymmetry between confirmation and falsification. No finite set of confirming observations proves a universal claim, but a single disconfirming observation logically refutes it. Observing a million white swans does not prove "all swans are white," but observing one black swan proves it false. This asymmetry makes falsification logically powerful in a way that confirmation is not.
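In logical terms, the asymmetry is the difference between a valid and an invalid inference schema. A sketch, writing T for the theory and O for an observation it predicts:

```latex
% Falsification is modus tollens, a deductively valid form:
%   if T then O; not O; therefore not T.
(T \to O) \land \lnot O \;\vdash\; \lnot T

% Confirmation is affirming the consequent, which is invalid:
%   if T then O; O; therefore T -- does not follow.
(T \to O) \land O \;\nvdash\; T
```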
Applied to scientific practice, this means: scientists should not seek confirmation of their theories (this proves nothing) but should design the most demanding tests they can think of, tests that the theory must pass to survive. A theory that survives many severe attempts at falsification — that has been genuinely put at risk and not refuted — earns our confidence not because it has been proven true but because it has proven robust.
Popper used Freudian psychoanalysis and Alfred Adler's individual psychology, as he had encountered them in interwar Vienna, as examples of unfalsifiable theories. Any human behavior could be explained by either framework after the fact: if a man jumps into a river to save a drowning child, Freud can explain it one way; if he pushes the child in, Freud can explain that too. A theory that explains everything predicts nothing and tells us nothing. By contrast, Einstein's general theory of relativity predicted that light from distant stars would be deflected by the sun's gravity by a specific, calculable amount — a prediction that could have been falsified by Eddington's 1919 eclipse observations and was instead confirmed (to within measurement error).
Popper's falsificationism has been criticized on several grounds. Willard Quine argued that any hypothesis is tested only in conjunction with a web of background assumptions, so a failed prediction can always be attributed to a background assumption rather than the hypothesis itself — making falsification less clean than Popper claimed. Thomas Kuhn argued that practicing scientists do not actually behave as Popperian falsificationists and that science works better because they don't. Imre Lakatos developed a more nuanced account intended to accommodate how science actually works without abandoning rational assessment.
Kuhn's Paradigm Shifts
Thomas Kuhn's 1962 The Structure of Scientific Revolutions is one of the most widely read and most widely misread books in the philosophy of science. Its central argument is that science does not progress through the steady accumulation of facts and the gradual refinement of theories. It progresses through revolutions — relatively rapid episodes in which one entire way of understanding a domain is replaced by another.
Normal Science and Anomalies
Kuhn's key concept is the paradigm: not just a theory but an entire framework of assumptions, methods, exemplary problems and solutions, and perceptual habits. Paradigms include Newton's mechanics, Lavoisier's chemistry, Darwin's evolution, the germ theory of disease. Scientists trained within a paradigm take its basic assumptions for granted and work within its framework — Kuhn calls this "normal science," and characterizes it as puzzle-solving: extending the paradigm, filling in details, resolving inconsistencies, applying it to new domains.
Anomalies — observations that don't fit the paradigm — are routinely encountered in normal science. Scientists don't treat every anomaly as a refutation; they assume the paradigm is correct and the anomaly reflects measurement error, incomplete analysis, or a solvable complication. This is rational: paradigms are supported by so much successful prediction that individual anomalies rarely justify abandoning them. But when anomalies accumulate, when the devices invented to accommodate them become increasingly ad hoc, and when the paradigm seems unable to generate successful novel predictions, a crisis develops.
Revolutions and Their Social Dynamics
Scientific revolutions involve the replacement of one paradigm by another: Ptolemaic geocentrism by Copernican heliocentrism, Newtonian mechanics by relativity and quantum mechanics, the phlogiston theory of combustion by Lavoisier's oxygen chemistry, the fixed-continent geology of the early 20th century by plate tectonics.
Kuhn's controversial sociological claim was that paradigm shifts are not purely rational events driven by evidence. Scientists are trained within a paradigm and have psychological, professional, and social investments in it. They resist paradigm changes for extended periods, often until the older generation dies and is replaced by scientists who trained in the new framework. The physicist Max Planck had observed something similar, in a remark often paraphrased as: "Science advances one funeral at a time."
Kuhn's work was widely misread as implying that scientific progress is purely social, that paradigms are incommensurable (cannot be rationally compared), and that the choice between competing paradigms is arbitrary. These readings prompted fierce responses from philosophers who argued that Kuhn had made science irrational. Kuhn spent much of his later career clarifying that he had not intended this conclusion: paradigm shifts, though not purely rational, are not arbitrary, and science does make genuine progress.
Lakatos's Research Programs
Imre Lakatos, a Hungarian philosopher who worked alongside Popper at the London School of Economics and engaged deeply with Kuhn, proposed the methodology of scientific research programs as a more adequate account of how science works than either Popper's falsificationism or Kuhn's paradigm shifts.
A research program consists of a "hard core" — the fundamental theoretical commitments that are not questioned within the program — surrounded by a "protective belt" of auxiliary hypotheses that can be modified to accommodate anomalies. When a prediction fails, scientists do not immediately abandon the hard core; they adjust the protective belt.
The critical distinction Lakatos introduced was between progressive and degenerative research programs. A progressive program is one that generates novel predictions — surprising results that turn out to be true — in addition to accommodating existing data. A degenerative program merely explains anomalies after the fact, revising auxiliary hypotheses to accommodate each new problem without predicting anything new. On Lakatos's account, rational scientists should abandon degenerative programs and pursue progressive ones — not immediately on each anomaly (pace Popper) but over time as the program's productivity becomes clear.
This framework accommodates the historical observation that scientists sometimes rationally stick with a theory despite apparent refutations (it might be the auxiliary hypotheses, not the core theory, that are wrong) while maintaining that the choice between research programs is not arbitrary (progressive programs earn their place; degenerative ones deserve abandonment).
How Science Actually Works
The Messy Reality
The formal epistemology of Popper, Kuhn, and Lakatos describes how science should work or how it works at the level of major conceptual change. The day-to-day reality of scientific practice is messier, more social, and more contingent.
Theory-laden observation means that what scientists observe is shaped by the theoretical framework they bring to it. Two scientists with different theoretical commitments looking at the same data may genuinely see different things — not because one is dishonest but because perception is not a passive recording of reality. This does not make observation subjective, but it does mean that separating observation from theory is harder than the naive view of the scientific method suggests.
Background assumptions — about instrumentation, calibration, statistical methods, and the relevance of particular measurements — are pervasive and usually unquestioned. Any experiment tests not just the hypothesis of interest but the entire web of assumptions that have gone into its design and execution. Failed predictions must be attributed to something; attribution is a judgment call.
The replication standard — the requirement that a finding should be reproducible by independent researchers — is the gold standard precisely because it is independent of the original team's assumptions, methods, and biases. A finding that multiple independent teams with different methods converge on is far more reliable than a finding from a single study, however well-designed.
Statistics and Its Proper Use
The Replication Crisis
The replication crisis that became visible around 2011 revealed that a substantial proportion of published findings in psychology, medicine, economics, and nutrition science could not be reproduced when independent researchers attempted direct replications. The Open Science Collaboration's 2015 project, reported in Science, found statistically significant results in only 36% of its 100 replications of published psychology experiments.
The statistical roots of the crisis were identified most influentially by John Ioannidis of Stanford in his 2005 PLOS Medicine paper "Why Most Published Research Findings Are False" (doi: 10.1371/journal.pmed.0020124). Using mathematical modeling of the conditions common in biomedical research — small sample sizes, multiple comparisons, selective reporting, low prior probability of the hypotheses tested — Ioannidis showed that under realistic conditions the majority of published positive findings were likely false positives.
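Ioannidis's core calculation can be sketched in a few lines. The parameter values below are illustrative assumptions in the spirit of his paper, not his exact figures, and the model omits his additional bias term for selective reporting:

```python
def positive_predictive_value(prior, power, alpha=0.05):
    """P(effect is real | result is significant), by Bayes' rule.

    prior: fraction of tested hypotheses that are actually true
    power: P(significant result | real effect), i.e. 1 - beta
    alpha: P(significant result | no effect), the false-positive rate
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

# Exploratory field: few true hypotheses, underpowered studies.
print(positive_predictive_value(prior=0.10, power=0.35))  # ~0.44
# Fewer than half of "positive" findings are real -- the headline claim.

# Well-powered confirmatory research on plausible hypotheses:
print(positive_predictive_value(prior=0.50, power=0.80))  # ~0.94
```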
The mechanisms are multiple. P-hacking — running many analyses on a dataset and reporting only those that reach p less than 0.05 — is widespread and produces findings that are artifacts of statistical chance. HARKing — Hypothesizing After Results are Known — transforms post-hoc explanations of unexpected findings into apparent a priori predictions. Publication bias — the tendency of journals to publish significant results and reject null results — means the published literature is systematically non-representative of the research conducted.
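The arithmetic behind p-hacking is straightforward: each test of a true null hypothesis has a 5% chance of crossing the threshold, so the chance of at least one "hit" grows rapidly with the number of analyses (1 − 0.95^20 ≈ 0.64 for twenty tests). A minimal simulation sketch, with illustrative sample sizes and pure-noise data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, n_outcomes, n_subjects = 10_000, 20, 30

hits = 0
for _ in range(n_experiments):
    # Two groups, twenty outcome measures, no real effect anywhere.
    a = rng.normal(size=(n_subjects, n_outcomes))
    b = rng.normal(size=(n_subjects, n_outcomes))
    p = stats.ttest_ind(a, b).pvalue   # one p-value per outcome measure
    hits += (p < 0.05).any()           # report only "the" significant one

print(hits / n_experiments)  # ~0.64: most pure-noise studies yield a "finding"
```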
Pre-registration — publicly committing to hypotheses, methods, and analysis plans before data collection begins — addresses these problems by separating confirmatory from exploratory analysis. Pre-registered studies show substantially lower rates of positive findings and higher rates of replication. Major journals now require pre-registration for clinical trials and increasingly encourage it for observational studies.
Demarcation and Pseudoscience
The demarcation problem — distinguishing science from non-science — matters practically for public policy, medical regulation, and legal evidence standards.
Popper's falsifiability criterion remains the most widely used in public discourse: is the claim testable? Homeopathy is testable (it predicts specific therapeutic effects beyond placebo) and has been tested extensively, consistently failing randomized controlled trials. The claim is falsifiable; it has been falsified; it persists in practice not because the evidence supports it but because practitioners find ways to explain away the failures.
Intelligent design is less cleanly testable: its central claim that biological complexity requires an intelligent designer is compatible with virtually any possible observation about biological organisms. When evidence for evolution is presented (fossil record, comparative genomics, directly observed speciation), intelligent design accommodates it by attributing complexity to design rather than predicting what designers would or would not produce.
Paul Thagard's criteria, developed in 1978, go beyond falsifiability to ask about community practice: pseudoscientific communities fail to show progressive problem-solving over time, do not seriously engage with anomalies, do not critically evaluate their own theories, and promote claims to non-specialists without subjecting them to peer scrutiny. By these criteria, astrology qualifies as pseudoscience not just because its predictions fail but because its practitioner community does not respond to failures by revising the theory.
Science and Values
The is-ought distinction, articulated by Hume in 1739, establishes that statements about what is the case cannot, without additional normative premises, entail statements about what ought to be the case. Science describes and explains; it cannot by itself prescribe. Climate science tells us that burning fossil fuels causes warming with specific predicted consequences; it cannot tell us, without value premises, how to weigh present costs against future harms, or how to distribute burdens across countries and generations.
This does not mean science and values are entirely separate. Values influence which research questions receive funding, how results are communicated, which applications are developed, and whose problems are treated as worth solving. Feminist philosophers of science, including Helen Longino in her social epistemology framework, have argued that scientific objectivity is not a property of individual scientists but of scientific communities with appropriate critical norms — diverse perspectives, public evidence standards, and genuine responsiveness to criticism.
The appropriate relationship between science and values is one of mutual constraint rather than dominance in either direction: values should not determine what scientific evidence shows, and scientific evidence should inform but cannot determine what we should do. Semmelweis's colleagues were wrong not because they applied values to his evidence, but because they allowed institutional pride and theoretical commitment to override a straightforward empirical signal. Science's self-correcting mechanism — replication, peer criticism, openness to revision — is designed precisely to catch and correct such failures. When it works, it is the most reliable method humanity has developed for producing knowledge about the natural and social world.
References
- Popper, K.R. (1959). The Logic of Scientific Discovery. Hutchinson. (Original German: Logik der Forschung, 1934.)
- Kuhn, T.S. (1962). The Structure of Scientific Revolutions. University of Chicago Press.
- Ioannidis, J.P.A. (2005). Why Most Published Research Findings Are False. PLOS Medicine, 2(8), e124. doi:10.1371/journal.pmed.0020124
- Lakatos, I., & Musgrave, A. (Eds.). (1970). Criticism and the Growth of Knowledge. Cambridge University Press.
- Hume, D. (1739). A Treatise of Human Nature. London.
- Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
- Thagard, P. (1978). Why Astrology Is a Pseudoscience. PSA: Proceedings of the Biennial Meeting of the Philosophy of Science Association, 1978, 223-234.
- Longino, H. (1990). Science as Social Knowledge. Princeton University Press.
See also: Why most published research is wrong, Why experts disagree, How evolution works
Frequently Asked Questions
What is falsificationism and why did Popper think it mattered?
Karl Popper's falsificationism, developed in his 1934 book 'Logik der Forschung' (translated as 'The Logic of Scientific Discovery' in 1959), was his answer to two intertwined problems: the problem of induction and the problem of demarcation. David Hume had established in 1739 that no number of confirming observations can logically prove a universal generalization — observing a million white swans does not prove all swans are white; one black swan falsifies it. Science cannot proceed by accumulating confirmations because confirmation has no logical limit and provides no certainty. Popper's solution was to invert the direction: instead of trying to confirm theories, scientists should try to falsify them. A hypothesis is genuinely scientific if and only if it makes specific predictions that could, in principle, be shown to be wrong. A theory that is compatible with any possible observation — that can explain anything — has no predictive power and therefore no scientific content. Popper used psychoanalysis and Adlerian psychology as examples of unfalsifiable theories: any human behavior could be explained by Freudian or Adlerian concepts after the fact, which meant the theories were never at risk of being disproven. By contrast, Einstein's general theory of relativity made the specific prediction that light would be bent by gravity by a precisely specified amount — a prediction that could be, and was, tested by Arthur Eddington's solar eclipse observations in 1919. Had the result differed from the prediction, the theory would have been falsified. Popper's demarcation criterion — falsifiability as the line between science and non-science — remains influential despite significant philosophical criticism. It is widely used, often imprecisely, in public debates about intelligent design, homeopathy, and other contested knowledge claims.
What is a paradigm shift and how does scientific consensus actually change?
Thomas Kuhn's 1962 book 'The Structure of Scientific Revolutions' transformed how philosophers, historians, and scientists think about how scientific knowledge changes. Kuhn argued that science does not progress through the steady accumulation of facts and theories but through occasional revolutions that replace one entire way of understanding a domain with another. He called these frameworks 'paradigms' — not just theories but entire worldviews, including shared assumptions, methods, exemplary problems and their solutions, and even the perception of what counts as a relevant observation. Normal science, Kuhn argued, is puzzle-solving within a paradigm: scientists take the paradigm for granted and work to extend it, fill in details, and resolve minor inconsistencies. When anomalies accumulate that the paradigm cannot accommodate, a crisis develops, and eventually a revolutionary new paradigm is proposed — Copernican heliocentrism replacing Ptolemaic geocentrism, Newtonian mechanics being superseded by relativity and quantum mechanics, continental drift theory overturning fixed-continent geology. Crucially, Kuhn argued that paradigm shifts are not purely rational events driven by evidence. Scientists are trained within a paradigm and have psychological and professional investments in it; they resist anomalies for a long time before the accumulation becomes untenable. The sociological insight — that scientific communities resist paradigm changes and that normal science is inherently conservative — was controversial and widely misread as implying that science is merely a social construction. Kuhn's actual view was more subtle: paradigm shifts, though not purely rational, are not arbitrary, and science does make genuine progress. But his work permanently complicated the naive view of science as a purely objective enterprise immune to sociology and psychology.
What is the replication crisis and how serious is it?
The replication crisis refers to the widespread finding, beginning around 2011, that a substantial proportion of published scientific findings cannot be reproduced when independent researchers attempt to replicate the original studies. The crisis has affected psychology, medicine, economics, and nutrition science most visibly. In 2015, the Open Science Collaboration published 'Estimating the Reproducibility of Psychological Science' in Science, reporting that only 36% of its 100 replications of published psychology experiments produced statistically significant results. In medicine, John Ioannidis's enormously influential 2005 paper 'Why Most Published Research Findings Are False' in PLOS Medicine used statistical modeling to argue that for research designs common in biomedical science, the majority of published positive findings were likely false positives. The statistical roots of the crisis lie in the misuse and misunderstanding of null hypothesis significance testing. A p-value less than 0.05 does not mean there is a 95% probability that the effect is real; it means that if the null hypothesis were true, you would observe data at least this extreme less than 5% of the time by chance alone. When researchers run multiple analyses and publish only significant results (p-hacking), when they frame post-hoc explanations as a priori hypotheses (HARKing — Hypothesizing After Results are Known), and when small samples produce highly variable results, the published literature becomes systematically misleading. Pre-registration — publicly committing to hypotheses, methods, and analysis plans before data collection — has emerged as the primary solution. Major journals including Nature and Science now encourage or require pre-registration for clinical trials and increasingly for observational studies.
What separates science from pseudoscience?
The demarcation problem — drawing a principled line between science and non-science or pseudoscience — has not been fully solved by philosophers of science, but several criteria have been proposed that are useful in practice. Popper's criterion of falsifiability is the most famous: a claim is scientific if it makes predictions that could be proven wrong. By this criterion, homeopathy is testable (it predicts that highly diluted solutions will have specific therapeutic effects beyond placebo) and has been tested extensively, consistently failing — which makes it bad science rather than non-science. Astrology makes predictions that are falsifiable but systematically fails tests, which similarly makes it a failed science rather than a non-science. Intelligent design fails the falsifiability criterion more completely because it can accommodate any observation about the natural world. Paul Thagard proposed more comprehensive criteria in 1978: pseudoscience is characterized by lack of problem-solving progress over time, by a community that fails to engage seriously with anomalies, that does not evaluate the theory critically, and that promotes the theory to non-specialists without subjecting it to peer scrutiny. By these criteria, astrology qualifies as pseudoscience not just because its predictions fail but because its practitioner community does not respond to failures by revising the theory. The demarcation problem matters practically because it affects public policy (should intelligent design be taught in schools?), medical regulation (should homeopathy be reimbursed by health insurance?), and legal evidence (what scientific testimony is admissible in court?). The Daubert standard in US law attempts to operationalize scientific validity for courtroom purposes.
What is a p-value and why is it so commonly misunderstood?
A p-value is the probability of observing data at least as extreme as the actual data, given that the null hypothesis is true. In the canonical test, the null hypothesis is that there is no effect: for example, that a new drug performs no better than a placebo. If the p-value is 0.03, it means that if the drug had no effect, there would be only a 3% probability of seeing results at least as extreme as those observed, simply by chance. By convention (established largely by R.A. Fisher in the 1920s), a p-value below 0.05 is considered 'statistically significant.' This threshold is widely misunderstood in four specific ways. First, a p-value is not the probability that the null hypothesis is true; it is the probability of the observed data given the null hypothesis — a very different thing. Second, statistical significance is not the same as practical or clinical significance. A study with a very large sample can detect effects too small to be of any real-world importance with p less than 0.001. Effect sizes — how big the difference actually is — are as important as whether a result is statistically significant. Third, the 0.05 threshold is arbitrary, not magical. Fourth, p-values from underpowered studies (too few participants) are highly variable and frequently do not replicate. The American Statistical Association issued a 2016 statement clarifying these misunderstandings, and there have been calls to replace the 0.05 significance threshold with pre-specified effect size estimates, with Bayesian methods that explicitly incorporate prior probabilities, or with replication requirements before publication. The replication crisis has been substantially driven by the widespread misuse and misunderstanding of p-values.
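A short simulation makes the definition concrete. Under the null hypothesis — here, a drug identical to placebo — about 5% of experiments will cross the 0.05 threshold by chance alone; that, and only that, is what the threshold controls. A sketch with illustrative parameters:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_trials, n_patients = 10_000, 50

# Simulate many trials in which the drug truly does nothing:
# both groups are drawn from the same distribution.
drug = rng.normal(size=(n_patients, n_trials))
placebo = rng.normal(size=(n_patients, n_trials))
p = stats.ttest_ind(drug, placebo).pvalue  # one p-value per simulated trial

print((p < 0.05).mean())  # ~0.05: the false-positive rate, by construction
```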
Does science determine values, and can it tell us what we ought to do?
This is the is-ought problem first articulated by David Hume in 'A Treatise of Human Nature' (1739): from statements about what is the case, you cannot derive statements about what ought to be the case without some additional normative premise. Science describes and explains the natural and social world; it cannot by itself prescribe how we should act. A simple example: science can tell us that burning fossil fuels increases atmospheric CO2 and that rising CO2 concentrations will cause significant warming with specific consequences. It cannot tell us, without additional value premises, how much we should sacrifice now to reduce those consequences, how harms and benefits should be distributed across generations, or what obligations wealthy countries have to poorer ones. These are ethical and political questions that science informs but cannot resolve. This does not mean science and values are entirely separate. Values influence which questions scientists choose to study, how results are framed and communicated, and which applications receive funding. Helen Longino's social epistemology argues that scientific objectivity is not a property of individual scientists but of scientific communities with appropriate critical norms: diverse perspectives, public evidence standards, and responsiveness to criticism. Feminist philosophers of science have argued that male-dominated research communities have systematically understudied questions about women's health and over-interpreted data about sex differences. These are not criticisms of science but of specific scientific practices that violate objectivity as a norm. The appropriate relationship between science and values is one of mutual constraint rather than dominance in either direction: values should not determine scientific findings, and scientific findings should inform but not dictate value choices.
How does peer review work and what are its limits?
Peer review is the process by which submitted scientific manuscripts are evaluated by independent experts in the relevant field before publication. In the dominant model, authors submit manuscripts to journals, editors make initial assessments, and manuscripts that pass this gate are sent to typically two or three reviewers who assess the validity of the methods, the quality of the evidence, and the significance of the findings. Reviewers recommend acceptance, revision, or rejection. Peer review serves important functions: it catches methodological errors, ensures engagement with the relevant literature, improves the clarity of papers, and provides a quality filter for the published literature. However, its limitations are substantial and well-documented. Peer review is slow, often taking months or years from submission to publication. It is unreliable: studies of peer review have found that reviewers frequently disagree with each other, that journals sometimes reject papers that later become highly cited, and that reviewers fail to detect deliberate errors inserted into manuscripts for testing purposes. It does not detect fraud: the systematic fabrication of data by Diederik Stapel in social psychology, Hwang Woo-suk in stem cell research, and hundreds of others passed peer review without detection. It is subject to bias: studies have found reviewer bias based on authors' institutional affiliation, nationality, and, in single-blind review (where reviewers know the authors' identities), gender. Pre-prints — making manuscripts publicly available before peer review, as is standard in physics and increasingly in biology and medicine — address the speed problem and enable broader scrutiny but create different challenges for science communication, since unchecked pre-prints are sometimes reported as established findings.