In October 1936, the Literary Digest magazine mailed ten million ballots to American voters asking whom they planned to support in the presidential election. More than two million people responded, and the results were unambiguous: Republican Alf Landon would defeat incumbent Franklin Roosevelt by 57 to 43 percent. The Digest had accurately predicted the winner in every presidential election since 1916. Its sample was enormous. It was wrong by 19 percentage points: Roosevelt won, 62 to 38, in one of the most decisive electoral landslides in American history.
The disaster was a lesson in the difference between a large sample and a representative one. The Digest drew its list from automobile registrations and telephone directories -- sources that in Depression-era America skewed sharply toward the affluent and, as it happened, toward Republicans. George Gallup, using a quota sample of roughly 50,000 respondents stratified to match the demographic composition of the electorate, got within 7 percentage points. Gallup went further: he predicted in advance not just the election outcome but also, approximately, what the Digest's flawed poll would say and why it would be wrong. The episode established that statistical method -- how you sample, how you analyze, what assumptions you make explicit -- matters more than sample size.
Statistics is the science of learning from data in the presence of uncertainty. It encompasses the design of studies, the collection and organization of data, the summary and visualization of what has been observed, and the inferential machinery for drawing general conclusions from particular observations while quantifying how confident those conclusions should be. At its heart, it is a theory of how evidence relates to belief -- a question that has been answered in at least two incompatible ways, by the frequentist and Bayesian traditions, and a question whose practical implications continue to generate controversy in every field that uses data.
"To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of." -- Ronald A. Fisher, 1938
Key Definitions
Descriptive statistics: Quantities that summarize and characterize a dataset -- mean, median, standard deviation, correlation, percentiles -- without inferring beyond the observed data.
Inferential statistics: Methods for drawing conclusions about a population or data-generating process from a sample, with quantified uncertainty.
Null hypothesis: The default assumption that a treatment has no effect or that an observed association is due to chance; hypothesis testing evaluates the evidence against it.
p-value: The probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true; a small p-value is evidence against the null but does not establish the alternative hypothesis.
Confidence interval: A range of parameter values computed from the data such that, if the procedure were repeated many times, a specified fraction (commonly 95 percent) of the intervals would contain the true parameter value.
Posterior distribution: In Bayesian statistics, the updated probability distribution over unknown parameters after observing data, computed by combining the prior distribution with the likelihood.
The Founders: Galton, Pearson, and Fisher
The Birth of Modern Statistics
Statistics before the late nineteenth century was largely the collection of numerical facts about states -- the literal meaning of "statistics," from the German Statistik. The mathematical theory of how to draw inferences from data under uncertainty was largely undeveloped.
Francis Galton, Victorian polymath and cousin of Darwin, transformed the field while trying to understand inheritance. Galton's studies of the heights of parents and children revealed that children of tall parents tend to be tall, but less tall than their parents -- a phenomenon he called regression to the mean. This observation led him to develop the mathematical concept of regression, the technique of fitting a line through a scatter of points to summarize the relationship between two variables. Karl Pearson, Galton's protege at University College London, formalized Galton's correlation coefficient, introduced the standard deviation, developed the chi-squared goodness-of-fit test, and founded the journal Biometrika -- establishing biometrics, the application of statistical methods to biological data, as a distinct discipline.
Ronald A. Fisher arrived at Rothamsted Experimental Station in 1919 and over the next decade produced work that transformed scientific practice across the natural and social sciences. Statistical Methods for Research Workers (1925) assembled techniques for the analysis of experimental data -- analysis of variance, the t-test, the correlation ratio -- in a form accessible to working scientists. The Design of Experiments (1935) made the equally important argument that statistical considerations must inform the design of studies, not merely the analysis: experiments should be randomized to control for hidden confounders, should be replicated to separate real effects from chance variation, and should include proper controls.
The p-value and Its Discontents
Fisher introduced the p-value as a measure of evidence: the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. He suggested that p < 0.05 provided "moderate evidence" against the null and p < 0.01 provided "strong evidence," but he never intended these thresholds to function as a binary decision criterion. Evidence, for Fisher, was a matter of degree.
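A property worth internalizing is that when the null hypothesis is true, p-values are uniformly distributed, so about 5 percent of them fall below 0.05 by chance alone. The minimal Python simulation below (a z-test with known variance, an illustrative choice rather than anything from the history above) makes this concrete:

```python
import math
import random

def z_test_p_value(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value for a z-test of H0: mean == mu0, known sigma."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    # P(|Z| >= |z|) under the standard normal null distribution
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(0)
# When the null is true, p-values are uniform: about 5% fall below 0.05
p_values = [z_test_p_value([random.gauss(0, 1) for _ in range(30)])
            for _ in range(2000)]
frac_below = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"fraction of null p-values below 0.05: {frac_below:.3f}")  # near 0.05
```

This is also why a fixed 0.05 threshold caps the false positive rate at 5 percent for a single test under the null -- and why running many such tests guarantees some spurious "significant" findings.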
Jerzy Neyman and Egon Pearson developed a competing framework in the 1930s. Rather than measuring evidence against a null hypothesis, they proposed a decision procedure: specify an alternative hypothesis, an error rate alpha (Type I error, false positive), and use the data to choose between the null and alternative in a way that guarantees long-run error control. The Neyman-Pearson framework is the basis of modern hypothesis testing, confidence intervals, and power analysis.
Fisher and Neyman-Pearson disagreed vigorously, and their dispute was as much philosophical as technical. Fisher rejected the requirement to specify an alternative hypothesis as scientifically unrealistic. Neyman rejected Fisher's p-value as providing no decision-theoretic guidance. Their followers merged elements of both frameworks into the "hybrid" null hypothesis significance testing (NHST) that dominates current practice -- a synthesis that neither originator endorsed and that has logical inconsistencies that its critics have enumerated in detail.
Frequentist and Bayesian Approaches
Probability as Frequency
The frequentist interpretation defines probability as the limit of a relative frequency in an infinitely repeated series of identical trials. This makes probability precise and objective but limits its applicability: it makes no sense to speak of the probability that a particular hypothesis is true or that a particular parameter has a particular value, since hypotheses and parameters do not have frequencies of occurrence. Frequentist inference can only speak about the long-run properties of procedures.
A 95 percent confidence interval does not mean that the parameter has a 95 percent probability of falling within the interval. It means that if the procedure were repeated many times, 95 percent of the resulting intervals would contain the true parameter value. The subtle but important distinction between procedure-level statements and object-level probability statements is routinely ignored in practice and in journalism, contributing to widespread misinterpretation.
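The procedure-level meaning of a confidence interval can be checked directly by simulation. This sketch (assuming, for simplicity, a normal model with known variance) repeats the interval construction many times and counts how often the interval captures the true mean:

```python
import math
import random

def mean_ci(sample, sigma=1.0, z=1.96):
    """95% interval for a normal mean with known sigma."""
    n = len(sample)
    m = sum(sample) / n
    half = z * sigma / math.sqrt(n)
    return m - half, m + half

random.seed(1)
true_mu, trials, covered = 10.0, 1000, 0
for _ in range(trials):
    sample = [random.gauss(true_mu, 1.0) for _ in range(25)]
    lo, hi = mean_ci(sample)
    covered += lo <= true_mu <= hi   # did this interval capture the truth?
print(f"coverage over {trials} repetitions: {covered / trials:.3f}")  # near 0.95
```

Any single interval either contains the true mean or it does not; the 95 percent applies to the procedure, which is exactly what the simulation measures.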
Probability as Belief
Thomas Bayes, an eighteenth-century English minister, proved in an essay published posthumously in 1763 that beliefs about the probability of a hypothesis should update in proportion to the likelihood of the observed evidence given that hypothesis. Pierre-Simon Laplace developed Bayes' theorem into a general theory of inverse probability -- inferring causes from effects -- in his Theorie Analytique des Probabilites (1812).
Harold Jeffreys at Cambridge kept the Bayesian tradition alive in the early twentieth century, developing objective Bayesian methods based on priors derived from symmetry and invariance principles rather than personal beliefs, and arguing strenuously with Fisher across a series of papers and books. Bruno de Finetti and Leonard J. Savage developed subjective Bayesianism, in which probability is a coherent degree of belief that must obey the axioms of probability theory but need not be grounded in frequency.
The Bayesian revival of the 1990s was computational: Andrew Gelman, John Carlin, Hal Stern, and Donald Rubin's Bayesian Data Analysis (first edition 1995) provided a practical framework for Bayesian inference in complex models. Markov chain Monte Carlo (MCMC) algorithms -- especially the Gibbs sampler and the Metropolis-Hastings algorithm -- made it possible to draw samples from posterior distributions that had no closed-form expression, turning Bayesian methods from a theoretical aspiration into a computational reality. Modern probabilistic programming languages such as Stan, PyMC, and BUGS allow researchers to specify arbitrarily complex hierarchical models and fit them with MCMC, making Bayesian inference routine.
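The mechanics of MCMC fit in a few lines. The sketch below is a random-walk Metropolis sampler for a toy binomial model with a uniform prior (an illustrative example, far simpler than what Stan or PyMC do, but the same idea): propose a move, accept it with probability governed by the posterior ratio, and treat the visited points as posterior draws.

```python
import math
import random

def log_posterior(theta, k, n):
    """Log posterior for a binomial success probability with a uniform prior."""
    if not 0 < theta < 1:
        return -math.inf
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

def metropolis(k, n, steps=20000, step_size=0.1, seed=2):
    rng = random.Random(seed)
    theta = 0.5
    lp = log_posterior(theta, k, n)
    samples = []
    for _ in range(steps):
        prop = theta + rng.gauss(0, step_size)       # symmetric random-walk proposal
        lp_prop = log_posterior(prop, k, n)
        if math.log(rng.random()) < lp_prop - lp:    # Metropolis accept/reject
            theta, lp = prop, lp_prop
        samples.append(theta)
    return samples[steps // 4:]                      # discard burn-in

samples = metropolis(k=7, n=10)   # 7 successes in 10 trials
post_mean = sum(samples) / len(samples)
print(f"posterior mean: {post_mean:.3f}")  # exact Beta(8, 4) mean is about 0.667
```

Here the exact posterior is known (Beta(8, 4)), which lets you verify the sampler; in realistic hierarchical models no closed form exists, and MCMC is the only practical route.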
Regression and Causal Inference
Ordinary Least Squares and Its Limits
Ordinary least squares (OLS) regression, which minimizes the sum of squared differences between observed and fitted values, is the most widely used statistical method. It estimates the average association between a dependent variable and one or more predictors, controlling for the predictors included in the model. The method is linear, computationally simple, and has well-understood properties under standard assumptions.
The fundamental limitation of regression in observational data is confounding: the association between a predictor and an outcome may be due to a common cause (confounder) of both, not a causal relationship between them. Adding more control variables reduces confounding only if the right variables are controlled; controlling for the wrong variables (mediators or colliders) can introduce bias rather than remove it.
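A small simulation shows how an omitted confounder distorts a regression coefficient. In this sketch (entirely hypothetical data), x has no causal effect on y, yet regressing y on x alone reports a slope near 1; including the confounder recovers a slope near zero:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
confounder = rng.normal(size=n)            # common cause of both x and y
x = confounder + rng.normal(size=n)        # x has NO causal effect on y
y = 2.0 * confounder + rng.normal(size=n)

def ols(X, y):
    """OLS coefficients with an intercept, via least squares."""
    X = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

naive = ols(x.reshape(-1, 1), y)[1]                     # omits the confounder
adjusted = ols(np.column_stack([x, confounder]), y)[1]  # controls for it
print(f"naive slope: {naive:.2f}, adjusted slope: {adjusted:.2f}")  # near 1, near 0
```

The same mechanism in reverse -- conditioning on a collider -- can manufacture an association where none exists, which is why adding controls indiscriminately is not safe.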
The Credibility Revolution
The "credibility revolution" in economics, associated with David Card, Alan Krueger, Joshua Angrist, and Guido Imbens (Card, Angrist, and Imbens shared the 2021 Nobel Memorial Prize in Economic Sciences), developed a set of quasi-experimental designs for estimating causal effects from observational data.
Card and Krueger's 1994 study of minimum wage effects used a natural experiment: New Jersey raised its minimum wage in April 1992 while neighboring Pennsylvania did not. By comparing changes in fast-food employment in New Jersey and Pennsylvania before and after the increase, they estimated the causal effect of the minimum wage on employment -- a finding that challenged the consensus that minimum wage increases reduce employment and sparked decades of subsequent research and methodological refinement.
Instrumental variables use an external "instrument" -- a variable that affects the treatment but has no direct effect on the outcome -- to isolate exogenous variation in treatment. Regression discontinuity designs exploit sharp thresholds in treatment assignment (such as a test score cutoff for a scholarship) to compare observations just above and below the threshold. Difference-in-differences compares the change in outcomes for a treated group to the contemporaneous change for an untreated comparison group.
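With only the four group means, the difference-in-differences estimate is simple arithmetic. The numbers below are hypothetical (not the Card-Krueger figures):

```python
# Hypothetical mean employees per restaurant (NOT the Card-Krueger data)
treated_before, treated_after = 20.4, 21.0   # state that raised its minimum wage
control_before, control_after = 23.3, 21.2   # neighboring state, no change

# The treated group's change, minus the control group's contemporaneous
# change, nets out the time trend shared by both states.
did = (treated_after - treated_before) - (control_after - control_before)
print(f"difference-in-differences estimate: {did:+.1f}")  # +2.7
```

The design's key assumption is parallel trends: absent treatment, both groups would have moved together, so the control group's change is a valid counterfactual for the treated group.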
Judea Pearl's causal inference framework, developed in Causality (2000) and popularized in The Book of Why (2018, with Dana Mackenzie), provides a formal language based on directed acyclic graphs (DAGs) for representing causal assumptions and deriving the conditions under which causal effects can be estimated from observational data. Pearl's do-calculus distinguishes between observing a variable take a value and intervening to set it to a value -- the distinction between correlation and causation formalized mathematically.
Donald Rubin's potential outcomes framework, developed in a series of papers from the 1970s, provides an alternative but related approach. It defines a causal effect as the difference between the outcome under treatment and the outcome under control for the same unit -- the fundamental problem of causal inference being that only one of these potential outcomes can ever be observed.
Sampling and Survey Methods
Probability Sampling
Sound statistical inference requires that the sample be drawn from the population of interest according to a known probability mechanism. In simple random sampling, every unit has an equal probability of selection; this guarantees that the sample is representative in expectation and provides the basis for valid interval estimation.
Stratified sampling divides the population into subgroups and samples each separately, allocating sample size to strata either in proportion to their size (proportional allocation) or in proportion to stratum size times within-stratum standard deviation (optimal, or Neyman, allocation), which maximizes precision for a given total sample size. Stratification by age, education, and region is standard in political polling.
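Neyman allocation can be computed directly: each stratum receives sample in proportion to its population size times its within-stratum standard deviation, so small but highly variable strata are deliberately oversampled. A sketch with hypothetical strata:

```python
def neyman_allocation(strata, total_n):
    """Allocate total_n across strata proportional to population size
    times within-stratum standard deviation (Neyman allocation)."""
    weights = {name: size * sd for name, (size, sd) in strata.items()}
    total = sum(weights.values())
    return {name: round(total_n * w / total) for name, w in weights.items()}

# Hypothetical strata: name -> (population size, estimated outcome std. dev.)
strata = {"urban": (60_000, 4.0), "suburban": (30_000, 2.0), "rural": (10_000, 8.0)}
alloc = neyman_allocation(strata, total_n=1000)
print(alloc)  # the small but variable rural stratum is oversampled
```

Rounding can leave the total a unit or two off the target; production samplers add a final adjustment step.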
Cluster sampling selects groups of units rather than individual units. It is practical when a complete sampling frame of individuals is unavailable but a list of clusters (schools, hospitals, geographic areas) exists. Multilevel modeling, associated with Harvey Goldstein, Stephen Raudenbush, and Anthony Bryk, provides methods for analyzing cluster-sampled data while correctly accounting for the correlation within clusters.
Polling in the Modern Era
Political polling illustrates both the power and the fragility of survey sampling. Cell phone proliferation has complicated the random-digit-dialing methods that supported probability sampling of households in the 1970s through 1990s. Response rates have fallen from above 70 percent to below 10 percent in many surveys, making the assumption of ignorable non-response increasingly difficult to defend.
Modern polls weight their respondents by demographic characteristics and, increasingly, by partisan identification and reported 2016 and 2020 vote, to correct for differential response rates. The 2016 US presidential election, in which national polls showed Hillary Clinton ahead while state-level polls systematically underestimated Donald Trump's support in Wisconsin, Michigan, and Pennsylvania, exposed the limits of this machinery. Post-election analysis by the American Association for Public Opinion Research found that many state polls had failed to weight by education: college-educated white voters were overrepresented among respondents and voted more Democratic than non-college whites.
Big Data and the Multiple Testing Problem
Large Data and Spurious Patterns
The era of large observational datasets has revealed a tension between statistical significance and scientific validity. With enough data, almost any association will achieve p < 0.05, however tiny and practically irrelevant the effect size. David Lazer and colleagues' 2014 paper in Science, "The Parable of Google Flu: Traps in Big Data Analysis," documented how Google's influenza surveillance system -- which used search query data to predict flu incidence -- dramatically overestimated flu activity during the 2012-13 season after performing well in its early years. The system had been optimized on historical data and had overfit to patterns that did not generalize; it exemplified the danger of large-scale pattern matching without careful statistical discipline.
Tyler Vigen's website "Spurious Correlations" illustrates the problem entertainingly: per capita cheese consumption in the US correlates with deaths by bedsheet tangling at r = 0.95 over the 2000s, a completely meaningless association that is statistically significant because both variables share a temporal trend. With enough variables, such coincidences are inevitable.
Multiple Testing Corrections
The Bonferroni correction addresses the multiple testing problem by requiring each test's p-value to fall below alpha/k, where k is the number of tests. It controls the family-wise error rate (the probability of any false positive) at alpha, but it is conservative: it ignores correlations between tests and loses power severely when many tests are conducted.
Yoav Benjamini and Yosef Hochberg (1995) introduced the false discovery rate (FDR) as a less conservative criterion: control the expected proportion of significant findings that are false discoveries, rather than the probability of any false discovery. Their BH procedure is widely used in genomic association studies, neuroimaging, and other high-dimensional settings where thousands of simultaneous tests are conducted and the goal is to prioritize findings for follow-up rather than provide ironclad individual guarantees.
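The BH procedure itself is short: sort the p-values, find the largest rank k whose p-value is at or below q * k / m, and reject the hypotheses with the k smallest p-values. A sketch with made-up p-values:

```python
def benjamini_hochberg(p_values, q=0.05):
    """Indices of hypotheses rejected at FDR level q (BH step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            k = rank                       # largest rank whose p-value passes
    return sorted(order[:k])               # reject the k smallest p-values

# Made-up p-values from m = 10 hypothetical tests
p = [0.001, 0.004, 0.012, 0.018, 0.020, 0.3, 0.4, 0.5, 0.6, 0.9]
rejected = benjamini_hochberg(p, q=0.05)
print(rejected)   # BH rejects the five smallest: [0, 1, 2, 3, 4]
# A Bonferroni threshold of 0.05 / 10 = 0.005 would reject only the first two.
```

The comparison in the final comment is the whole point: by tolerating a controlled fraction of false discoveries, BH retains far more power than family-wise control.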
Regularization and Modern Regression
High-dimensional settings -- many predictors relative to observations -- create new problems for ordinary least squares: the model can perfectly fit the training data while generalizing poorly to new observations. Regularization methods address this by penalizing model complexity. Ridge regression adds a penalty proportional to the sum of squared coefficients, shrinking estimates toward zero. The LASSO (least absolute shrinkage and selection operator), introduced by Robert Tibshirani (1996), adds a penalty proportional to the sum of absolute coefficient values, producing sparse models where many coefficients are exactly zero and thereby performing variable selection. Elastic net combines both penalties. These methods, along with cross-validation for tuning the penalty strength, have become standard in applied statistics and machine learning.
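Ridge regression has a closed form that makes the shrinkage visible. This sketch uses simulated data with only three real signals among 40 predictors and compares coefficient norms with and without the penalty (the LASSO requires an iterative solver, so ridge is shown for brevity):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression via the closed form (X'X + lam*I)^(-1) X'y.
    lam = 0 recovers ordinary least squares."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(4)
n, p = 50, 40                                  # many predictors, few observations
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]                    # only three predictors matter
y = X @ beta + rng.normal(size=n)

b_ols = ridge(X, y, lam=0.0)                   # unpenalized fit
b_ridge = ridge(X, y, lam=10.0)                # penalty shrinks coefficients
print(f"OLS norm: {np.linalg.norm(b_ols):.1f}, "
      f"ridge norm: {np.linalg.norm(b_ridge):.1f}")
```

In practice the penalty strength lam is not chosen by hand but by cross-validation, as the text notes.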
Bootstrap resampling, introduced by Bradley Efron (1979), provides a nonparametric approach to estimating the sampling distribution of virtually any statistic: resample the observed data with replacement repeatedly, compute the statistic on each resample, and use the distribution of resample statistics to construct confidence intervals and test hypotheses without assuming a parametric model. Permutation tests provide a related approach to hypothesis testing: randomly shuffle the assignment of observations to groups many times and assess whether the observed test statistic is unusual relative to the permutation distribution.
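A percentile bootstrap interval takes only a few lines. This sketch (made-up data) computes a 95 percent interval for the median, a statistic with no convenient closed-form standard error:

```python
import random
import statistics

def bootstrap_ci(data, stat, n_boot=5000, alpha=0.05, seed=5):
    """Percentile bootstrap CI: resample with replacement, recompute the
    statistic each time, and take quantiles of the resample distribution."""
    rng = random.Random(seed)
    n = len(data)
    boots = sorted(stat([rng.choice(data) for _ in range(n)])
                   for _ in range(n_boot))
    return boots[int(n_boot * alpha / 2)], boots[int(n_boot * (1 - alpha / 2))]

data = [2.1, 3.4, 1.8, 5.6, 4.4, 2.9, 3.7, 6.1, 2.2, 4.8, 3.1, 5.0]  # made-up
lo, hi = bootstrap_ci(data, statistics.median)
print(f"95% bootstrap CI for the median: ({lo:.2f}, {hi:.2f})")
```

The same resampling loop works for any statistic -- a trimmed mean, a correlation, a regression coefficient -- which is what made the bootstrap so broadly useful.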
Statistics in Public Life
Medical Statistics and Risk Communication
Gerd Gigerenzer's research, documented in Calculated Risks (2002; published in the UK as Reckoning with Risk), showed that patients, physicians, and medical journalists systematically misunderstand risk information when it is expressed in relative terms. When a screening test for cancer is said to "reduce mortality by 25 percent," most readers take this as a substantial benefit; when the same effect is expressed as reducing the absolute death rate from 4 per 1000 to 3 per 1000, the modest absolute benefit becomes visible. Gigerenzer advocated natural frequencies -- 3 out of 1000 rather than 0.3 percent -- as the most transparent format for communicating risks, a recommendation supported by experimental evidence.
Base rate neglect is equally common in medical contexts. A diagnostic test with sensitivity and specificity both at 99 percent applied to a condition with 1 percent prevalence will produce more false positives than true positives in a population screening program: of 100 positive tests, only about 50 will be true cases. Physicians routinely overestimate the probability that a positive test indicates disease because they fail to account for the prior probability of the condition.
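The arithmetic behind this example is one direct application of Bayes' theorem:

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test) by Bayes' theorem."""
    true_pos = prevalence * sensitivity               # P(positive and diseased)
    false_pos = (1 - prevalence) * (1 - specificity)  # P(positive and healthy)
    return true_pos / (true_pos + false_pos)

ppv = positive_predictive_value(prevalence=0.01, sensitivity=0.99, specificity=0.99)
print(f"P(disease | positive) = {ppv:.2f}")  # 0.50: only half of positives are real
```

Because the healthy population is 99 times larger than the diseased one, even a 1 percent false positive rate produces as many false positives as true positives.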
Election Forecasting and the Bayesian Moment
Nate Silver's FiveThirtyEight, founded in 2008, introduced explicit probabilistic election forecasting to a mainstream American audience. Silver's model aggregates polls using a Bayesian framework that accounts for house effects (systematic biases of individual polling organizations), the predictive value of economic indicators, and the correlation of outcomes across states. The model outputs probability distributions over outcomes, not point predictions, and Silver became notable for explaining and defending probabilistic thinking against demands for a single confident prediction.
The broader lesson is that statistical thinking is teachable and communicable. The alternative -- intuitive confidence with hidden uncertainty -- produces decisions based on false certainty, surprise at inevitable variance, and overconfident updating after results. Where statistical literacy is higher, public discourse about risk, evidence, and policy can be more honest.
References
Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver and Boyd.
Fisher, R. A. (1935). The Design of Experiments. Oliver and Boyd.
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society A, 231, 289-337.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (1995). Bayesian Data Analysis. Chapman and Hall.
Card, D., & Krueger, A. B. (1994). Minimum wages and employment: A case study of the fast-food industry in New Jersey and Pennsylvania. American Economic Review, 84(4), 772-793.
Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B, 57(1), 289-300.
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA's statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129-133.
Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). The parable of Google flu: Traps in big data analysis. Science, 343(6176), 1203-1205.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58(1), 267-288.
Gigerenzer, G. (2002). Reckoning with Risk: Learning to Live with Uncertainty. Penguin.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7(1), 1-26.
Frequently Asked Questions
What is statistics and what are its main branches?
Statistics is the science of learning from data in the presence of uncertainty. It provides methods for collecting data systematically, summarizing and describing what has been observed, drawing inferences about broader phenomena from limited samples, and quantifying the uncertainty in those inferences.

Descriptive statistics summarizes and characterizes a dataset without making inferences beyond it. Measures such as the mean, median, variance, standard deviation, and correlation coefficient are descriptive. The foundations of descriptive statistics were largely laid in the late nineteenth century by Francis Galton, who invented the concepts of regression and correlation while studying the inheritance of traits, and Karl Pearson, who formalized the standard deviation, the correlation coefficient, and the chi-squared test while developing biometrics as a discipline.

Inferential statistics uses sample data to draw conclusions about a larger population and to test hypotheses about the processes that generated the data. Ronald Fisher's Statistical Methods for Research Workers (1925) and The Design of Experiments (1935) established the foundations of modern inferential statistics, including analysis of variance, randomization, the concept of the p-value, and the principle that experiments should be designed, not merely analyzed. Jerzy Neyman and Egon Pearson (Karl's son) developed a competing framework based on hypothesis testing with explicit error rates, and their theoretical clash with Fisher shaped the history of statistics through the twentieth century and into the present debates over the replication crisis.
What is the difference between frequentist and Bayesian statistics?
Frequentist statistics interprets probability as the long-run frequency of an event in repeated experiments. A probability of 0.05 means that if the same experiment were run an infinite number of times under identical conditions, the event would occur in 5 percent of trials. Under this interpretation, parameters of a model are fixed unknown quantities, not random variables; one cannot assign them probabilities. Confidence intervals, null hypothesis significance testing, and p-values are frequentist constructs.

Bayesian statistics, named after the Reverend Thomas Bayes whose posthumously published essay of 1763 first formalized the key theorem, interprets probability as a degree of belief or confidence in a proposition. Under this interpretation, parameters can have probability distributions representing uncertainty about their values. Bayes' theorem specifies how beliefs should update: the posterior probability is proportional to the prior probability times the likelihood of the data given the parameter. Prior + data = posterior.

The philosophical difference generates practical differences. Bayesian analysis requires a prior distribution, which critics view as subjective and potentially influential on conclusions; Bayesian advocates argue that all analyses embody assumptions and that making the prior explicit is more honest than hiding assumptions in frequentist choices of test and model. Harold Jeffreys at Cambridge developed objective Bayesian methods using priors derived from the structure of the problem, not personal beliefs, engaging in extended polemical exchange with Fisher in the 1930s and 1940s.

Modern Bayesian computation, pioneered by Andrew Gelman, Donald Rubin, John Carlin, Hal Stern, and others in the 1990s, made Bayesian inference practical for complex models through Markov chain Monte Carlo (MCMC) algorithms, triggering a revival that has made Bayesian methods mainstream in many fields.
What is a p-value and why is it controversial?
A p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. If you observe a result with p = 0.03, it means that results this extreme or more extreme occur in only 3 percent of studies when the null hypothesis holds. Ronald Fisher introduced the p-value in the 1920s as a measure of the strength of evidence against the null hypothesis, suggesting that p < 0.05 provides reasonable evidence worth reporting.

Fisher never intended the 0.05 threshold to be a decision criterion, but Neyman and Pearson's hypothesis-testing framework, which requires specifying an alpha level (Type I error rate) before the experiment, hardened the 0.05 threshold into a binary publication criterion: p < 0.05 is 'significant' and publishable; p > 0.05 is not. This dichotomization has had damaging consequences for science.

The replication crisis, which crystallized in the early 2010s and was driven home in 2015 when Brian Nosek and the Open Science Collaboration found that fewer than half of 100 psychology studies replicated at p < 0.05, is substantially a crisis of statistical practice. Researchers engage in p-hacking (trying multiple analyses and reporting only those below 0.05), HARKing (hypothesizing after results are known), selective outcome reporting, and other practices that inflate false positive rates far above the nominal 5 percent. Publication bias -- journals prefer significant results -- compounds the problem.

The American Statistical Association's 2016 statement on p-values, authored by Ronald Wasserstein and Nicole Lazar, stated explicitly that a p-value does not measure the probability that a hypothesis is true, that a p-value does not measure the size or importance of an effect, and that p > 0.05 does not mean the null hypothesis is true. The statement stimulated extensive discussion about alternatives, including moving to confidence intervals, effect sizes, Bayesian credible intervals, or abandoning null hypothesis significance testing altogether.
How do statisticians distinguish correlation from causation?
Establishing causal claims from data is harder than establishing correlations. Two variables can be correlated because one causes the other, because the causation runs in the opposite direction, because both are caused by a third variable (confounding), or by chance. Ordinary regression analysis estimates associations, not causal effects; interpreting regression coefficients causally requires additional assumptions that may or may not hold.

The gold standard for causal inference is a randomized controlled trial (RCT), in which subjects are randomly assigned to treatment or control conditions. Randomization ensures that treatment assignment is independent of all other variables, both observed and unobserved, so that differences in outcomes can be attributed to the treatment. Fisher formalized the role of randomization in experimental design.

In observational settings where randomization is impossible, researchers use natural experiments -- situations where assignment to treatment is determined by external factors that are effectively random with respect to outcomes. David Card and Alan Krueger's famous 1994 study of the effect of minimum wage increases on employment used the natural experiment of a 1992 New Jersey minimum wage increase, comparing fast-food employment across the New Jersey-Pennsylvania border before and after the change. The difference-in-differences design used in that study, along with instrumental variables and regression discontinuity designs, constitutes the 'credibility revolution' in applied economics.

Judea Pearl's do-calculus (Pearl, 2000) provides a formal language for causal inference based on directed acyclic graphs (DAGs), which represent causal relationships as directed edges between variables. The framework distinguishes observational conditioning (adjusting for a variable in a regression) from interventional manipulation (setting a variable by external action) and provides rules for determining when causal quantities can be estimated from observational data.
How does sampling work and what are its main pitfalls?
A sample is a subset of a population drawn for the purpose of making inferences about the whole. Simple random sampling gives every member of the population an equal probability of selection, ensuring representativeness in expectation. Stratified sampling divides the population into subgroups (strata) and samples each separately, improving precision for characteristics that vary between strata. Cluster sampling divides the population into clusters (such as households or schools), randomly selects clusters, and samples all members within selected clusters; it is more practical when a full sampling frame is unavailable but less precise than simple random sampling.

The Literary Digest poll of 1936 is the most famous sampling disaster in history. The magazine mailed ten million ballots drawn from automobile registrations and telephone directories, received over two million responses, and predicted that Alf Landon would defeat Franklin Roosevelt by 57 to 43 percent. Roosevelt won in a landslide, 62 to 38. The sample was large but systematically biased: automobile owners and telephone subscribers in 1936 were disproportionately affluent and Republican. George Gallup predicted the outcome correctly using a much smaller but more representative quota sample.

Non-response bias occurs when those who respond to surveys differ systematically from those who do not. Response rates for telephone surveys have declined from over 70 percent in the 1970s to under 10 percent in many current surveys; this does not necessarily invalidate surveys if non-response is accounted for through weighting, but it makes the assumption of ignorable non-response harder to defend. Cell phone adoption has complicated random-digit-dialing sampling because cell phone users were historically younger and more mobile and could not be reached through listed telephone directories. Modern pollsters weight their samples by age, education, race, and past vote to correct for differential response rates.
What is the multiple testing problem and how is it addressed?
The multiple testing problem, also called the problem of multiplicity, arises when a researcher tests many hypotheses simultaneously. If each test is conducted at alpha = 0.05, the probability of at least one false positive increases rapidly with the number of tests: with 20 independent tests, the probability of at least one false positive under the global null is 1 - 0.95^20 = 0.64. In genomics, where researchers test hundreds of thousands of single-nucleotide polymorphisms for association with a disease, the number of expected false positives under naive testing is enormous.
The Bonferroni correction is the simplest and most conservative adjustment: divide the alpha threshold by the number of tests. For 20 tests at alpha = 0.05, the per-test threshold becomes 0.0025. The correction controls the family-wise error rate (FWER) -- the probability of any false positive -- at alpha. It is appropriate when any false positive is unacceptable but is very conservative, losing power substantially when tests are numerous and positively correlated.
Yoav Benjamini and Yosef Hochberg (1995) developed the false discovery rate (FDR) procedure, which controls the expected fraction of significant findings that are false positives rather than the probability of any false positive. The BH procedure is less conservative than Bonferroni and is widely used in genomics, neuroimaging, and other high-dimensional settings.
Large datasets introduce a related problem: with enough data, trivially small effects become statistically significant. The distinction between statistical significance and practical significance -- effect size and its uncertainty -- is crucial. A difference of 0.001 points on a scale may be highly significant with a large enough sample while being completely irrelevant for any decision or understanding.
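The Benjamini-Hochberg procedure is mechanical enough to state in code. A sketch, with illustrative p-values: sort the m p-values, find the largest rank k for which the k-th smallest p-value is at most (k/m) times the FDR level q, and reject the hypotheses with the k smallest p-values.

```python
# Minimal sketch of the Benjamini-Hochberg FDR procedure. The p-values
# below are illustrative, not from any real study.

def benjamini_hochberg(pvalues, q=0.05):
    """Return indices of hypotheses rejected at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * q:
            k = rank  # largest rank whose p-value clears its threshold
    return sorted(order[:k])

pvals = [0.001, 0.010, 0.012, 0.041, 0.20, 0.31, 0.55, 0.74]
print(benjamini_hochberg(pvals, q=0.05))  # → [0, 1, 2]
```

Bonferroni at the same level would use a per-test threshold of 0.05 / 8 = 0.00625 and reject only the first hypothesis; BH rejects three, illustrating why it is preferred when some false discoveries are tolerable in exchange for power.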
How is statistics misused and misunderstood in public life?
Darrell Huff's How to Lie with Statistics (1954) documented a catalogue of graphical and arithmetical tricks used to mislead with data: truncated axes that exaggerate small differences, averages that misrepresent skewed distributions, samples that are unrepresentative, and correlations that are presented as causation. The book is still in print and its lessons remain relevant.
Relative risk and absolute risk are frequently confused in health reporting. A drug that reduces the risk of a disease from 2 percent to 1 percent has halved the relative risk -- a 50 percent reduction -- but reduced the absolute risk by only 1 percentage point. Both statements are true; the choice of framing strongly influences public perception of the drug's benefit. Gerd Gigerenzer (Reckoning with Risk, 2002) documented that patients, physicians, and journalists systematically overestimate benefits and underestimate harms when information is presented in relative rather than natural frequency terms, and developed training programs to improve medical risk communication.
Base rate neglect -- the failure to account for the prior probability of an event when evaluating evidence -- is pervasive. A test with 99 percent sensitivity and 99 percent specificity for a disease that affects 1 in 1000 people will produce roughly 10 false positives for every true positive in a population-wide screening; most people who test positive do not have the disease. This basic statistical fact is routinely misunderstood by patients, physicians, and jurors.
Nate Silver's The Signal and the Noise (2012) analyzed forecasting across domains -- weather, baseball, elections, economics -- and argued that explicit probabilistic forecasting, Bayesian updating, and intellectual humility about model uncertainty produce better predictions than overconfident point predictions.
Silver's FiveThirtyEight model, which presented its forecasts as probability distributions over outcomes rather than single point predictions, demonstrated to a wide audience that statistical thinking and explicit uncertainty quantification can be communicated publicly and productively.
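The base-rate calculation above can be verified directly with Bayes' theorem. A short sketch using the numbers from the screening example (99 percent sensitivity, 99 percent specificity, prevalence of 1 in 1000):

```python
# The screening-test example worked through with Bayes' theorem:
# P(disease | positive) = P(pos | disease) P(disease) / P(pos).

def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability of disease given a positive test result."""
    true_pos = sensitivity * prevalence            # diseased and positive
    false_pos = (1 - specificity) * (1 - prevalence)  # healthy but positive
    return true_pos / (true_pos + false_pos)

ppv = positive_predictive_value(0.99, 0.99, 0.001)
print(round(ppv, 3))  # → 0.09
```

Despite the test being "99 percent accurate" in both directions, a positive result implies only about a 9 percent chance of disease, because the 1 percent false-positive rate applies to the 999 in 1000 people who are healthy. This is exactly the roughly ten-to-one ratio of false to true positives stated above.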