The data is clear: sales increased 15% after the new strategy launched. Your hypothesis was right. Or was it? Did you check if sales were already trending up? Did competitors lose market share, suggesting external factors? Did you analyze only the successful segments and ignore failures? Did you keep testing until you found something significant?
Data doesn't speak for itself. It whispers ambiguously, and we hear what we want to hear. Confirmation bias, p-hacking, correlation-causation confusion, and dozens of other cognitive traps transform data into self-deception disguised as evidence. The result: confident conclusions based on flawed analysis, decisions that feel data-driven but aren't.
Understanding how we fool ourselves with data—and how to avoid it—is foundational to actually learning from evidence rather than using it to justify preconceptions.
The Fundamental Problem: We See What We Expect
As physicist Richard Feynman warned, "The first principle is that you must not fool yourself—and you are the easiest person to fool."
Confirmation Bias in Data Analysis
Cognitive bias: The tendency to search for, interpret, and recall information that confirms preexisting beliefs.
In data analysis:
- Look only for data supporting hypothesis
- Interpret ambiguous results favorably
- Remember confirming data, forget disconfirming
- Stop analyzing when results match expectations
The Motivated Reasoning Trap
Motivated reasoning: Reasoning driven by desired conclusion, not truth-seeking.
How it manifests:
| Stage | Motivated Analysis | Objective Analysis |
|---|---|---|
| Question formation | "Can I believe X?" (if yes, stop) | "Must I believe X?" (seek strong evidence) |
| Data collection | Seek confirming evidence | Seek disconfirming evidence |
| Analysis | Stop when results favorable | Pre-specify analysis plan |
| Interpretation | Favorable spin on ambiguity | Report limitations honestly |
Research (Kunda, 1990): People are highly skilled at constructing justifications for preferred conclusions.
"A reliable way to make people believe in falsehoods is frequent repetition, because familiarity is not easily distinguished from truth." — Daniel Kahneman, Thinking, Fast and Slow
Trap 1: Correlation ≠ Causation
The Classic Mistake
Observation: X and Y move together.
Conclusion: X causes Y.
Reality: Many other possibilities.
Why Things Correlate Without Causation
| Explanation | Example |
|---|---|
| Coincidence | Ice cream sales correlate with drowning deaths (both peak in summer) |
| Reverse causation | Does depression cause unemployment, or does unemployment cause depression? |
| Third variable | Height correlates with vocabulary in children (both caused by age) |
| Bidirectional | Does better sleep improve performance, or does success reduce stress → better sleep? |
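The third-variable row above can be reproduced in a few lines. The numbers and names (`age`, `height`, `vocab`) are illustrative, not from any real dataset: age drives both height and vocabulary, and the two end up strongly correlated with no direct causal link between them.

```python
# Sketch: a hidden third variable makes two causally unrelated
# series correlate. All numbers are made up for illustration.
import math
import random

random.seed(0)

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

age = [random.uniform(4, 12) for _ in range(1000)]           # confounder
height = [90 + 6 * a + random.gauss(0, 5) for a in age]      # caused by age
vocab = [500 + 300 * a + random.gauss(0, 400) for a in age]  # caused by age

r = pearson(height, vocab)
print(f"corr(height, vocab) = {r:.2f}")  # strong, despite no direct link
```

Controlling for the confounder (e.g., comparing children of the same age) would make the correlation largely disappear.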
The Gold Standard: Randomized Controlled Trials
Why RCTs work:
- Random assignment eliminates confounders
- Control group shows what would happen anyway
- Difference isolates causal effect
Why observational data is tricky:
- Can't assign randomly (unethical, impractical)
- Can't see counterfactual (what would have happened)
- Confounders hide everywhere
Example: The Hormone Replacement Therapy Reversal
Observational studies (1980s-1990s):
- Women taking HRT had lower heart disease
- Conclusion: HRT prevents heart disease
Randomized trial (Women's Health Initiative, 2002):
- HRT increased heart disease risk
- Reversal of medical advice
What went wrong?
- Confounding: Women who chose HRT were healthier, wealthier, better healthcare
- Correlation (HRT + good health) didn't mean causation (HRT → good health)
Tools to Establish Causation
| Method | Strength | Limitation |
|---|---|---|
| RCT | Gold standard | Often impractical/unethical |
| Natural experiments | Leverages real-world variation | Rare, assumptions required |
| Instrumental variables | Isolates exogenous variation | Hard to find valid instruments |
| Regression discontinuity | Strong causal inference | Requires sharp cutoff |
| Difference-in-differences | Controls for trends | Parallel trends assumption |
For non-experimental data: Causation claims require extraordinary evidence.
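As a sketch of the difference-in-differences row in the table, with made-up numbers: the control group reveals the background trend, and subtracting it isolates the treatment effect (under the parallel-trends assumption).

```python
# Difference-in-differences with illustrative numbers: both groups
# share a background trend; DiD subtracts it out.
treated_before, treated_after = 100.0, 130.0
control_before, control_after = 100.0, 115.0

trend = control_after - control_before          # what would have happened anyway
naive_effect = treated_after - treated_before   # 30, confounded by the trend
did = naive_effect - trend                      # 15, the causal estimate
print(did)  # 15.0
```

The naive before/after comparison overstates the effect by exactly the size of the shared trend.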
Trap 2: P-Hacking and Multiple Comparisons
What is P-Hacking?
P-hacking: Manipulating analysis until you achieve statistical significance (p < 0.05).
Common tactics:
| Tactic | Example |
|---|---|
| Try multiple tests | Test 20 variables; report the one that's significant |
| Flexible stopping | Keep collecting data until p < 0.05 |
| Dropping outliers | Remove data points that weaken result |
| Subgroup analysis | Test in multiple subgroups until one is significant |
| Outcome switching | If primary outcome fails, try secondary outcomes |
Why It's Deceptive
The multiple comparisons problem:
If you test 20 hypotheses with α = 0.05 (5% false positive rate):
- Expected false positives: 20 × 0.05 = 1
- You'll likely find at least one "significant" result by chance
Result: With 20 independent tests, p-hacking inflates the chance of at least one false positive from 5% to 1 − 0.95^20 ≈ 64%
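The familywise arithmetic is a one-line function:

```python
# With m independent tests at significance level alpha, the chance
# of at least one false positive is 1 - (1 - alpha)^m.
def familywise_error(m, alpha=0.05):
    return 1 - (1 - alpha) ** m

for m in (1, 5, 20, 100):
    print(f"{m:>3} tests -> P(at least one false positive) = {familywise_error(m):.0%}")
```

At 20 tests the familywise error rate is already about 64%; at 100 tests, a false positive is nearly guaranteed.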
"Statistical significance is the least interesting thing about the results. You should describe the results in terms of measures of magnitude — not just, does a treatment affect people, but how much does it affect people." — Andrew Gelman, Columbia University statistician
The Replication Crisis
Many published findings don't replicate:
| Field | Replication Rate |
|---|---|
| Psychology | ~36% (Open Science Collaboration, 2015) |
| Economics | ~50-60% |
| Preclinical cancer research | ~11% |
Contributing factor: P-hacking and publication bias (journals publish positive results, not null findings).
Solutions
Pre-registration
Method:
- Specify hypotheses, methods, analysis plan before seeing data
- Prevents post-hoc storytelling
Prevents:
- Outcome switching
- Flexible stopping
- Selective reporting
Bonferroni Correction
For multiple comparisons:
| Number of Tests | Required p-value |
|---|---|
| 1 | 0.05 |
| 5 | 0.01 (0.05 / 5) |
| 20 | 0.0025 (0.05 / 20) |
Principle: Adjust threshold for multiple testing.
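A minimal Bonferroni screen, assuming a plain list of p-values:

```python
# Bonferroni correction: divide alpha by the number of tests, then
# flag only p-values that clear the stricter threshold.
def bonferroni_significant(p_values, alpha=0.05):
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

pvals = [0.001, 0.02, 0.04, 0.30]     # illustrative p-values
print(bonferroni_significant(pvals))  # only 0.001 clears 0.05/4 = 0.0125
```

Note that 0.02 and 0.04 would pass an uncorrected 0.05 threshold but fail after correction.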
Replication
Best solution: Replicate finding with new data.
If it replicates: Likely real
If it doesn't replicate: Likely a false positive, or context-dependent
Trap 3: Base Rate Neglect
The Problem
Base rate neglect: Ignoring how common something is when interpreting evidence.
Classic Example: Medical Testing
Scenario:
- Disease prevalence: 1% (base rate)
- Test sensitivity: 90% (detects 90% of cases)
- Test specificity: 90% (correctly identifies 90% of healthy people)
You test positive. What's the probability you have the disease?
Intuitive answer: 90%
Correct answer: ~8%
The Math (Bayes' Theorem)
| Group | Population | Test Result | Count |
|---|---|---|---|
| Has disease | 10 (1% of 1,000) | 9 test positive (90% sensitivity) | 9 true positives |
| No disease | 990 | 99 test positive (10% false positive) | 99 false positives |
| Total positive tests | — | — | 108 |
P(Disease | Positive) = 9 / 108 = 8.3%
Key insight: Low base rate means most positives are false, even with accurate test.
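The worked table reduces to a short Bayes' theorem calculation: the positive predictive value from prevalence, sensitivity, and specificity.

```python
# Positive predictive value: P(disease | positive test).
# True positives come from the sick; false positives from the healthy.
def ppv(prevalence, sensitivity, specificity):
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

p = ppv(prevalence=0.01, sensitivity=0.90, specificity=0.90)
print(f"P(disease | positive) = {p:.1%}")  # ~8.3%
```

Raising the prevalence to 10% in the same call pushes the answer to 50%: the test hasn't changed, only the base rate.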
"The human mind is a pattern-seeking device. It will find patterns whether they exist or not." — Nassim Nicholas Taleb, The Black Swan
Application to Business Decisions
Example: Fraud detection
| Scenario | Base Rate | False Positive Impact |
|---|---|---|
| Low fraud rate (0.1%) | 1 in 1,000 | Even 99% accurate test → most "fraud" alerts are false |
| Implication | — | Can't act on every alert; need triage |
Lesson: Always consider base rates when interpreting diagnostic data.
Trap 4: Simpson's Paradox
The Phenomenon
Simpson's Paradox: A trend that appears in each of several groups reverses when the groups are combined.
Famous Example: UC Berkeley Gender Bias
Overall data (1973):
- Men admission rate: 44%
- Women admission rate: 35%
- Conclusion: Gender discrimination?
Department-level data:
| Department | Men Applied | Men Admit Rate | Women Applied | Women Admit Rate |
|---|---|---|---|---|
| A | 825 | 62% | 108 | 82% |
| B | 560 | 63% | 25 | 68% |
| C | 325 | 37% | 593 | 34% |
| D | 417 | 33% | 375 | 35% |
| E | 191 | 28% | 393 | 24% |
| F | 373 | 6% | 341 | 7% |
Reality:
- Women applied to more competitive departments (C-F)
- Within departments, women often admitted at higher rates
- Aggregation created false appearance of bias
Lesson
Aggregate data can mislead. Always check subgroups.
Mechanism: Confounding variable (department difficulty) drives both acceptance rate and gender distribution of applicants.
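Recomputing from the department table shows the reversal directly. Note the pooled rates here cover only these six departments, so they differ slightly from the campus-wide 44% and 35% figures quoted above.

```python
# Berkeley 1973 data from the table: (applicants, admit rate) per department.
men = {"A": (825, 0.62), "B": (560, 0.63), "C": (325, 0.37),
       "D": (417, 0.33), "E": (191, 0.28), "F": (373, 0.06)}
women = {"A": (108, 0.82), "B": (25, 0.68), "C": (593, 0.34),
         "D": (375, 0.35), "E": (393, 0.24), "F": (341, 0.07)}

def pooled_rate(groups):
    admitted = sum(n * rate for n, rate in groups.values())
    applied = sum(n for n, _ in groups.values())
    return admitted / applied

men_overall = pooled_rate(men)
women_overall = pooled_rate(women)
print(f"pooled: men {men_overall:.1%} vs women {women_overall:.1%}")

women_lead = sum(women[d][1] > men[d][1] for d in men)
print(f"departments where women's rate is higher: {women_lead} of 6")
```

Women out-admit men in four of six departments, yet trail in the pooled rate, because they applied disproportionately to the most selective departments.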
Trap 5: Survivorship Bias
What It Is
Survivorship bias: Analyzing only successes that "survived" some selection process, ignoring failures.
Classic Example: WWII Aircraft
Problem: Where to add armor to bombers?
Naive analysis:
- Examine returning planes
- Add armor where they have bullet holes
Correct analysis (Abraham Wald):
- Returning planes survived despite bullet holes
- Add armor where returning planes don't have holes (because planes hit there didn't return)
Business Applications
| Misleading Analysis | Missing Data | Corrected View |
|---|---|---|
| "Successful entrepreneurs are risk-takers" | Failed risk-takers (no longer visible) | Both successful and failed took risks; risk-taking doesn't predict success alone |
| "Top companies have great culture" | Failed companies that also had "great culture" | Correlation, not causation |
| "Dropped users weren't engaged" | Can't survey people who left | Exit interviews reveal actual reasons |
How to Avoid
Include the denominator:
- Don't just count successes
- Count total attempts (successes + failures)
- Survivorship rate = successes / total attempts
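A sketch with invented counts shows why the denominator matters: among survivors alone, risk-taking looks causal, but once failures are counted, risk-takers and cautious founders succeed at identical rates.

```python
# Hypothetical founder counts, chosen so that risk-taking is common
# but does not change the success rate at all.
risk_takers = {"succeeded": 90, "failed": 810}
cautious = {"succeeded": 10, "failed": 90}

def success_rate(group):
    return group["succeeded"] / (group["succeeded"] + group["failed"])

print(f"risk-takers: {success_rate(risk_takers):.0%}")  # 10%
print(f"cautious:    {success_rate(cautious):.0%}")     # 10%
# Survivor-only view: 90 of the 100 visible successes are
# risk-takers, which looks like risk-taking drives success.
```

The survivor-only sample is 90% risk-takers; the full sample shows risk-taking predicts nothing here.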
Trap 6: Regression to the Mean
The Phenomenon
Regression to the mean: Extreme values tend to be followed by more average values.
How It Fools Us
Scenario:
- Employee has terrible quarter (bottom 10%)
- Manager reprimands employee
- Next quarter, performance improves
- Manager concludes: "Reprimands work!"
Reality: Random variation. Extreme performance (good or bad) tends to revert toward average, regardless of intervention.
Sports Illustrated Cover Jinx
Observation: Athletes on Sports Illustrated cover often have worse performance after.
Explanation:
- Athletes appear on cover after exceptional performance (outlier)
- Next period, performance regresses toward their true average (appears as decline)
- No curse; just statistics
How to Detect
Indicators of regression to mean:
- Selection based on extreme outcome
- Intervention after extreme observation
- Improvement toward average afterward
Test: Include control group. If control also improves, likely regression to mean, not intervention effect.
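A small simulation makes the effect visible. The assumed model is performance = stable skill + fresh luck each quarter: selecting the worst quarter-one performers guarantees they "improve" in quarter two with no intervention at all.

```python
# Regression to the mean: the bottom decile in Q1 was selected
# partly for bad luck, which does not repeat in Q2.
import random

random.seed(42)
n = 10_000
skill = [random.gauss(0, 1) for _ in range(n)]
q1 = [s + random.gauss(0, 1) for s in skill]   # skill + luck
q2 = [s + random.gauss(0, 1) for s in skill]   # same skill, fresh luck

worst = sorted(range(n), key=lambda i: q1[i])[: n // 10]  # bottom 10% in Q1
q1_mean = sum(q1[i] for i in worst) / len(worst)
q2_mean = sum(q2[i] for i in worst) / len(worst)
print(f"bottom decile: Q1 mean {q1_mean:.2f} -> Q2 mean {q2_mean:.2f}")
```

A manager who reprimanded the bottom decile after quarter one would see this improvement and credit the reprimand; a control group of untouched low performers would improve just as much.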
Trap 7: Cherry-Picking Data
The Practice
Cherry-picking: Selecting data that supports conclusion, ignoring data that doesn't.
"It is easy to lie with statistics. It is hard to tell the truth without it." — Darrell Huff, How to Lie with Statistics
Forms of Cherry-Picking
| Type | Example |
|---|---|
| Time period | Show revenue growth starting after recession, ignoring pre-recession decline |
| Geography | Report successful regions, ignore failed regions |
| Metric selection | Report metrics that improved, ignore those that declined |
| Subgroup | "Drug works in women under 40" (after testing didn't find overall effect) |
The HARKing Problem
HARKing: Hypothesizing After Results are Known
Process:
- Explore data, find interesting pattern
- Construct hypothesis explaining pattern
- Present as if hypothesis preceded analysis
Problem: Overfits to noise; won't replicate.
Honest approach: Clearly label exploratory vs. confirmatory analysis.
Trap 8: Ignoring Effect Size
Statistical Significance ≠ Practical Importance
Statistical significance: Result unlikely due to chance
Effect size: How big is the difference?
Example: Weight Loss Drug
Trial results:
- Drug group: Average weight loss 1.5 pounds
- Placebo group: Average weight loss 1.0 pounds
- Difference: 0.5 pounds
- p < 0.001 (highly significant)
Interpretation:
- Statistically significant? Yes
- Practically meaningful? No (0.5 pound difference is trivial)
Why This Happens
Large sample sizes make small effects significant:
| Sample Size | Effect Needed for p < 0.05 |
|---|---|
| 10 per group | Large |
| 100 per group | Moderate |
| 10,000 per group | Tiny |
With big data: Everything becomes "significant," even trivial effects.
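Plugging the weight-loss numbers into a two-sample z-test (assuming, for illustration, a standard deviation of 10 pounds in each group) shows how sample size alone drives the p-value while Cohen's d stays trivially small.

```python
# Two-sided p-value for a two-sample z-test with equal group sizes
# and equal standard deviations. Fixed 0.5 lb difference throughout.
import math

def two_sample_p(diff, sd, n_per_group):
    z = diff / (sd * math.sqrt(2 / n_per_group))
    return math.erfc(z / math.sqrt(2))  # two-sided tail probability

diff, sd = 0.5, 10.0
cohens_d = diff / sd  # 0.05: far below the ~0.2 "small effect" rule of thumb
for n in (100, 1_000, 100_000):
    print(f"n={n:>7}: p = {two_sample_p(diff, sd, n):.4f}, d = {cohens_d}")
```

The same 0.5-pound difference is nowhere near significant at n = 100 and overwhelmingly significant at n = 100,000; only the effect size tells you it never mattered.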
Report Effect Sizes
| Metric | What It Shows |
|---|---|
| Cohen's d | Standardized mean difference |
| R² | % variance explained |
| Odds ratio | Relative risk |
| Absolute difference | Raw difference between groups |
Principle: Report both statistical significance and effect size.
Strategies for Honest Data Interpretation
Strategy 1: Pre-Specify Analysis
Before seeing data:
- State hypotheses
- Define metrics
- Specify statistical tests
- Set sample size
Prevents: HARKing, p-hacking, outcome switching
Strategy 2: Seek Disconfirmation
Instead of: "What data supports my hypothesis?"
Ask: "What evidence would prove me wrong?"
Actively look for:
- Contradicting data
- Alternative explanations
- Null results
Strategy 3: Use Blind Analysis
Technique: Analyze data without knowing which group is which.
Prevents: Unconscious bias in analysis decisions.
Strategy 4: Replicate
Internal replication:
- Split data: train set (explore), test set (confirm)
- Finding must hold in both
External replication:
- New data, new sample
- Strongest evidence
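The train/test idea can be sketched on pure noise: any predictor "found" in the exploratory half is a false positive, and the held-out half usually exposes it. The |r| > 1.96/√n cutoff is a rough p < 0.05 threshold for a correlation coefficient; all names and numbers here are illustrative.

```python
# Internal replication: explore on one half, confirm on the other.
# All 20 features are pure noise, so nothing should truly predict y.
import math
import random

random.seed(7)
n, n_features = 1_000, 20
X = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]

def corr(xs, ys):
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

def significant(rows, ys, j):
    r = corr([row[j] for row in rows], ys)
    return abs(r) > 1.96 / math.sqrt(len(rows))  # approximate p < 0.05

half = n // 2
X_train, y_train = X[:half], y[:half]
X_test, y_test = X[half:], y[half:]

candidates = [j for j in range(n_features) if significant(X_train, y_train, j)]
confirmed = [j for j in candidates if significant(X_test, y_test, j)]
print(f"found in train: {candidates}, confirmed in test: {confirmed}")
```

Each noise feature that clears the train-half threshold has only about a 5% chance of clearing it again in the held-out half, which is exactly the filtering the split is meant to provide.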
Strategy 5: Consider Alternative Explanations
For any finding, ask:
| Question | Why It Matters |
|---|---|
| Could it be chance? | Statistical significance doesn't mean it's real |
| Could it be confounding? | Third variable causing both? |
| Could it be reverse causation? | Y causing X instead of X causing Y? |
| Could it be selection bias? | Non-random sample? |
| Could it be measurement error? | Unreliable data? |
Strategy 6: Check Your Assumptions
Common assumptions:
| Assumption | How to Check |
|---|---|
| Random sampling | How was data collected? Representative? |
| No missing data issues | Is missingness random or systematic? |
| Measurement validity | Does metric actually measure construct? |
| Linearity | Relationship actually linear? |
| Independence | Observations truly independent? |
If assumptions violated: Conclusions may be invalid.
Red Flags You're Fooling Yourself
Warning Signs
| Red Flag | What It Suggests |
|---|---|
| Results perfectly match expectations | Confirmation bias or p-hacking |
| Analysis decisions made post-hoc | HARKing |
| Only looked for supporting evidence | Cherry-picking |
| Can't think of alternative explanations | Closed mindset |
| Results "too good to be true" | Probably are |
| Complex statistical techniques you don't understand | Hiding behind complexity |
| Can't explain finding to non-expert | Don't really understand it |
Building Epistemic Humility
Acknowledge Uncertainty
Avoid: "Data proves X"
Better: "Data suggests X, but limitations include..."
Components of honest reporting:
- Point estimate and confidence interval
- Effect size and significance
- Limitations and alternative explanations
- Assumptions made
The Bayesian Mindset
Update beliefs based on evidence, but:
- Stronger prior beliefs require stronger evidence to change
- Extraordinary claims require extraordinary evidence
- One study rarely settles questions definitively
Intellectual Honesty Practices
| Practice | How |
|---|---|
| Report null results | If hypothesis wasn't supported, say so |
| Disclose all analyses | Not just significant ones |
| Acknowledge limitations | Every study has weaknesses |
| Share data | Transparency enables scrutiny |
| Welcome criticism | Critique improves knowledge |
Practical Checklist: Before Drawing Conclusions
Ask yourself:
- Did I pre-specify my analysis, or decide after seeing results?
- Did I test multiple hypotheses? If so, did I correct for multiple comparisons?
- Could this correlation be explained by confounding, reverse causation, or coincidence?
- Did I consider the base rate?
- Is the effect size meaningful, not just statistically significant?
- Did I look for disconfirming evidence, or only confirming?
- Could this be regression to the mean?
- Am I analyzing survivors only, ignoring failures?
- Did I cherry-pick time periods, subgroups, or metrics?
- Can I explain alternative explanations for this finding?
- Would I believe this result if it contradicted my expectations?
If any answer is concerning: Revise analysis before drawing conclusions.
Conclusion: Eternal Vigilance
The uncomfortable truth: We're naturally bad at interpreting data objectively.
Cognitive biases aren't bugs you can fix. They're features of human cognition that require constant vigilance.
The antidotes:
- Pre-specification (prevents p-hacking)
- Seeking disconfirmation (counters confirmation bias)
- Considering alternatives (prevents premature closure)
- Replication (separates signal from noise)
- Epistemic humility (acknowledges limits)
Data doesn't speak for itself. We speak for it. The question is whether we're honest translators or motivated storytellers.
Choose honesty. It's harder. It's worth it.
What Research Shows About Data Interpretation and Cognitive Bias
The scientific literature on human data interpretation reveals a consistent and troubling pattern: trained professionals systematically misread data in predictable ways, and statistical education alone does not reliably correct these errors. Ziv Carmon and Dan Ariely at INSEAD and Duke University published a 2000 study in the Journal of Consumer Research measuring how analysts interpret the same dataset when framed as gains versus losses. Participants were significantly more likely to reach statistically unjustified positive conclusions when data was presented in gain framing, even when the underlying numbers were identical. The finding extended Kahneman and Tversky's classic prospect theory into applied data analysis contexts, demonstrating that the same cognitive asymmetry that distorts financial decision-making also distorts data interpretation by professionals who know they are being tested.
Uri Simonsohn, Joseph Simmons, and Leif Nelson at the University of Pennsylvania and Yale published their landmark 2011 study "False-Positive Psychology" in Psychological Science, documenting how researcher degrees of freedom -- the undisclosed flexibility analysts have in deciding how to collect and analyze data -- can produce false positive rates exceeding 60% even when nominal significance thresholds are set at 5%. They demonstrated this experimentally: using standard but undisclosed analytical flexibility, they produced a statistically significant result showing that listening to "When I'm Sixty-Four" made participants younger (as measured by their reported age). The absurdity of the finding illustrated the problem precisely. Their subsequent work led directly to the pre-registration movement; by 2022, the Open Science Framework had logged over 100,000 pre-registered studies, with pre-registered studies showing replication rates approximately 35 percentage points higher than non-pre-registered studies in the same fields.
Leidy Klotz at the University of Virginia's School of Engineering and Applied Science published research in Nature in 2021 examining a systematic bias in problem-solving that directly affects data analysis: the tendency to add rather than subtract. In a series of experiments spanning engineering problems, recipe modification, and essay editing, Klotz found that participants systematically overlooked subtractive solutions -- removing variables, eliminating confounders, simplifying models -- in favor of additive ones. Applied to data analysis, this bias manifests as the tendency to add more statistical controls, more subgroup analyses, and more variables rather than questioning whether the foundational measurement is valid. Klotz found that even when the optimal solution was subtractive, only 20% of participants spontaneously identified it without a prompt, compared to 60% when the possibility of subtraction was mentioned. The implication for data interpretation is that analysts systematically over-complicate models rather than questioning their basic assumptions.
The most systematic evidence on the gap between statistical training and statistical practice comes from Gerd Gigerenzer at the Max Planck Institute for Human Development in Berlin, who has studied statistical reasoning across medical, legal, and scientific professionals for over three decades. His 2015 book Risk Savvy and associated research in Psychological Science documented that physicians with graduate-level statistical training routinely misinterpret diagnostic test results, failing to apply base rate reasoning even when explicitly told the base rate. In one study of 160 German physicians, Gigerenzer found that when given the sensitivity and specificity of a mammography test along with the base rate of breast cancer, fewer than 20% correctly calculated the probability that a positive result indicated cancer. The majority overestimated the probability by a factor of 10 or more. The error is not ignorance of Bayes' theorem -- these physicians had learned it -- but failure to apply it spontaneously when interpreting real data. Gigerenzer's intervention research showed that presenting the same information in natural frequency formats (10 out of 1,000 women have cancer; 9 of those test positive; 99 of the 990 cancer-free women also test positive) rather than probability formats (sensitivity 90%, specificity 90%, prevalence 1%) increased correct reasoning from under 20% to over 75%.
Real-World Case Studies in Data Misinterpretation
The 2010 Gulf of Mexico Deepwater Horizon oil spill, which killed 11 workers and released approximately 4.9 million barrels of oil, illustrates how survivorship bias and motivated data interpretation can produce catastrophic misreading of safety evidence. A 2011 investigation by the National Commission on the BP Deepwater Horizon Oil Spill and Offshore Drilling documented that BP, Transocean, and Halliburton had each interpreted their respective safety test results through the lens of what the investigation called "normalization of deviance" -- the progressive reclassification of anomalous data as acceptable because previous anomalies had not produced visible failures. The negative pressure test conducted on April 20, 2010, the day of the explosion, produced readings that multiple engineers recognized as anomalous. Rather than halting operations, the team rationalized the anomalous readings as the result of the "bladder effect," a hypothesis that had no engineering basis and was inconsistent with the data. The investigation found that this pattern of post-hoc rationalization of anomalous safety data had occurred on at least 11 previous BP wells without producing visible failures -- creating a survivorship bias that made the interpretive errors appear validated by experience.
Google's early advertising measurement infrastructure provides a large-scale case study in the difference between statistical significance and practical effect size. During the period 2011-2013, Google ran over 12,000 advertising experiments annually, a practice documented by the company's chief economist Hal Varian and colleagues in a 2009 paper in The American Economic Review. Because of the scale of Google's experiment infrastructure -- many tests running on millions of users simultaneously -- nearly every test produced statistically significant results for multiple metrics. The problem, identified by Google's data science team and documented in subsequent publications, was that statistically significant effects on secondary metrics were frequently not practically meaningful. A 0.001% improvement in click-through rate would achieve p < 0.001 at Google's user scale but represented no meaningful business impact. The company had to develop a parallel evaluation framework based on minimum detectable effect thresholds rather than p-values, and implemented mandatory effect size reporting alongside significance testing for all major decisions by 2014. The case illustrates the p-value problem at industrial scale: sufficiently large samples make virtually any difference significant, while practical significance requires a separate assessment that standard null hypothesis testing does not provide.
The replication failures in priming research provide a documented case study in p-hacking and publication bias operating over an extended period. Social priming -- the finding that subtle contextual cues can significantly alter behavior -- was one of the most prolific research areas in social psychology from roughly 1995 to 2012, generating hundreds of published studies and widespread popular coverage. The most famous finding, John Bargh's 1996 study showing that exposure to words associated with elderly people caused people to walk more slowly, was cited over 3,000 times. When Stephane Doyen and colleagues at the Université Libre de Bruxelles conducted a pre-registered replication of the Bargh study in 2012, published in PLOS ONE, they found no effect. Subsequent large-scale replication attempts by the Open Science Collaboration and the Many Labs project found that social priming effects replicated at rates of approximately 14% across 28 tested effects. The failure was not random error; it was systematic. Because small sample studies with analyst flexibility can produce significant results for almost any hypothesis, the published literature had accumulated a large body of findings that were artifacts of analytical flexibility rather than real phenomena. The lesson: a large body of individually significant findings can collectively represent noise if the studies share the same methodological vulnerabilities.
The 2016 failure of polling models to predict Donald Trump's presidential victory in the United States illustrates base rate neglect and model overconfidence in high-stakes real-world data interpretation. Most major forecasting models assigned Trump a probability of 15-30% of winning on election eve, with the FiveThirtyEight model at 28.6% and the New York Times Upshot model at 15%. A 2017 analysis by Andrew Gelman at Columbia University's Applied Statistics Center, published in Statistics and Public Policy, identified several systematic errors in the models. First, models consistently underweighted the base rate of polling error, treating historical polling accuracy as if it were the expected case rather than the best case; in reality, polls in competitive elections had historically been off by margins sufficient to reverse the apparent leader in approximately 30% of close races. Second, models failed to account for correlated polling errors -- the fact that if polls were wrong in one competitive state, they were likely wrong in the same direction in other demographically similar states. The 2016 polling errors were in fact nationally correlated, as they were again, to an even greater degree, in 2020, but the forecasting models treated state-level polls as independent rather than correlated. Gelman estimated that accounting for correlated error would have produced substantially higher uncertainty estimates that better captured the actual likelihood of the outcome, demonstrating that ignoring the base rate of measurement error systematically understated true uncertainty.
References
Kunda, Z. (1990). "The Case for Motivated Reasoning." Psychological Bulletin, 108(3), 480–498.
Ioannidis, J. P. A. (2005). "Why Most Published Research Findings Are False." PLOS Medicine, 2(8), e124.
Open Science Collaboration. (2015). "Estimating the Reproducibility of Psychological Science." Science, 349(6251), aac4716.
Gelman, A., & Loken, E. (2014). "The Statistical Crisis in Science." American Scientist, 102(6), 460–465.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). "False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant." Psychological Science, 22(11), 1359–1366.
Kahneman, D., & Tversky, A. (1973). "On the Psychology of Prediction." Psychological Review, 80(4), 237–251.
Bickel, P. J., Hammel, E. A., & O'Connell, J. W. (1975). "Sex Bias in Graduate Admissions: Data from Berkeley." Science, 187(4175), 398–404.
Gigerenzer, G., & Hoffrage, U. (1995). "How to Improve Bayesian Reasoning Without Instruction: Frequency Formats." Psychological Review, 102(4), 684–704.
Ziliak, S. T., & McCloskey, D. N. (2008). The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. University of Michigan Press.
Pearl, J., & Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. Basic Books.
Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). "The Preregistration Revolution." Proceedings of the National Academy of Sciences, 115(11), 2600–2606.
Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.
Cohen, J. (1994). "The Earth Is Round (p < .05)." American Psychologist, 49(12), 997–1003.
Wasserstein, R. L., & Lazar, N. A. (2016). "The ASA Statement on p-Values: Context, Process, and Purpose." The American Statistician, 70(2), 129–133.
Nuzzo, R. (2014). "Statistical Errors." Nature, 506(7487), 150–152.
About This Series: This article is part of a larger exploration of measurement, metrics, and evaluation. For related concepts, see [Why Metrics Often Mislead], [Measurement Bias Explained], [Designing Useful Measurement Systems], and [Correlation Is Not Causation].
Frequently Asked Questions
What are common data interpretation mistakes?
Confusing correlation with causation, confirmation bias, p-hacking, ignoring base rates, cherry-picking data, and missing confounding variables.
How do you avoid confirmation bias in data analysis?
Actively seek disconfirming evidence, pre-specify analyses, use blind analysis when possible, and involve people with different hypotheses.
What is p-hacking?
P-hacking is manipulating analysis until you find statistical significance—trying multiple tests, dropping outliers, or stopping when results look good.
Why is correlation not causation?
Correlation shows variables move together, but doesn't prove one causes the other—could be reverse causation, coincidence, or a third variable.
What are confounding variables?
Confounding variables are hidden factors that influence both measured variables, creating false appearance of direct relationship.
How do you detect if you're fooling yourself?
Results confirm prior beliefs too perfectly, you only looked for supportive evidence, analysis decisions weren't pre-specified, or findings seem too clean.
What is statistical significance?
Statistical significance means results are unlikely due to chance alone—but doesn't mean they're large, important, or practically meaningful.
How do you improve data interpretation?
Learn basic statistics, pre-register analyses, seek alternative explanations, use replication, understand limitations, and maintain epistemic humility.