In 1854, physician John Snow persuaded local officials to remove the handle from the Broad Street water pump in London's Soho district. The cholera outbreak in the neighborhood had killed over 500 people in ten days. After the handle came off, the outbreak, already subsiding, came to an end.

Snow had traced the epidemic to contaminated water--not "bad air" (miasma), the prevailing medical theory of the era--by meticulously mapping deaths on a street grid and interviewing surviving family members about their water sources. He noticed that deaths clustered tightly around the Broad Street pump. He noticed that workers at a local brewery, who were provided beer as part of their wages and rarely drank the local water, had not died. He noticed that residents of a workhouse, which had its own well, had not died either. No p-values. No regression analysis. Just careful contextual interpretation of data that was visible to anyone--and that had been completely misunderstood by the medical establishment for weeks.

Today, organizations have access to more data in a single day than Snow could have examined in a lifetime. Yet misinterpretation remains epidemic. Analysts confuse correlation with causation, ignore sample sizes, cherry-pick favorable results, build conclusions on biased data, and present findings that crumble the moment someone asks a probing question. The data is often not the problem. The interpretation is.

John Snow didn't succeed because he had better data. He succeeded because he asked the right questions, questioned the prevailing assumptions, and followed the evidence where it led rather than where he wanted to go.

'John Snow did not have better data than the London medical establishment. He had better questions and no allegiance to the prevailing wrong answer. The willingness to follow the evidence where it leads, rather than where you want to go, is what separates useful data interpretation from expensive confirmation bias.' -- Charles Wheelan, author of 'Naked Statistics: Stripping the Dread from the Data' (2013)

The Interpretation Framework: Context Before Calculation

The most consequential mistakes in data interpretation are made before a single calculation is performed. Context shapes everything that follows. A number without context is not information--it is a fact suspended in air, ready to be grabbed and misused.

Before any analysis, five questions establish the context that makes interpretation possible.

How was this data collected? The collection method determines what the data can and cannot tell you. A voluntary customer satisfaction survey captures opinions of people who fill out surveys--not customers generally. People who respond to satisfaction surveys tend to have stronger feelings than the median customer (either quite satisfied or quite dissatisfied). Web analytics that rely on JavaScript tags miss the visitors who block trackers with browser extensions, have JavaScript disabled, or use certain privacy-focused mobile browsers. Server logs include bot traffic that analytics tools exclude. Each collection method introduces its own systematic inclusions and exclusions.

Example: In the 2016 US presidential election, pre-election polls--particularly state-level polls in the upper Midwest--predicted a Clinton victory by margins that turned out to be substantially wrong. Post-election reviews, including the AAPOR evaluation, pointed to systematic sampling and weighting problems: Trump-leaning voters without college degrees were under-represented among respondents, and many state polls did not weight by education to correct for it. The statistical machinery was technically sound. The samples were systematically skewed. The data collection method introduced the bias that invalidated the conclusions.

What population does this data represent? Every dataset is a sample from some larger population. Understanding that population--and its precise boundaries--prevents overgeneralization. A study of user behavior among Stanford computer science students tells you about Stanford CS students. It might, with caution, generalize to students at similar elite engineering programs. It cannot be generalized to programmers at large, technology workers broadly, or knowledge workers generally without substantial additional validation.

This error is pervasive in business analytics. A company surveys its most active users about feature preferences and uses the results to guide product decisions for all users. Active users are not representative of typical users--they interact with the product more, understand it more deeply, and have different needs and preferences. The conclusions drawn from active users will systematically fail to serve typical users.

What is missing from this data? The most important question in data interpretation. Missing data is almost never random. People who cancel subscriptions rarely fill out exit surveys (they have already made their decision and have no incentive to explain it). Patients who die are excluded from recovery statistics. Companies that fail disappear from industry benchmarks, leaving only survivors whose characteristics are systematically different. Customers who are sufficiently dissatisfied have already churned and are absent from customer satisfaction surveys.

This systematic pattern--where the people and events most relevant to a question are also the ones most likely to be missing from the data--is called survivorship bias, and it produces analyses that look at partial evidence and draw complete conclusions.

What timeframe does this data cover? A company showing 40% revenue growth is impressive unless the comparison period was a COVID lockdown trough following a 60% decline. A product showing declining engagement this quarter may be seasonal. A metric improving month-over-month may be reversing a longer-term deteriorating trend. The time period selected for analysis dramatically affects interpretation, and the choice of start and end points is a design decision that introduces potential bias.

What external factors might influence these numbers? Data does not exist in a vacuum. A spike in customer support tickets might reflect a product bug, a spike in new users unfamiliar with the product, a PR crisis driving concerned existing users to reach out, or a viral moment bringing in new users with different needs. A week of unusually high revenue might coincide with a competitor's service outage. Attribution of changes in metrics to internal causes without examining external context produces confident-sounding wrong explanations.

Interpretation Error | Description | Diagnostic Question
Correlation as causation | Assuming co-movement implies a causal relationship | What third factor could drive both variables?
Survivorship bias | Drawing conclusions from only the survivors in a dataset | Who or what is missing from this data?
Overgeneralization | Applying conclusions from a specific sample to a broader population | What population does this sample actually represent?
Temporal cherry-picking | Selecting start/end dates that favor a preferred narrative | What does the full available time series show?
Ecological fallacy | Inferring individual behavior from aggregate-level patterns | Does the aggregate relationship hold at the individual level?
Omitted variable bias | Ignoring a confounding factor that drives the observed relationship | What external factors might influence these numbers?

Correlation, Causation, and the Dangerous Space Between

The distinction between correlation and causation is the single most important concept in data interpretation. It is taught in every introductory statistics course. It is violated in virtually every business analytics meeting.

Correlation means two variables move together. Causation means one variable causes the other to change. These are fundamentally different claims with fundamentally different implications for decision-making.

The Three Mechanisms of Spurious Correlation

Confounding variables are the most common source of false causal claims. Both observed variables are driven by a third, unobserved factor that creates the appearance of a direct relationship between the two.

Classic example: cities with more police officers have higher crime rates. Nobody seriously argues that hiring police causes crime. Both are driven by population density and urbanization: larger cities have more crime and more police. The confounding variable (city size) drives both observed variables.

The business equivalent is just as common but less obvious. A consumer goods company found that customers who used its premium subscription had much higher lifetime value than those who did not. The team concluded that premium subscription features caused greater loyalty and invested heavily in them. Later analysis told a different story: premium subscribers were already highly engaged, higher-income users before they subscribed. The premium subscription was a symptom of existing engagement, not the cause of it. Promoting the subscription to disengaged customers did not produce the engagement the original analysis had implied.

Reverse causation means the assumed direction of causality is backwards. Companies with strong cultures have higher revenue. But does culture cause revenue? Or does financial success--and the reduced stress, better compensation, and greater opportunities that come with it--enable investment in cultural development? Likely both, but the direction of the primary causal arrow matters enormously for what interventions will be effective.

This error is particularly dangerous in HR and management analytics. Employees at high-performing companies are more engaged, by virtually every measure. Copying the practices of high-performing companies to improve engagement will produce disappointing results if the engagement flows primarily from being at a successful company rather than from the specific practices.

Selection effects create artificial correlations--or mask real ones--through the filtering process that determines who or what appears in the dataset. Among NBA basketball players, height and skill as measured by performance statistics are not strongly positively correlated. This seems counterintuitive--taller players have obvious advantages in basketball. But the selection process for the NBA filters out short players unless they are extremely skilled, so the short players who make the NBA are extraordinary outliers in skill. Within the selected population, that filtering distorts the height-skill relationship, weakening or even reversing the correlation that holds in the general population.

In business: the characteristics that get startups funded by top-tier venture capital firms are not the same characteristics that make them succeed. The selection process that produces funded companies is correlated with, but not identical to, the process that produces successful companies. Studies of characteristics shared by successful VC-backed companies often confuse characteristics that attract VC attention with characteristics that drive business success.

Establishing Causation

When you need causal evidence--when the question is not "do these move together?" but "does this cause that?"--correlation is insufficient. True causal inference requires:

  1. Temporal precedence: the cause must precede the effect in time
  2. Covariation: changes in the cause correspond to changes in the effect
  3. Elimination of alternatives: plausible confounding explanations are ruled out

The gold standard is the randomized controlled trial (RCT): randomly assign units (users, customers, locations) to treatment and control groups, apply the intervention to the treatment group only, and measure the difference. Random assignment ensures that any differences in outcomes are attributable to the intervention rather than to pre-existing differences between groups. A/B testing in software products is the business application of RCT methodology.
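As a concrete illustration, here is a minimal sketch of how such a test's result might be evaluated with a standard two-proportion z-test. The visitor and conversion counts are invented, and the helper function is hand-rolled for illustration rather than taken from any particular library.

```python
# Minimal two-proportion z-test for a hypothetical A/B test.
# All counts are made up for illustration.
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Return (absolute lift, z statistic, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under the null
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))                     # two-sided tail probability
    return p_b - p_a, z, p_value

# Hypothetical randomized experiment: control vs. treatment.
lift, z, p = two_proportion_ztest(conv_a=3_000, n_a=100_000,
                                  conv_b=3_250, n_b=100_000)
print(f"absolute lift = {lift:.4%}, z = {z:.2f}, p = {p:.4f}")
```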

When RCTs are not feasible, quasi-experimental methods provide weaker but useful causal evidence:

Difference-in-differences: compare the change in an outcome for an affected group before and after a change to the change for an unaffected group over the same period. Used extensively in economics to evaluate policy changes.
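The core arithmetic can be sketched directly; the retention rates below are hypothetical.

```python
# Difference-in-differences on hypothetical group means:
# (treated after - treated before) - (control after - control before).
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    treated_change = treated_after - treated_before
    control_change = control_after - control_before   # captures the shared time trend
    return treated_change - control_change            # change attributed to the intervention

# Hypothetical weekly retention rates around a pricing change.
effect = diff_in_diff(treated_before=0.42, treated_after=0.47,
                      control_before=0.41, control_after=0.43)
print(f"estimated effect: {effect:+.2%}")   # prints +3.00%
```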

Regression discontinuity: when treatment is assigned based on a threshold (only customers above a certain value score receive a retention intervention, only students above a certain test score receive a scholarship), compare outcomes just above and just below the threshold. Units on either side of the cutoff are nearly identical on average, so the comparison approximates random assignment.
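In its simplest form the estimate is a local difference in means around the cutoff; the scores, outcomes, and threshold in this sketch are hypothetical, and real analyses typically fit local regressions rather than comparing raw averages.

```python
# Regression discontinuity as a simple local comparison around the cutoff.
# Data and threshold are made up for illustration.
def local_rd_estimate(records, cutoff, bandwidth):
    """records: (score, outcome) pairs; returns the jump in outcome at the cutoff."""
    below = [y for x, y in records if cutoff - bandwidth <= x < cutoff]
    above = [y for x, y in records if cutoff <= x <= cutoff + bandwidth]
    return sum(above) / len(above) - sum(below) / len(below)

# Hypothetical: customers scored 0-100, intervention applied at scores >= 70.
records = [(62, 0.20), (66, 0.22), (68, 0.21), (69, 0.23),
           (70, 0.29), (72, 0.30), (75, 0.28), (78, 0.31)]
print(local_rd_estimate(records, cutoff=70, bandwidth=10))   # ~0.08 jump
```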

Instrumental variables: use a third variable (the instrument) that affects the treatment but does not independently affect the outcome, to isolate causal variation in the treatment.

None of these methods is as clean as a randomized experiment. All are better than observational correlation alone.

Statistical Significance: The Most Misunderstood Concept in Business

The p-value is the most misunderstood and misused statistic in business analytics. Understanding it correctly is not optional for anyone interpreting data.

A p-value is the probability of observing results as extreme as the data, or more extreme, assuming the null hypothesis is true. A p-value of 0.03 means: if there were truly no effect, we would see results this extreme about 3% of the time purely by chance.
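One way to make that definition concrete is to simulate the null hypothesis directly and count how often chance alone produces a result as extreme as the one observed. The rates and sample sizes below are made up.

```python
# What a p-value measures: the fraction of datasets generated under the null
# hypothesis (no real difference) that look at least as extreme as the observed data.
import random

random.seed(0)
n_per_group = 200
observed_diff = 0.06          # hypothetical observed difference in conversion rate
null_rate = 0.10              # conversion rate if the two variants are truly identical

simulations = 10_000
more_extreme = 0
for _ in range(simulations):
    a = sum(random.random() < null_rate for _ in range(n_per_group)) / n_per_group
    b = sum(random.random() < null_rate for _ in range(n_per_group)) / n_per_group
    if abs(a - b) >= observed_diff:
        more_extreme += 1

print(f"simulated p-value: {more_extreme / simulations:.3f}")
```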

What p-values do NOT mean:

  • The probability that the null hypothesis is true (a common, dangerous misinterpretation)
  • The probability that the result is real or will replicate in independent data
  • That the effect is meaningful, important, or worth acting on
  • That the result is practically significant for the business

Statistical vs. Practical Significance

Statistical significance is not practical significance. With a large enough sample, trivially small effects become statistically significant. With a small enough sample, practically large effects may not reach statistical significance.

Example: An e-commerce company with 10 million visitors per variant on an A/B test detects a 0.003 percentage point improvement in conversion rate with p < 0.001. Statistically significant? Yes, definitively. The evidence that the effect is non-zero is overwhelming. Practically meaningful? Almost certainly not. If conversion is currently 3.0%, a 0.003-point improvement takes it to 3.003%. On 10 million visitors, this might represent a few hundred additional conversions. The engineering time to implement the change likely exceeds the lifetime value of those conversions.

The correct response is to report the effect size (a 0.003 percentage point absolute improvement, roughly 0.1% relative) alongside the p-value and to evaluate practical significance independently of statistical significance. A 95% confidence interval around the estimated lift communicates how precisely the effect has been measured.
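A minimal sketch of that reporting practice, using made-up conversion counts:

```python
# Report the effect size and a 95% confidence interval alongside the p-value.
# Counts are made up: roughly 5.00% vs 5.05% conversion on 2M visitors per variant.
from math import sqrt
from scipy.stats import norm

n_a = n_b = 2_000_000
p_a, p_b = 100_000 / n_a, 101_000 / n_b

diff = p_b - p_a                                   # absolute effect size
rel_lift = diff / p_a                              # relative lift
se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se
p_value = 2 * norm.sf(abs(diff / se))

print(f"absolute lift {diff:.4%} (95% CI {ci_low:.4%} to {ci_high:.4%})")
print(f"relative lift {rel_lift:.1%}, p = {p_value:.3f}")
# Whether a lift of this size justifies the engineering cost is a business
# judgment that the p-value alone cannot answer.
```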

Conversely: an A/B test showing 22% improvement in conversion rate with p = 0.12 (50 visitors per variant) is not statistically significant but may represent a genuinely large, practically important effect that the sample was too small to detect reliably. The correct response is to run a larger test, not to dismiss the finding.

The Multiple Comparison Problem

Every statistical test has a false positive rate--the probability of detecting a "significant" effect when no real effect exists. At a 5% significance threshold, running 20 independent tests produces on average one false positive even when all null hypotheses are true.

Business analysts frequently run many tests and report the significant ones. An analyst who tests 10 different audience segments, 5 different time windows, 3 different outcome metrics, and 4 different statistical approaches has tested effectively 600 combinations. The probability that at least one produces a "significant" result is nearly 100% even if nothing is actually different.

Prevention requires specifying in advance which test is the primary test, adjusting significance thresholds for the number of comparisons (Bonferroni correction divides the threshold by the number of tests), or treating any unplanned exploratory finding as a hypothesis requiring independent confirmation rather than a conclusion.
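The inflation, and the Bonferroni adjustment, can be checked with a few lines of arithmetic:

```python
# Family-wise false positive rate: probability of at least one "significant"
# result across k independent tests when every null hypothesis is true.
alpha = 0.05

for k in (1, 20, 600):
    fwer = 1 - (1 - alpha) ** k
    print(f"{k:>4} tests: P(at least one false positive) = {fwer:.3f}")
# 1 test: 0.050, 20 tests: 0.642, 600 tests: ~1.000

# Bonferroni correction: divide the significance threshold by the number of tests,
# which caps the family-wise false positive rate at roughly alpha.
k = 600
print(f"Bonferroni threshold for {k} tests: {alpha / k:.2e}")
```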

Mean vs. Median: Choosing the Right Summary Statistic

The arithmetic mean (average) is the most commonly reported summary statistic in business. It is also the most easily misrepresented, because it is not robust to outliers.

Example: Ten employees earn $55,000 per year. Mean salary: $55,000. Median salary: $55,000. Now the CEO earns $5,000,000. Mean salary: approximately $500,000. Median salary: $55,000. The mean has been transformed by a single outlier into a number that describes no actual employee's experience.

When distributions are skewed--when a long tail in one direction pulls the mean toward extreme values--the median is a more representative description of the typical case. Most real-world distributions in business are right-skewed: a few very large customers, a few extremely long support calls, a few unusually high-value transactions. The mean of a right-skewed distribution is typically higher than the median, and often much higher.

Report median and percentiles for skewed data. Revenue per customer: report the median and the 75th, 90th, and 95th percentiles. Response time: report the 50th percentile (median), the 90th percentile, and the 99th percentile. Percentile reporting is common in engineering and SRE contexts precisely because it reveals the distribution of experience rather than hiding it in an average.
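A small sketch of percentile-based reporting on a made-up, right-skewed revenue sample:

```python
# Summarizing a right-skewed distribution: the mean is pulled up by the tail,
# while the median and upper percentiles describe actual customer experience.
# Revenue figures are made up for illustration.
import numpy as np

revenue_per_customer = np.array(
    [12, 15, 18, 22, 25, 27, 30, 34, 40, 55, 80, 120, 400, 2500]
)

print("mean:  ", round(revenue_per_customer.mean(), 1))      # dominated by the tail
print("median:", np.percentile(revenue_per_customer, 50))
for pct in (75, 90, 95):
    print(f"p{pct}:   ", np.percentile(revenue_per_customer, pct))
```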

Other average-related traps:

The ecological fallacy: drawing conclusions about individuals from group-level data. The average income in a wealthy ZIP code does not mean every resident is wealthy. High-income and low-income households coexist in most geographic areas. Policy decisions or marketing targeting based on area-level averages will systematically mischaracterize individual-level variation.

Aggregation hiding variation: a restaurant that averages 100 covers per day may serve 150 on Friday and Saturday and 60 on Tuesday. Staffing to the daily average ensures chronic understaffing on peak days and overstaffing on slow days. The right decision-making data is the distribution, not the mean.

Unweighted averaging: averaging percentages across groups of unequal size produces misleading results. Averaging a 60% conversion rate for a 1,000-visit campaign with a 30% conversion rate for a 100,000-visit campaign gives 45%--but the true blended rate is approximately 30.3%, much closer to the larger campaign's rate.
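The campaign example above, worked through directly:

```python
# Unweighted vs. weighted average of conversion rates across unequal campaigns.
campaigns = [
    {"visits": 1_000,   "conversions": 600},     # 60% conversion
    {"visits": 100_000, "conversions": 30_000},  # 30% conversion
]

rates = [c["conversions"] / c["visits"] for c in campaigns]
unweighted = sum(rates) / len(rates)             # treats both campaigns as equal
blended = (sum(c["conversions"] for c in campaigns)
           / sum(c["visits"] for c in campaigns))

print(f"unweighted average: {unweighted:.1%}")   # 45.0% -- misleading
print(f"true blended rate:  {blended:.1%}")      # ~30.3%
```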

Handling Missing Data

Missing data is present in virtually every real-world dataset. The way data is missing determines the appropriate handling strategy and the conclusions that can be drawn.

Missing Completely at Random (MCAR): the probability of missing data is unrelated to any observed or unobserved variable. A sensor fails due to a random hardware defect, creating random gaps in a time series. Dropping incomplete records introduces no systematic bias.

Missing at Random (MAR): missingness depends on observed variables but not on the missing value itself. Survey respondents under 35 skip income questions more frequently than respondents over 50. After accounting for age, income missingness is random. Statistical methods that adjust for the observed predictors of missingness can recover unbiased estimates.

Missing Not at Random (MNAR): missingness depends on the missing value itself. High earners refuse to report income specifically because of how much they earn. Patients with the worst outcomes drop out of clinical trials. This is the most dangerous and most common type in practice. Analysis that ignores MNAR produces biased estimates that are difficult to correct because the source of bias is by definition unobserved.

When reporting any analysis, the amount of missing data and the handling approach should be disclosed. "3% of records were excluded due to missing values" is insufficient. "3% of records had missing values in the primary outcome variable. These records were more likely to come from users in emerging markets (18% missing vs. 1% missing in North America), suggesting the results may not generalize to these markets" is appropriate transparency.
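A sketch of the kind of check that supports that disclosure, assuming a pandas DataFrame with a region column and a possibly missing outcome column (both column names are hypothetical placeholders):

```python
# Quantify missingness overall and by segment before deciding how to handle it.
# Column names ("region", "outcome") are hypothetical placeholders.
import pandas as pd

def missingness_report(df: pd.DataFrame, value_col: str, group_col: str) -> pd.Series:
    """Share of rows with a missing value_col, overall and per group."""
    overall = df[value_col].isna().mean()
    print(f"overall missing: {overall:.1%}")
    by_group = df.groupby(group_col)[value_col].apply(lambda s: s.isna().mean())
    return by_group.sort_values(ascending=False)

# Toy data for illustration.
df = pd.DataFrame({
    "region": ["north_america", "north_america", "emea", "emea", "apac", "apac"],
    "outcome": [1.2, 0.8, None, 0.9, None, None],
})
print(missingness_report(df, "outcome", "region"))
```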

Extrapolation and Its Limits

Extrapolation extends observed trends beyond the range of available data. It is seductive, frequently necessary, and reliably dangerous when applied without understanding the mechanism underlying the trend.

In 2007, a simple linear extrapolation of early smartphone adoption pointed toward near-universal penetration, and adoption in developed markets did eventually get close--though over roughly a decade rather than a few years. The same approach applied to social media growth in 2010 would have projected Facebook reaching 10 billion users by 2015--more users than there were people on Earth. The projection was wrong because growth rates decelerate as markets saturate.

Growth curves for products, populations, and technologies do not continue linearly or even exponentially indefinitely. They follow S-curves: slow initial growth, a period of acceleration, an inflection point, deceleration, and eventual saturation. Linear extrapolation of any single phase of an S-curve produces systematically wrong predictions.
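A sketch of the contrast, fitting both a straight line and a logistic (S-shaped) curve to made-up adoption data and projecting each forward:

```python
# Linear extrapolation vs. a logistic (S-curve) fit on made-up adoption data.
# The straight line keeps climbing forever; the logistic curve saturates.
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, capacity, growth_rate, midpoint):
    return capacity / (1 + np.exp(-growth_rate * (t - midpoint)))

years = np.arange(6)                                        # observed window
adoption = np.array([0.02, 0.05, 0.12, 0.25, 0.42, 0.58])   # fraction of the market

slope, intercept = np.polyfit(years, adoption, 1)           # naive linear trend
params, _ = curve_fit(logistic, years, adoption,
                      p0=[0.9, 1.0, 4.0],
                      bounds=([0, 0, 0], [1.0, 10, 20]))    # capacity capped at 100%

for t in (8, 12):
    print(f"year {t}: linear -> {slope * t + intercept:.0%}, "
          f"logistic -> {logistic(t, *params):.0%}")
# The linear projection soon exceeds 100% of the market;
# the logistic projection levels off near its fitted capacity.
```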

The test of a safe extrapolation is understanding why the trend exists. If you understand the mechanism--why early adopters are adopting at this rate, why the market has this growth potential, why retention behaves as observed--you can extrapolate with calibrated confidence, knowing which assumptions could break the projection. If you only observe that a trend exists without understanding why, extrapolation is speculation wearing the costume of analysis.

The Red Team Approach to Data Interpretation

Confirmation bias--the tendency to seek, interpret, and remember information that confirms existing beliefs--is the most pervasive cognitive error in business analytics. Analysts genuinely searching for the truth are still systematically affected by their prior expectations, the conclusions their team wants to find, and the career incentives attached to specific outcomes.

The red team approach addresses this structurally. Before making decisions based on any significant analysis, assign someone (ideally from outside the team that produced the analysis) to argue against the conclusion. Their job:

  • Find the simplest alternative explanation for the observed pattern
  • Identify missing data that might reverse the conclusion
  • Challenge whether the sample is representative of the relevant population
  • Verify that pre-specified analysis plans were followed
  • Check whether multiple tests were run and only significant ones reported
  • Ask whether the conclusion changes if you look at the full time series rather than the selected window

This adversarial review catches interpretation errors that collaborative analysis consistently misses, precisely because the reviewer's incentives are different.

Data interpretation is ultimately about intellectual honesty: the willingness to follow evidence to unwelcome conclusions, to acknowledge uncertainty when it exists, and to say "the data doesn't support a clear answer" when it doesn't. Organizations that cultivate this honesty make better decisions not because they have better data but because they read it more truthfully--in the tradition of John Snow mapping deaths on Broad Street, following the evidence to a conclusion that overturned decades of wrong medical consensus.

See also: Analytics Mistakes Explained, Measurement Bias Explained, Visualization Best Practices

What Research Shows About Data Misinterpretation

The systematic study of how organizations misinterpret data has produced sobering findings. A landmark body of work by Amos Tversky and Daniel Kahneman, beginning in the 1970s, documented the cognitive biases that lead intelligent people to draw wrong conclusions from statistical evidence. Their 1974 paper "Judgment Under Uncertainty: Heuristics and Biases," published in Science, established that even trained statisticians apply representativeness heuristics that produce systematic errors when interpreting data -- findings that Kahneman summarized for a general audience in his 2011 book Thinking, Fast and Slow. The relevance for business analytics is direct: the heuristics Tversky and Kahneman documented are particularly prevalent when analysts are under time pressure, when findings are presented to support a predetermined conclusion, and when base rates conflict with vivid anecdotal evidence.

Nate Silver, founder of FiveThirtyEight and author of The Signal and the Noise (2012), documented the specific misinterpretation failure mode that affects organizational forecasting: overconfidence in model accuracy. Silver reviewed decades of political forecasting, economic prediction, and sports analytics and found a consistent pattern -- experts using quantitative models were systematically more confident in their predictions than their accuracy warranted. The calibration problem (where a forecast of "70 percent probability" should be right approximately 70 percent of the time) was rarely satisfied in practice. Silver's proposed corrective -- Bayesian updating, where prior beliefs are revised as new evidence arrives -- has since been adopted as a best practice in analytics teams at organizations including Google, Netflix, and several major financial institutions.

Research specifically examining A/B test misinterpretation at technology companies has produced alarming findings. Ron Kohavi's team at Microsoft studied how analysts and product managers interpreted A/B test results and found systematic errors in the majority of cases. The most common: stopping tests early when results looked positive (which dramatically increases false positive rates due to the peeking problem), ignoring practical significance in favor of statistical significance, and failing to account for novelty effects (where users engage with a new feature initially simply because it is new, creating a temporary positive signal that disappears). Kohavi estimates that premature test stopping alone causes organizations to ship product changes that reduce user engagement in a substantial fraction of cases.

Philip Tetlock's multi-decade study of expert forecasting, summarized in his 2005 book Expert Political Judgment and expanded in his 2015 collaboration with Dan Gardner in Superforecasting, found that subject matter experts were generally no more accurate than informed laypeople when making quantitative predictions in their domain -- and in some cases were worse, because domain expertise often produced overconfidence. Tetlock's research identified a class of forecasters he called "superforecasters" who consistently outperformed experts: they shared a habit of decomposing complex questions into component probability estimates, explicitly tracking the accuracy of their past predictions, and aggressively updating beliefs when new data contradicted prior expectations. These habits -- decomposition, calibration tracking, and active updating -- are the operationalization of sound data interpretation principles.

Real-World Case Studies in Misinterpretation

The Space Shuttle Challenger: Misreading Risk Data. On January 27, 1986, engineers at Morton Thiokol argued against the Challenger launch on the grounds that O-ring performance degraded at low temperatures. NASA managers requested data. The engineers provided charts showing O-ring incidents at various launch temperatures. The managers reviewed the charts and saw no clear pattern. The launch proceeded. The shuttle disintegrated 73 seconds after launch.

The post-accident analysis by Richard Feynman and the Rogers Commission identified a critical interpretation failure: the engineers had only charted data from launches that experienced O-ring incidents, omitting the many successful launches with no incidents. When statistician and visualization expert Edward Tufte later reconstructed the analysis with all available data -- plotting O-ring damage against temperature for every previous launch, including the many with no damage -- the pattern was stark and unambiguous: damage probability increased sharply as temperature declined, and the forecast launch temperature of roughly 29 degrees Fahrenheit lay far below that of any previous launch (the coldest prior launch had been 53 degrees). The correct analysis had been available. The misinterpretation came from incomplete data presentation and from managers who read the absence of a pattern in incomplete data as evidence of no pattern in the complete data.

Google Flu Trends: The Overconfidence of Big Data. In 2008, Google launched Google Flu Trends, a system that predicted influenza activity in the United States using search query data. The system performed well initially, prompting widespread enthusiasm about "big data" approaches to public health monitoring. A 2009 paper in Nature by Jeremy Ginsberg and colleagues reported that Google Flu Trends estimates correlated with CDC-reported flu levels at roughly 0.97, and were available one to two weeks ahead of traditional surveillance reporting.

By 2013, the system was significantly overpredicting flu activity -- at one point estimating flu prevalence at roughly double the CDC's measured level. A 2014 analysis by David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani in Science identified the cause: Google had continuously updated its search algorithm, changing what search terms its system tracked without recalibrating the prediction model. The model had been trained on a specific relationship between search terms and flu rates that changed when search behavior and Google's own ranking system changed. The authors used this case to articulate the "big data hubris" failure mode -- the assumption that large data volume substitutes for careful model validation. The system was quietly discontinued in 2015.

Facebook's Emotional Contagion Experiment: Incomplete Population Disclosure. In 2014, Facebook published a study in the Proceedings of the National Academy of Sciences showing that experimental manipulation of the emotional valence of users' news feeds produced measurable effects on the emotional content of their own posts. The study was methodologically competent. The controversy arose from its interpretation: the study's results were presented as evidence that emotional contagion occurs on social media platforms, with the implication that the effects were substantial and generalizable. Critics including statistician Andrew Gelman noted that the effect sizes were tiny -- the study manipulated the feeds of 689,003 users and detected statistically significant but practically negligible changes in emotional expression. The sample size was large enough to detect effects too small to be behaviorally meaningful. The finding was real; the significance attributed to it was not. The case has been widely used in research methods courses to illustrate the distinction between statistical significance and practical significance in large-sample studies.

The 2008 Financial Crisis: Misinterpreting Risk Model Outputs. The widespread adoption of Value-at-Risk (VaR) models by major financial institutions in the years before 2008 provides perhaps the most consequential example of data misinterpretation in modern history. VaR models estimated the probability of large losses by fitting statistical distributions to historical return data. The models consistently implied that losses of crisis magnitude lay so far out in the tail as to be effectively impossible. Nassim Nicholas Taleb argued in his 2007 book The Black Swan (published, with notable timing, one year before the crisis) that VaR models systematically underestimated tail risk by fitting thin-tailed distributions to data that actually had fat tails, and by training on historical periods that excluded the most extreme historical events. When the crisis occurred, losses blew past the models' estimated worst cases at multiple major institutions simultaneously, and did so repeatedly -- a sequence of outcomes the models had treated as essentially impossible. The misinterpretation was not in the arithmetic; it was in treating model assumptions as descriptions of reality rather than as simplifications whose failure modes needed to be understood.

Common Mistakes and What Evidence Shows About Prevention

Mistake 1: Confusing Absence of Evidence with Evidence of Absence. The Challenger case exemplifies a failure documented in experimental psychology literature: humans systematically underweight non-events relative to events. An O-ring incident is visible and memorable; a launch with no incident produces no data point in an analyst's mental model. In business analytics, this manifests as over-indexing on customer complaints (visible) while ignoring satisfied customers (invisible), tracking conversion events while ignoring non-conversion, and analyzing churned customers while having no data on customers who considered churning but stayed. Nate Silver's prescription -- building null models that estimate what you would observe if there were no effect, and comparing observed data to the null -- forces analysts to confront what absence of evidence actually implies.

Mistake 2: Anchoring on Initial Analysis. Kahneman and Tversky documented anchoring -- the tendency to rely too heavily on the first piece of information encountered when making decisions -- as one of the most robust cognitive biases in quantitative judgment. In analytics, this manifests as treating the first analysis produced as the baseline that subsequent analyses must explain, rather than treating each analysis as one of many possible framings. Andrew Gelman, Columbia University professor of statistics and author of Regression and Other Stories, advocates for "forking paths" analysis -- explicitly considering which analytical choices could have been made differently and whether the conclusion changes under alternative reasonable choices. This approach, related to the "multiverse analysis" methodology developed in psychology replication research, makes analytical fragility visible rather than hiding it in a single reported result.

Mistake 3: Simpson's Paradox in Aggregated Data. Simpson's Paradox -- where a trend present in each subgroup of data reverses when the groups are combined -- is documented in statistical literature going back to E.H. Simpson's 1951 paper in the Journal of the Royal Statistical Society. It appears with surprising frequency in business analytics, particularly in conversion rate analysis, clinical trial interpretation, and educational performance comparisons. The famous UC Berkeley gender bias study (Bickel, Hammel, and O'Connell, 1975) found that aggregate admissions data appeared to show bias against women, but when broken down by department, most departments showed slight bias toward admitting women -- the aggregate pattern arose because women disproportionately applied to more competitive departments. Prevention requires disaggregating data to the level at which decisions are actually made before drawing conclusions from aggregate statistics.
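A minimal sketch of the reversal, with made-up conversion counts:

```python
# Simpson's paradox with made-up conversion counts: variant B wins in every
# segment, yet variant A looks better in the aggregate because the two variants
# were exposed to very different mixes of traffic.
data = {
    # segment: {variant: (conversions, visitors)}
    "mobile":  {"A": (10, 100),  "B": (50, 400)},
    "desktop": {"A": (270, 900), "B": (20, 60)},
}

for segment, variants in data.items():
    rates = {v: f"{c / n:.1%}" for v, (c, n) in variants.items()}
    print(segment, rates)
# mobile:  A 10.0%, B 12.5%   (B better)
# desktop: A 30.0%, B 33.3%   (B better)

totals = {}
for variants in data.values():
    for v, (c, n) in variants.items():
        conv, vis = totals.get(v, (0, 0))
        totals[v] = (conv + c, vis + n)

print({v: f"{c / n:.1%}" for v, (c, n) in totals.items()})
# aggregate: A 28.0%, B 15.2%  (A looks better -- the reversal)
```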

Frequently Asked Questions

What are the most common mistakes when interpreting data?

Common interpretation errors: (1) Confusing correlation with causation—assuming relationships are causal, (2) Cherry-picking data—selecting data that supports conclusions, ignoring contradictions, (3) Ignoring sample size—drawing conclusions from too little data, (4) Misunderstanding averages—using mean when median is more appropriate, (5) Ignoring context—analyzing data without understanding what it represents, (6) Survivorship bias—analyzing only successful cases, (7) Simpson's paradox—trends that appear in groups disappear when combined, (8) p-hacking—testing until you find significant results, (9) Extrapolating beyond data—assuming trends continue indefinitely, (10) Ignoring data quality—analyzing flawed data. These mistakes lead to incorrect conclusions and poor decisions despite having data.

What's the difference between correlation and causation, and why does it matter?

Correlation means two variables change together—when one increases, the other tends to increase (positive correlation) or decrease (negative correlation). Causation means one variable directly causes changes in another. Why this matters: correlated variables often have no causal relationship—both might be caused by a third factor, the relationship might be coincidental, or causation might go the opposite direction. Example: ice cream sales correlate with drowning deaths, but ice cream doesn't cause drowning—both increase in summer. Acting on correlation as if it's causation leads to ineffective interventions. Establishing causation requires: temporal precedence (cause before effect), controlled experiments, or strong theoretical models. Be skeptical of causal claims based only on observational correlations.

How do you account for context when interpreting data?

Context considerations: (1) How was data collected—sampling method affects interpretation, (2) What's being measured—understand definitions and calculation methods, (3) Time period—when data was collected matters, (4) External factors—what else was happening that might affect data, (5) Who/what is included—understand population represented, (6) Changes over time—data collection or definitions may have changed, (7) Industry or domain norms—what's normal in this context, (8) Statistical significance vs practical significance—is difference meaningful, (9) Comparison baselines—compared to what?, (10) Limitations and caveats—what doesn't this data tell us. Same data interpreted differently in different contexts. Always ask: what's the story behind these numbers?

What is statistical significance and why is it misunderstood?

Statistical significance measures how surprising an observed effect would be if there were no real effect. P-value < 0.05 is the common threshold (results this extreme would occur less than 5% of the time by chance alone if the null hypothesis were true). Misunderstandings: (1) Significant ≠ important—tiny effects can be statistically significant with large samples, (2) Not significant ≠ no effect—might lack statistical power to detect real effects, (3) P-values don't indicate effect size—only how incompatible the data are with the no-effect hypothesis, (4) Multiple testing inflates false positives—testing many hypotheses makes spurious significance likely, (5) P-hacking—trying analyses until finding significance. Proper interpretation: combine statistical significance with effect size, confidence intervals, and practical significance. Ask: is this effect large enough to matter, not just 'is it significant?'

How do you avoid cherry-picking data to support predetermined conclusions?

Avoiding cherry-picking: (1) Define analysis plan before looking at data—decide what you'll test and how, (2) Look at all data, not just convenient subsets, (3) Report all analyses conducted, not just significant ones, (4) Use holdout data—test conclusions on fresh data not used in analysis, (5) Seek contradictory evidence—actively look for data against your hypothesis, (6) Peer review—have others check your analysis, (7) Document decisions—explain why you filtered or transformed data, (8) Be transparent about limitations—acknowledge what data doesn't show. Intellectual honesty is crucial—goal is finding truth, not confirming what you want to believe. Confirmation bias is strong; systematic processes combat it. Pre-register analysis plans for important decisions.

What role does sample size play in data interpretation?

Sample size affects: (1) Statistical power—larger samples detect smaller effects, (2) Confidence intervals—larger samples give narrower ranges, (3) Reliability—small samples are noisy, patterns might be random, (4) Representativeness—larger samples more likely to represent population. Common mistakes: over-interpreting small samples (trends in 10 data points aren't reliable), ignoring that statistical significance requires adequate sample size, and assuming large samples make all results meaningful (even tiny, meaningless effects become significant). Rule of thumb: be very skeptical of conclusions from samples under 30; patterns become reliable around 100+; very large samples (10,000+) detect tiny effects that may not matter. Always report sample sizes alongside findings so others can judge reliability.

How should you handle missing or incomplete data in analysis?

Handling missing data: (1) Understand why data is missing—random vs systematic missingness matters, (2) Analyze complete cases—use only records with all needed data (reduces sample size), (3) Imputation—fill missing values with estimates (mean, median, or model predictions), (4) Missing indicators—create flag showing data was missing, (5) Multiple imputation—create several datasets with different imputations, (6) Sensitivity analysis—test if conclusions change based on assumptions about missing data. Never ignore missing data—it biases results. If data is missing non-randomly (systematically), analysis of complete cases is biased. Example: if high earners don't report income, average income from complete data underestimates true average. Always report how much data is missing and how you handled it.