In the early 2000s, Hewlett-Packard's HR analytics team discovered that employees who lived far from headquarters were significantly more likely to quit within their first year. The correlation was strong, consistent across years of data, and statistically significant. Management responded by changing recruiting practices to favor candidates who lived closer to the office.

Years later, a more careful analysis revealed the confounding variable they had missed: HP's campuses in certain regions offered less competitive compensation than local market rates. Employees in those areas both lived farther away (commuting from cheaper suburbs) and were more likely to leave (for better-paying local competitors). Proximity to the office had nothing to do with retention. The entire policy change was built on a statistical artifact.

This is what makes analytics mistakes so dangerous. They don't announce themselves. They arrive wearing the clothes of data-driven rigor, with correlation coefficients, confidence intervals, and beautifully formatted charts. The most destructive analytics errors are not miscalculations--they are correct calculations applied to the wrong question, correct observations explained by the wrong mechanism, or correct statistics used to mislead.

The most dangerous analytics mistakes are not obvious errors. They are subtle, systematic, and seductive--they produce results that look right, feel right, and are completely wrong.

Correlation Is Not Causation: The Most Pervasive Error

The single most widespread analytics mistake is treating correlation as causation. Two variables that move together do not necessarily have a causal relationship. This is taught in every introductory statistics class and violated in virtually every business analytics meeting.

Correlation without causation occurs through three primary mechanisms:

Confounding variables occur when a third factor drives both observed variables simultaneously. The classic example: ice cream sales and drowning deaths correlate strongly. Nobody believes ice cream causes drowning. Summer heat (the confound) increases both. Yet in business analytics, equally absurd causal claims go unchallenged because the confounding factor is less obvious--and because organizations have incentives to believe the causal story that justifies their preferred action.
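
To see the mechanism rather than just the slogan, here is a minimal simulation (coefficients and variable names are invented for illustration): two variables that never influence each other are both driven by a hidden confounder, and the pair correlates strongly until the confounder is controlled for.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hidden confounder -- think "summer heat" or "customer engagement".
confounder = rng.normal(size=n)

# Two variables that do NOT influence each other at all;
# each is driven only by the confounder plus independent noise.
ice_cream_sales = 2.0 * confounder + rng.normal(size=n)
drownings = 1.5 * confounder + rng.normal(size=n)

print("raw correlation:", np.corrcoef(ice_cream_sales, drownings)[0, 1])

# Controlling for the confounder: regress each variable on it and
# correlate the residuals. The association essentially disappears.
def residualize(y, x):
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

print("after controlling for the confounder:",
      np.corrcoef(residualize(ice_cream_sales, confounder),
                  residualize(drownings, confounder))[0, 1])
```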

Example: A consumer electronics company found that customers who registered their products online had dramatically higher customer lifetime value than those who didn't. The product team concluded that online registration caused loyalty and built an aggressive registration nudge campaign. Sales didn't move. The actual causal structure: engaged, tech-savvy customers both registered their products AND bought more products. Registration was a symptom of the underlying cause (customer engagement), not the cause itself. Prompting disengaged customers to register didn't make them engaged.

Reverse causation occurs when the assumed direction of causality is backwards. Companies that invest heavily in employee wellness programs tend to have healthier, more productive employees. But do wellness programs create health? Or do financially successful, low-stress organizations both invest in wellness programs (because they can afford to) and have healthier employees (because their work environment is better)? Observational data cannot distinguish these directions.

This mistake has cost the healthcare industry billions. Numerous observational studies showed that patients who adhered to their medication regimens had better health outcomes. The obvious conclusion: medication adherence improves health. Randomized trials repeatedly showed that the true causal structure was more complex--healthy, organized patients both adhered to medications and made other health-promoting choices. Simply getting patients to take pills didn't produce the outcome gains seen in the observational studies.

Spurious correlations from coincidence at scale are mathematically inevitable when enough variables are examined. Tyler Vigen's "Spurious Correlations" project documented that US spending on science, space, and technology correlates with suicides by hanging, suffocation, and strangulation at r = 0.998 from 1999 to 2009. Nicolas Cage film releases per year correlate with pool drownings (r = 0.666). The correlations are mathematically real; the causal relationships are nonexistent.

In business, the same phenomenon produces false signals whenever analysts explore large datasets without pre-specified hypotheses. Run enough correlations and you will find significant ones. The question is whether they mean anything.
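
The effect is easy to reproduce. The sketch below generates a few dozen completely random series--no real relationships anywhere--and scans every pair for "significant" correlations; roughly one pair in twenty clears the p < 0.05 bar by chance alone.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_vars, n_obs = 40, 30           # 40 unrelated "metrics", 30 observations each
data = rng.normal(size=(n_vars, n_obs))

hits = 0
pairs = 0
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        pairs += 1
        r, p = stats.pearsonr(data[i], data[j])
        if p < 0.05:
            hits += 1

print(f"{hits} of {pairs} pairs are 'significant' at p < 0.05 "
      f"({hits / pairs:.1%}) -- in data that is pure noise")
```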

In 2008, Google launched Google Flu Trends, predicting flu outbreaks based on search query volume. Initial results impressed everyone: the system predicted CDC reports two weeks ahead, with correlations that seemed to validate the approach. Google published results in Nature and positioned Flu Trends as a demonstration of big data's potential to transform public health.

By 2013, the model was predicting double the actual flu cases. What happened? Media coverage of flu seasons had increased over the years, driving more searches about flu that reflected anxiety rather than illness. The model had been trained on a period when search behavior and flu incidence moved together. When the relationship between concern about flu and actual flu cases changed, the model failed catastrophically.

The lesson extends far beyond public health analytics. Correlations that work as proxies are fragile. When you don't understand the causal mechanism connecting your proxy to the outcome you care about, you cannot predict when the proxy will fail. Google Flu Trends was quietly discontinued in 2015.

| Mistake | Core Error | Classic Example | Prevention |
|---|---|---|---|
| Correlation as causation | Inferring cause from co-movement | HP linking commute distance to attrition | Identify confounding variables |
| Survivorship bias | Sampling only successful outcomes | Studying startup winners, ignoring failures | Include failure data explicitly |
| P-hacking | Manipulating analysis until significance appears | Testing 20 subgroups, reporting the one that works | Pre-register analysis plans |
| Goodhart's Law | Optimizing the proxy instead of the outcome | Call centers cutting calls short to reduce AHT | Use complementary metrics |
| Simpson's Paradox | Aggregating groups with different compositions | Berkeley admissions appearing biased in aggregate | Analyze at multiple levels |
| Cherry-picking | Selecting data that confirms prior belief | Reporting only favorable metric from a feature launch | Report all pre-specified metrics |
| Base rate neglect | Ignoring population frequency | 99% accurate fraud model with 91% false positives | Always state base rates alongside accuracy |

Sample Size: The Silent Killer of Reliable Insights

Human brains are pattern-detection machines. Show someone a coin that lands heads three times in a row, and they construct narratives about biased coins. Three observations is enough for the brain to see a pattern. For statistics, it is nowhere near enough.

The Statistics of Small Samples

Common small-sample failures in business analytics:

A/B tests with insufficient traffic. A test showing a 20% conversion improvement with 50 visitors per variant is noise, not signal. At that sample size, a true conversion rate of 3% can easily produce an observed rate of 2% or 4% by random chance. The same test with 5,000 per variant produces genuinely reliable conclusions.

Customer surveys with low response rates. A survey sent to 10,000 customers that receives 200 responses captures the opinions of people who respond to surveys, not customers generally. Survey respondents are systematically different: they tend to have stronger feelings (either very satisfied or very dissatisfied), more time, and higher engagement with the product. The 200 responses may be statistically analyzable but represent a biased sample.

Quarterly reports based on small counts. "Enterprise sales increased 50% quarter-over-quarter" sounds impressive until you realize it went from 4 deals to 6. With sample sizes that small, the variance dominates the signal entirely.

Statistical power is the probability that a test correctly detects a real effect when one exists. Most business A/B tests are grossly underpowered.

To detect a 5% relative improvement in a 3% conversion rate with 80% power and 95% confidence, you need roughly 200,000 visitors per variant under the standard two-proportion calculation. Run the same test with 5,000 visitors per variant and your power is in the single digits--you will miss the vast majority of real effects that size. Most companies run tests with a fraction of adequate sample size and celebrate results that are statistically indistinguishable from random noise.
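
These figures can be checked directly; the sketch below uses statsmodels with the same assumed baseline (3%) and minimum detectable effect (5% relative), so the specific numbers are illustrative rather than a template for any particular product's traffic.

```python
# Sample size and power for a two-proportion A/B test (normal approximation).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.03                      # 3% conversion rate
lift = 0.05                          # 5% relative improvement
variant = baseline * (1 + lift)      # 3.15%

effect = proportion_effectsize(variant, baseline)   # Cohen's h
analysis = NormalIndPower()

# Visitors per variant needed for 80% power at alpha = 0.05 (two-sided)
n_required = analysis.solve_power(effect_size=effect, alpha=0.05,
                                  power=0.80, ratio=1.0,
                                  alternative='two-sided')
print(f"required per variant: {n_required:,.0f}")      # roughly 200,000

# Actual power if you only have 5,000 visitors per variant
power = analysis.solve_power(effect_size=effect, nobs1=5_000,
                             alpha=0.05, ratio=1.0,
                             alternative='two-sided')
print(f"power with 5,000 per variant: {power:.0%}")    # single digits
```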

Evan Miller, creator of a widely-used sample size calculator, demonstrated that the common practice of "peeking" at test results daily and stopping when p < 0.05 inflates false positive rates from the nominal 5% to 30% or higher. A test that appears to show a significant result after early stopping often shows no significance when completed at full sample size.
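
The inflation is easy to demonstrate by simulation. In the sketch below (arbitrary traffic numbers; both variants identical by construction), a null A/B test is "peeked at" after every daily batch and stopped at the first p < 0.05; the share of tests that ever stop early is the real false positive rate, and it lands far above the nominal 5%.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def two_prop_p(c1, n1, c2, n2):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_pool = (c1 + c2) / (n1 + n2)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    return 2 * norm.sf(abs((c1 / n1 - c2 / n2) / se))

def peeked_null_test(daily=500, days=20, p=0.03):
    """One null A/B test (no true difference), checked after every day.
    Returns True if any peek ever looks 'significant' at p < 0.05."""
    conv_a = conv_b = n = 0
    for _ in range(days):
        conv_a += rng.binomial(daily, p)
        conv_b += rng.binomial(daily, p)
        n += daily
        if two_prop_p(conv_a, n, conv_b, n) < 0.05:
            return True
    return False

sims = 2_000
false_positives = sum(peeked_null_test() for _ in range(sims))
print(f"false positive rate with daily peeking: {false_positives / sims:.1%}")
# A single fixed-horizon test would sit near 5%; peeking pushes it far higher.
```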

Practical sample size guidance:

  • Below 30 observations: statistical tests are unreliable; use only for initial exploration
  • 100-300 observations: basic pattern detection becomes possible
  • 1,000+ observations: subgroup analysis starts being meaningful
  • 10,000+ observations: small effect sizes become detectable--and here you must ask whether statistically significant effects are practically significant for the business

Survivorship Bias: Studying Only the Winners

During World War II, the US military examined bombers returning from combat missions to determine where to add armor. Bullet hole patterns clustered on wings, fuselage, and tail sections. The obvious conclusion: reinforce the areas that are getting hit.

Mathematician Abraham Wald at Columbia University's Statistical Research Group saw the flaw immediately. The military was studying only the planes that survived. The areas without bullet holes on returning aircraft were the areas where hits were fatal--planes hit there never returned. Wald recommended armoring the engines and cockpits, the areas showing the fewest hits on surviving aircraft.

This insight--that your sample is not drawn randomly from the population of all outcomes, only from the population of outcomes that make it into your dataset--pervades business analytics.

Startup success analysis. The venture capital industry has spent decades developing playbooks for identifying successful startups. These playbooks are invariably based on characteristics shared by successful companies. The problem: many of these characteristics are equally common among failed startups. Studying Y Combinator graduates who became unicorns ignores thousands with identical strategies, founders with similar backgrounds, and product ideas in comparable markets who failed anyway.

Mutual fund performance. Morningstar and similar services track active fund performance over time. But funds disappear from the data when they are shut down--which typically happens when performance is poor. The surviving funds show better average returns than the true average of all funds that ever existed. Investors who compare current funds to historical averages are comparing apples to a cherry-picked subset of apples.

Customer satisfaction surveys. Measuring satisfaction among current customers ignores customers who already churned. Dissatisfied customers leave; satisfied customers stay. The pool of remaining customers systematically skews satisfied, creating artificially positive survey results. Organizations that interpret these results as indicating overall product health are measuring the satisfaction of the people who haven't left yet.

Feature adoption metrics. When measuring how users interact with features they have specifically chosen to enable, you are measuring engaged users, not typical users. Features that look highly used among those who use them may be largely ignored by the broader user base.

Prevention requires explicitly asking: "What am I not seeing because it did not survive to appear in my dataset?" Include failure data in analyses. Track cohorts from their beginning, not just the surviving members. Compare your analytical sample to the full original population from which it was drawn.
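
A toy version of the mutual fund case above makes the bias concrete (every number below is invented): all funds draw from the same return distribution, the worst performers are closed each year, and the survivors' track record ends up looking better than the process that generated it.

```python
import numpy as np

rng = np.random.default_rng(7)
n_funds, n_years = 1_000, 10
true_mean, vol = 0.06, 0.15        # every fund draws from the SAME distribution

returns = rng.normal(true_mean, vol, size=(n_funds, n_years))
alive = np.ones(n_funds, dtype=bool)

for year in range(n_years):
    # Close the worst-performing ~10% of still-open funds each year.
    this_year = np.where(alive, returns[:, year], np.nan)
    cutoff = np.nanpercentile(this_year, 10)
    alive &= returns[:, year] > cutoff

print(f"{alive.sum()} of {n_funds} funds survive all {n_years} years")
print(f"true mean of the return process:        {true_mean:.2%}")
print(f"average return in survivors' histories: {returns[alive].mean():.2%}")
```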

Cherry-Picking and Confirmation Bias

Cherry-picking--selecting data that supports a predetermined conclusion while ignoring contradictory evidence--can be deliberate fraud or, more commonly, unconscious confirmation bias. Most cherry-picking in business analytics is the latter: analysts and decision-makers are not lying; they genuinely believe the story the data tells because they have assembled a dataset that confirms the story they wanted to be true.

Selective time periods are the most common form. A marketing team reports that website traffic increased 40% in Q3 without mentioning that traffic dropped 50% in Q2 due to a site migration, and the "increase" was partial recovery. A product team reports user engagement improving since a feature launch without acknowledging that the launch coincided with a competitor's service outage. The reported numbers are accurate; the picture they create is misleading.

Subgroup hunting transforms negative results into positive ones through multiple testing. An A/B test shows no significant overall effect. The analyst segments by age, gender, device type, geography, browser, and user tenure. After testing 20 subgroups, one shows significance at p < 0.05. By chance alone, at a 5% significance threshold, you expect one false positive for every 20 independent tests. This isn't analysis--it's finding noise and labeling it signal.

Selective metric reporting occurs when multiple metrics are tracked but only favorable ones are reported. A product team runs seven different measurements for a feature launch: time on page, click-through rate, conversion, retention at day 1, day 7, and day 30, and net promoter score. Six show no significant change. One shows a 12% improvement. The launch presentation highlights the single positive metric. The decision to continue the feature is made on the basis of a 1-in-7 success rate that could easily be chance.

The Texas Sharpshooter Fallacy

Named after the joke about a Texan who fires bullets into a barn wall and then paints a target around the tightest cluster, this fallacy describes finding patterns in random data and constructing narratives to explain them. The pattern is real; the explanation is post-hoc rationalization.

In analytics, this manifests as post-hoc storytelling. An unexpected spike appears in the data. The analyst constructs a plausible business story--a competitor had issues, a marketing campaign reached an unexpected audience, a seasonal effect aligned with a product change. The story feels compelling. It fits the data. It was never a hypothesis tested against pre-specified expectations; it was a coincidence that received a retroactive explanation.

Prevention requires separating exploration from confirmation: pre-register analysis plans before accessing data, report all metrics rather than only favorable ones, apply multiple comparison corrections (Bonferroni correction divides the significance threshold by the number of tests), use holdout validation by splitting data into exploration and confirmation sets, and require peer review of all consequential analyses before decision-making.
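
For the multiple-comparison piece specifically, the correction is mechanical. A minimal sketch using statsmodels, with fabricated p-values standing in for the twenty subgroup tests described above:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(3)

# Fabricated p-values standing in for 20 subgroup tests of a null experiment:
# nineteen unremarkable results plus the single "win" the analyst wants to report.
p_values = np.append(rng.uniform(0.06, 1.0, size=19), 0.03)

print("naive 'wins' at p < 0.05:", int(np.sum(p_values < 0.05)))

# Bonferroni and Benjamini-Hochberg account for having run 20 tests at once.
for method in ("bonferroni", "fdr_bh"):
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(f"{method}: {int(reject.sum())} subgroups survive correction")
```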

P-Hacking: Manufacturing Statistical Significance

P-hacking (also called data dredging or significance fishing) is the practice of manipulating analysis--often unconsciously--until statistically significant results appear. The name comes from the p-value, the probability of observing data at least as extreme as what was observed if the null hypothesis is true. A p-value below 0.05 is the conventional threshold for statistical significance.

Researchers at the University of Pennsylvania demonstrated p-hacking with a deliberately constructed study showing that listening to "When I'm Sixty-Four" by the Beatles made participants 1.5 years younger. The study was published in Psychological Science, a peer-reviewed journal, and was intentionally p-hacked to expose the problem.

Their techniques:

  1. Measured multiple dependent variables and reported only the one that reached significance
  2. Tested multiple conditions and reported only the significant comparison
  3. Progressively added covariates (father's age, mother's age) until significance appeared
  4. Stopped data collection when the p-value first dipped below 0.05

Each technique individually seems defensible. Combined systematically, they produced a "significant" result from data that contained no real effect at all.

The Replication Crisis

P-hacking is a primary driver of the replication crisis--the systematic failure of published scientific findings to be confirmed in independent replications. In 2015, the Open Science Collaboration attempted to replicate 100 published psychology studies. Only 36% produced significant results on replication. Similar failure rates have been documented in cancer biology, economics, nutrition science, and neuroscience.

Business analytics faces the same problem without the academic scrutiny. Companies make product decisions, marketing investments, and strategic choices based on "significant" A/B tests that were never replicated, never pre-registered, and often analyzed with multiple approaches until significance appeared. The decision is made. The product ships. Nobody checks whether the "winning" variant actually outperformed the control in subsequent measurements.

Prevention requires setting alpha levels, sample sizes, and analysis methods before collecting data; reporting effect sizes and confidence intervals alongside p-values; requiring independent replication on new data before major decisions; adopting Bayesian methods that are structurally more robust against stopping-rule violations; and creating a culture where null results are treated as valuable information, not failures.

Goodhart's Law: When Metrics Become Targets

Goodhart's Law states: "When a measure becomes a target, it ceases to be a good measure." British economist Charles Goodhart observed this in monetary policy in the 1970s. When central banks began targeting specific monetary aggregates, economic actors changed behavior to optimize those aggregates, destroying their usefulness as indicators.

The principle generalizes far beyond economics. Whenever an organization measures a proxy for something it cares about and optimizes the proxy, the proxy stops tracking the underlying thing.

Call center handle time. A telecommunications company measured call center performance by average handle time (AHT)--the time from call answer to call end. AHT is a reasonable proxy for efficiency when customer service is actually being provided. The company set targets and tied performance reviews to AHT. Agents responded by rushing calls, transferring customers unnecessarily, and in some cases ending calls prematurely. AHT improved. Customer satisfaction plummeted. Repeat call rates increased. The company was measuring efficiency; it optimized for the appearance of efficiency while destroying actual efficiency.

Lines of code. When IBM reportedly tracked programmer productivity by lines of code produced, developers wrote verbose, redundant code. A function that should be 10 lines became 50. Complex algorithms were replaced with simple-but-lengthy code. The metric improved; actual output declined.

Social media engagement. Facebook's internal research, revealed by whistleblower Frances Haugen in 2021, showed the company knew its engagement optimization was surfacing divisive and emotionally provocative content. Engagement was successfully maximized. So was the spread of misinformation, political polarization, and documented harm to teenagers' mental health. The metric was working. The thing the metric was supposed to measure--meaningful connections and positive user experience--was being destroyed.

Student test scores. Teaching to standardized tests improved scores while degrading broader learning, critical thinking, and genuine subject mastery. Donald Campbell, a social scientist, documented this pattern independently of Goodhart: "The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor." Campbell's Law is a sociological generalization of Goodhart's.

Prevention requires tracking complementary metrics that are harder to game simultaneously--pair engagement with satisfaction, pair velocity with quality, pair revenue with retention. Use input metrics alongside output metrics, since inputs are harder to fake than outcomes. Rotate metrics periodically to prevent gaming strategies from calcifying. Maintain qualitative human assessment alongside quantitative tracking.

Simpson's Paradox: When Aggregation Reverses the Trend

Simpson's Paradox occurs when a trend that appears in each of several groups of data reverses or disappears when the groups are combined into an aggregate. It is one of the most counterintuitive phenomena in statistics, and it appears regularly in real business data.

The classic case is the 1973 UC Berkeley graduate admissions analysis. Overall admission rates appeared to show:

  • Men: 44% admitted
  • Women: 35% admitted

This looked like clear gender discrimination. The university faced legal scrutiny. But when admissions were examined department by department, women were admitted at higher rates than men in most departments. The paradox arose because women applied disproportionately to highly competitive departments with low overall acceptance rates, while men applied in higher proportions to less competitive departments with higher acceptance rates. The aggregate comparison combined unequal-sized groups in a way that created a misleading overall picture.
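
The reversal is easy to reproduce in a few lines of pandas. The numbers below are invented to mirror the structure of the Berkeley case (two departments with very different selectivity and very different application mixes), not the actual 1973 data:

```python
import pandas as pd

# Hypothetical admissions data shaped like the Berkeley case (not the real figures).
df = pd.DataFrame(
    [
        # Engineering: 60-65% admit rates, men apply here in large numbers.
        ("Engineering", "men",   800, 480),
        ("Engineering", "women", 100,  65),
        # English: 10-15% admit rates, women apply here in large numbers.
        ("English",     "men",   200,  20),
        ("English",     "women", 900, 135),
    ],
    columns=["dept", "gender", "applicants", "admitted"],
)

# Within every department, women are admitted at a HIGHER rate...
by_dept = df.groupby(["dept", "gender"])[["admitted", "applicants"]].sum()
print(by_dept["admitted"] / by_dept["applicants"])

# ...yet the aggregate comparison shows women admitted at a LOWER rate,
# because they apply mostly to the department that rejects nearly everyone.
overall = df.groupby("gender")[["admitted", "applicants"]].sum()
print(overall["admitted"] / overall["applicants"])
```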

Business implications appear constantly:

An overall website conversion rate decline might mask improvement in every individual traffic channel--if the traffic mix shifted toward channels with structurally lower conversion rates, the aggregate rate falls even as each channel improves.

An analysis of salary differences by gender might show men receiving higher average raises across the whole company, while within every individual department women receive higher average raises--if more women were hired into lower-paying entry-level positions during the period, the aggregate comparison is dominated by compositional effects rather than pay equity.

A drug trial might show positive effects in every age subgroup but negative effects in the aggregate population--if the trial enrolled disproportionately many young participants in the treatment arm, and young participants have better baseline outcomes regardless of treatment.

Prevention requires examining data at multiple levels of aggregation and investigating whenever aggregate results tell a different story than disaggregated results. Understand the composition of your analytical groups before drawing conclusions. Apply causal reasoning, not just statistical pattern matching, to interpret what you find.

Base Rate Neglect: Ignoring Prior Probabilities

Base rate neglect (or base rate fallacy) occurs when specific information about an individual case is given too much weight relative to general information about the relevant population.

Example: A fraud detection system flags a transaction as potentially fraudulent with 99% accuracy. The base rate of fraud in the transaction population is 0.1%. In a dataset of 1 million transactions, 1,000 are truly fraudulent. The system correctly identifies 990 of them (99% true positive rate). It also flags 9,990 legitimate transactions as fraudulent (1% false positive rate applied to 999,000 legitimate transactions). Of the 10,980 transactions flagged, only 990 (9%) are actually fraudulent.
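
The same arithmetic can be packaged as a small function to see how precision collapses as the base rate falls, holding model quality fixed (the numbers match the illustrative fraud example above):

```python
def precision_given_base_rate(sensitivity, specificity, base_rate,
                              population=1_000_000):
    """Share of flagged cases that are true positives at a given base rate."""
    positives = population * base_rate
    negatives = population - positives
    true_positives = sensitivity * positives
    false_positives = (1 - specificity) * negatives
    return true_positives / (true_positives + false_positives)

# The fraud example from the text: 99% sensitivity, 99% specificity, 0.1% base rate.
print(f"{precision_given_base_rate(0.99, 0.99, 0.001):.0%} of flags are real fraud")

# The exact same model quality at different base rates:
for rate in (0.10, 0.01, 0.001):
    print(f"base rate {rate:>5.1%} -> precision "
          f"{precision_given_base_rate(0.99, 0.99, rate):.0%}")
```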

The 99% accuracy sounds impressive. The operational reality--that 91% of fraud alerts are false positives--is what the fraud investigation team actually experiences. Base rate neglect leads organizations to deploy systems that look accurate in headline terms but are operationally unworkable.

This plays out in business analytics whenever an impressive model metric is presented without reference to the base rate of the phenomenon being predicted. A churn model with 90% accuracy sounds good until you learn that only 5% of customers churn--a model that simply predicts "no churn" for everyone scores 95%, and the customers the model does flag may still be mostly non-churners.

The Ecological Fallacy

The ecological fallacy is drawing conclusions about individuals based on aggregate data. If counties with higher ice cream consumption have higher rates of skin cancer, it does not follow that individuals who eat more ice cream are more likely to get skin cancer. Both correlate at the aggregate level with summer sun exposure.

In business analytics, ecological fallacies appear when country-level or market-level data is used to make inferences about individual customer behavior. Markets with higher per-capita income tend to have higher average spending. This does not mean that within those markets, higher-income individuals spend more--the aggregate relationship may not hold at the individual level.

Understanding the appropriate level of analysis for each question prevents ecological fallacies from corrupting individual-level decisions.

'The most dangerous result in data analysis is a confident wrong answer. Statistical errors rarely announce themselves. They arrive dressed as rigor, complete with coefficients and confidence intervals, and they drive organizations to invest in things that do not work and stop doing things that do.' -- Ron Kohavi, former Head of Experimentation at Microsoft and Amazon, from 'Trustworthy Online Controlled Experiments' (2020)

Building Systematic Defenses Against Analytics Errors

Individual awareness is necessary but insufficient. Organizations need structural safeguards that catch errors before they drive decisions.

Analysis review checklist. Before any analysis informs a material decision:

  1. Is the sample representative of the population the decision applies to?
  2. Is the sample large enough to detect effects of the magnitude that matter to the business?
  3. Were analysis methods chosen before examining the results?
  4. Have we looked for evidence that contradicts our conclusion?
  5. Are we claiming causation, or only correlation?
  6. What confounding variables could explain the observed relationship?
  7. Are results practically significant, not only statistically significant?
  8. Would these findings likely replicate on independent data?
  9. Are we examining metrics in their appropriate context and time frame?
  10. Has someone outside the project, without a stake in the outcome, reviewed the analysis?

Pre-registration is the practice of documenting what you plan to test, how you plan to analyze it, and what you will consider a meaningful result--before you access the data. Pre-registration prevents post-hoc hypothesis generation and makes p-hacking structurally harder. Academic journals increasingly require pre-registration for clinical and social science research. Leading product analytics teams at companies including Netflix, Google, and Microsoft have adopted internal pre-registration processes for major experiments.
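
There is no standard tool for this inside most companies, but even a lightweight, version-controlled record enforces the discipline. A minimal sketch--field names are invented, not any particular team's template:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass(frozen=True)
class PreRegistration:
    """Analysis plan written down and committed before the data is touched."""
    experiment: str
    hypothesis: str
    primary_metric: str
    minimum_detectable_effect: float     # smallest relative lift considered meaningful
    alpha: float
    required_sample_per_variant: int
    planned_end_date: date
    secondary_metrics: list[str] = field(default_factory=list)
    planned_subgroups: list[str] = field(default_factory=list)  # anything else is exploratory

plan = PreRegistration(
    experiment="checkout-redesign-v2",
    hypothesis="Streamlined checkout increases purchase conversion",
    primary_metric="purchase_conversion",
    minimum_detectable_effect=0.05,
    alpha=0.05,
    required_sample_per_variant=200_000,
    planned_end_date=date(2025, 7, 1),
    secondary_metrics=["average_order_value", "day_30_retention"],
    planned_subgroups=["new_vs_returning"],
)
```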

Peer review of consequential analyses--review by someone not invested in a particular outcome--catches errors that analysts miss precisely because they are looking for confirmation rather than flaws. The reviewer's job is not to approve the analysis but to stress-test it: challenge the sample, question the causal inference, look for cherry-picking, and verify that the conclusion follows from the evidence.

Retrospective tracking closes the feedback loop. After a decision is made based on analytics, track whether the outcome matched the prediction. Did the A/B test winner actually perform better in production? Did the customer segment identified as high-value actually convert at the predicted rate? Organizations that systematically track prediction accuracy improve their analytical processes; organizations that don't perpetuate the same errors indefinitely.

Statistical literacy investment. The consumers of analytics--product managers, marketing leads, executives--make the final decisions. If they cannot identify p-hacking when they see it, cannot distinguish statistically significant from practically significant, and cannot ask the right questions about sample size and causal inference, the most technically rigorous analytics function in the company is ineffective. Training the consumers of analytics is at least as important as training the producers.

See also: Interpreting Data Correctly, Measurement Bias Explained, Dashboards That Actually Work

How Major Organizations Have Been Burned by Analytics Mistakes

The most instructive lessons in analytics errors come not from academic exercises but from real organizations making costly decisions on flawed analysis. These cases share a common structure: technically competent analysis, systematically wrong framing, and decisions that looked rational until the consequences arrived.

Sears Holdings and the Metrics That Accelerated Decline. Between 2008 and 2018, Sears Holdings--then the parent company of Sears and Kmart--operated under a management philosophy explicitly designed around data-driven performance measurement. CEO Eddie Lampert, a former hedge fund manager, restructured the company into approximately 30 competing business units, each with its own P&L accountability, and each measured on individual financial metrics. The intent was to surface which divisions were creating and destroying value. The result, documented in detail by journalist Lynn Cowan and in Shira Ovide's 2021 analysis in the New York Times, was Goodhart's Law at catastrophic scale: division heads optimized for their unit's metrics at the expense of the company's overall health. The appliance division refused to promote items that would benefit the clothing division's foot traffic. The tools division declined to share customer data with the auto services division. Each metric improved within its measured silo; the customer experience across all silos degraded. Lampert's analytics framework correctly measured individual division performance. It created incentives that destroyed the integrated consumer experience that retail depends on. Total Sears Holdings revenue fell from $53 billion in 2006 to under $17 billion by 2017, culminating in bankruptcy in 2018.

The Replication Crisis Reaches Business Psychology. The academic replication crisis--the systematic failure of published research findings to reproduce in independent studies--has direct business implications that most practitioners have not absorbed. Carol Dweck's "growth mindset" research, published in peer-reviewed journals and incorporated into employee training programs at hundreds of large companies, has faced significant replication challenges. A 2018 meta-analysis by Victoria Sisk and colleagues in Psychological Science examined 273 studies on growth mindset interventions and found effect sizes substantially smaller than the original research suggested, with many school-based interventions showing no significant effect. The consulting industry built training products around the original research; few adjusted their products when the replication evidence accumulated. Similarly, Amy Cuddy's "power posing" research--showing that adopting expansive postures increased testosterone and reduced cortisol, with implications for leadership presence training--failed to replicate in a 2015 study by Eva Ranehill and colleagues published in Psychological Science, with the hormonal effects specifically not reproducing.

Johnson and Johnson's Talc Litigation: When Sampling Creates Dangerous Blind Spots. In 2018, Reuters published a lengthy investigative report revealing that Johnson and Johnson's internal scientists had known since the 1970s that their talcum powder products occasionally tested positive for asbestos contamination. The company had conducted extensive testing--thousands of samples over decades--but the sampling methodology was designed primarily to confirm product safety rather than to detect rare contamination events. When contamination occurred in a small fraction of samples, the overall testing program showed a low contamination rate. The company interpreted this as evidence of acceptable product safety. Statistically, a low average rate of contamination across thousands of samples can coexist with occasional batches that are dangerously contaminated. The sampling approach answered "what fraction of samples show contamination?" but not "can any particular batch harm a consumer?" The distinction matters enormously when the harm from a contaminated batch is severe and irreversible. By 2021, J&J had faced over 38,000 talc-related lawsuits, ultimately leading the company to discontinue US talc-based baby powder sales in 2020.

What Controlled Experimentation Research Reveals About Common Errors

Ron Kohavi, who built and led experimentation programs at Amazon and Microsoft before publishing his synthesis, with co-authors Diane Tang of Google and Ya Xu of LinkedIn, in Trustworthy Online Controlled Experiments (Cambridge University Press, 2020), documented the specific error rates in business A/B testing with unusual empirical rigor. Across thousands of experiments run at Microsoft's Bing division, Kohavi's team found that only approximately one-third of ideas championed by senior product managers and executives produced statistically significant improvements when tested. Roughly one-third showed no effect. One-third produced measurable harm. This distribution--echoed by the experimentation teams at Google and LinkedIn--implies that organizational confidence in untested product changes is systematically overoptimistic, and that most organizations shipping features without controlled testing are degrading their products at roughly the same rate they're improving them.

Kohavi also documented the "novelty effect" as a widespread source of false positive A/B test results. When a new feature is introduced, early adopters and engaged users interact with it more than they will over time, simply because it is new and visually different. Tests that capture only the first week of behavior after a feature launch systematically overestimate the feature's long-term effect on engagement. Kohavi's prescription--running tests for at least two full weeks and preferably longer for features expected to show novelty effects--is not universally followed, meaning that the business literature on A/B testing success is likely inflated by novelty-contaminated results.

The statistical phenomenon of "regression to the mean" produces what Kohavi terms "ghost wins"--features that appear to win in a test and show degraded performance in subsequent months. If a test variant performed unusually well in the test period due to random variation favorable to the variant, post-ship measurements will show the variant underperforming its test results as the metric regresses toward its true underlying mean. Organizations that run A/B tests but do not track post-ship performance against test predictions cannot distinguish genuine improvement from ghost wins. Kohavi estimates that 10-20% of "winning" variants at typical technology companies are ghost wins that would be identified as such with six months of post-ship monitoring.
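
A small simulation (all parameters invented) shows the mechanism: when most ideas have little true effect, the variants that clear a significance bar in a noisy test are disproportionately the lucky ones, and their measured lift regresses once the luck washes out.

```python
import numpy as np

rng = np.random.default_rng(11)

n_ideas = 2_000
true_lift = rng.normal(0.00, 0.01, size=n_ideas)   # most ideas do roughly nothing
test_noise = 0.02                                   # std. error of one test's estimate

measured = true_lift + rng.normal(0, test_noise, size=n_ideas)
shipped = measured > 1.96 * test_noise              # "winners": estimate clears ~2 sigma

print(f"shipped {shipped.sum()} of {n_ideas} variants")
print(f"average lift claimed by the tests:   {measured[shipped].mean():+.2%}")
print(f"average true lift of those variants: {true_lift[shipped].mean():+.2%}")
# Post-ship measurements converge toward the true lift, so the "wins" appear
# to degrade after launch -- that gap is the ghost win.
```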

Frequently Asked Questions

What are the most common mistakes analysts make?

Top mistakes: (1) Confusing correlation with causation, (2) Ignoring sample size—drawing conclusions from too little data, (3) Cherry-picking data—selecting data supporting conclusions, (4) Incorrect aggregation—using wrong summary statistic (mean vs median), (5) Overlooking data quality issues, (6) Ignoring context—analyzing data without understanding what it represents, (7) Survivorship bias—only analyzing successful cases, (8) P-hacking—testing until finding significance, (9) Missing confounding variables, (10) Extrapolating beyond data—assuming patterns continue indefinitely. These mistakes persist because: they're easy to make, they often support desired conclusions, statistical training is limited, and pressure exists to find positive results. Good analysts actively work against these biases rather than assuming they're immune.

Why is ignoring sample size such a common mistake?

Sample size mistakes happen because: small samples feel like enough, people don't understand statistical power, and early patterns seem convincing. Problems with small samples: (1) Random variation looks like patterns, (2) Confidence intervals are wide, (3) Statistical tests lack power to detect real effects, (4) Outliers have disproportionate influence, (5) Subgroup analyses become meaningless. Example: A/B test with 50 users per variant showing 20% difference might just be noise; same difference with 5,000 users is likely real. Guidelines: minimum 30 for basic analyses, hundreds for reliable conclusions, thousands for A/B tests detecting small effects. Before analyzing, calculate required sample size for your question. If you don't have enough data, acknowledge conclusions are preliminary. Underpowered analyses waste time generating unreliable insights.

How does survivorship bias specifically mislead analysts?

Survivorship bias occurs when analyzing only entities that survived selection process. Misleading because: (1) Success factors appear clearer than they are—you're missing data about failures with same factors, (2) Risks are underestimated—dangerous strategies can produce survivors, (3) Advice becomes overly optimistic—'do what survivors did' ignores that many did same and failed, (4) Patterns are spurious—surviving entities might just be lucky. Examples: analyzing successful startup strategies without considering failed startups with identical strategies, studying traits of centenarians without comparing to those who died earlier, examining performance of active mutual funds without including closed funds. Prevention: explicitly seek data on failures/non-survivors, understand selection process creating your sample, be skeptical of success-only analyses, and consider base rates. Always ask: what am I not seeing because it didn't survive?

What is p-hacking and why is it problematic?

P-hacking (also: data dredging, fishing, significance mining) is trying multiple analyses until finding statistically significant results, then reporting only significant findings. Techniques: testing many hypotheses, adding/removing data points until significance emerges, trying different analysis methods, selective reporting of outcomes, flexible data collection (stop when significant). Why problematic: (1) Inflates false positive rate—5% significance level means 1 in 20 tests will be 'significant' by chance; test 20 things, you'll find something, (2) Results don't replicate, (3) Science and business decisions based on noise, (4) Destroys trust when discovered. Prevention: pre-register analysis plans, report all tests conducted, adjust significance thresholds for multiple testing, use holdout data for validation, prioritize effect sizes over p-values. Ethical issue: p-hacking is often unintentional—analysts don't realize they're doing it, which makes it more insidious.

How do analysts mistake noise for signal?

Mistaking noise for patterns happens through: (1) Small samples—random variation looks like trends, (2) Multiple testing—try enough analyses, you'll find 'significant' results by chance, (3) Post-hoc storytelling—creating narratives explaining random patterns, (4) Confirmation bias—noticing patterns confirming expectations, (5) Clustering illusion—seeing patterns in random data. Example: stock picker having three good years might just be lucky (survivorship bias + small sample), not skilled. Prevention: (1) Require adequate sample sizes, (2) Use statistical tests appropriately, (3) Replicate findings on new data, (4) Consider whether pattern is practically significant not just statistically, (5) Beware of patterns that seem too good to be true. Human brains evolved to detect patterns—sometimes too well, finding patterns in randomness. Statistical thinking requires overriding this instinct.

What are the risks of analyzing metrics in isolation?

Isolated metric analysis misses: (1) Tradeoffs—improving one metric degrades another (optimizing clicks reduces purchase quality), (2) Gaming—people optimize for measured metric at expense of actual goals, (3) Context—metric change might be due to external factors not actions taken, (4) Unintended consequences—focus on metric creates perverse incentives, (5) Holistic picture—need multiple metrics to understand full situation. Examples: optimizing for email open rates leads to clickbait subject lines but lower long-term engagement; measuring call center on handle time incentivizes rushing customers off phone regardless of resolution. Prevention: (1) Track complementary metrics simultaneously, (2) Understand relationships between metrics, (3) Consider what behaviors metrics incentivize, (4) Look at leading and lagging indicators together, (5) Regularly review metric systems for gaming. Remember Goodhart's Law: when measure becomes target, it ceases to be good measure.

How can analysts avoid these common mistakes systematically?

Systematic mistake prevention: (1) Pre-registration—decide on analysis before seeing data, (2) Checklist—review common pitfalls before finalizing analysis, (3) Peer review—have others check work, (4) Holdout validation—test conclusions on fresh data, (5) Document assumptions—explicit about what you're assuming, (6) Seek contradictory evidence—actively look for data against your hypothesis, (7) Report uncertainty—confidence intervals not just point estimates, (8) Consider alternative explanations—what else could explain this pattern? (9) Know limitations—explicit about what data doesn't tell you, (10) Statistical training—invest in proper methodology understanding. Culture matters: reward thorough careful analysis over fast confirmatory analysis, create psychological safety to report null results or contradictory findings, and build systematic review processes. Mistakes are inevitable; systematic checks catch most before they cause problems.