Analytics Mistakes Explained: Common Errors That Lead to Wrong Conclusions
Google once ran an A/B test on 41 shades of blue for their toolbar links. Marissa Mayer, then VP of Search Products, championed the approach. The data showed that a specific shade of blue outperformed others, generating an estimated $200 million in additional annual ad revenue. Data-driven decision making at its finest, right?
Not everyone agreed. Doug Bowman, Google's visual design lead, resigned over the culture of testing every tiny detail. "I can't operate in an environment like that," he wrote. The problem was not that Google used data. The problem was that Google mistook measuring everything for understanding anything.
This tension sits at the heart of analytics mistakes: the difference between having data and knowing what it means. Organizations drown in metrics while making decisions based on flawed analysis, biased samples, misunderstood statistics, and conclusions that crumble under scrutiny.
The most dangerous analytics mistakes are not obvious errors. They are subtle, systematic, and seductive---they produce results that look right, feel right, and are completely wrong.
When Correlation Masquerades as Causation
The single most pervasive analytics mistake is treating correlation as causation. Two variables that move together do not necessarily have a causal relationship.
The Mechanics of False Causation
Correlation without causation occurs through three primary mechanisms:
1. Confounding variables -- A third factor drives both observed variables
Example: Ice cream sales and drowning deaths correlate strongly. Nobody believes ice cream causes drowning. Summer heat (the confound) increases both; see the simulation sketch after this list. Yet in business analytics, equally absurd causal claims go unchallenged because the confounding factor is less obvious.
2. Reverse causation -- The assumed direction of causation is backward
Example: Companies that invest heavily in employee wellness programs tend to have healthier employees. Does the wellness program cause health? Or do already-healthy companies invest in wellness programs because they can afford to? Both directions are plausible.
3. Coincidence at scale -- With enough variables, spurious correlations are mathematically guaranteed
Tyler Vigen's "Spurious Correlations" project found that US spending on science, space, and technology correlates with suicides by hanging, suffocation, and strangulation (r = 0.998 from 1999 to 2009). The correlation is real; the causal relationship is nonexistent.
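To make the first mechanism concrete, here is a minimal simulation (all numbers invented) in which temperature drives both ice cream sales and drownings. The two series correlate strongly even though neither influences the other, and the correlation vanishes once the confounder is controlled for:

```python
import numpy as np

rng = np.random.default_rng(42)
n_days = 365

# Invented confounder: daily temperature drives both series.
temperature = rng.normal(20, 8, n_days)

# Neither series depends on the other -- only on temperature plus noise.
ice_cream_sales = 50 + 3.0 * temperature + rng.normal(0, 10, n_days)
drownings = 1 + 0.15 * temperature + rng.normal(0, 1, n_days)

r = np.corrcoef(ice_cream_sales, drownings)[0, 1]
print(f"correlation(ice cream, drownings) = {r:.2f}")   # strong, ~0.7

def residuals(y, x):
    """What remains of y after removing its linear dependence on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Controlling for the confounder makes the association vanish.
r_partial = np.corrcoef(residuals(ice_cream_sales, temperature),
                        residuals(drownings, temperature))[0, 1]
print(f"correlation after controlling for temperature = {r_partial:.2f}")  # ~0
```

Residualizing on the confounder is a crude stand-in for proper causal methods, but it shows how quickly an impressive correlation can evaporate.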
Real Consequences of Correlation Confusion
In 2008, Google launched Google Flu Trends, predicting flu outbreaks based on search query data. Initial results were impressive---the system predicted CDC reports two weeks ahead. By 2013, the model was predicting double the actual flu cases. Media coverage of flu had increased search queries, and the model confused concern about flu with actual flu incidence. Google Flu Trends was quietly discontinued.
The lesson: correlated data can work as a proxy until conditions change. When you do not understand the causal mechanism, your model is fragile.
Sample Size: The Silent Killer of Reliable Insights
Why Small Samples Deceive
Human brains are pattern-detection machines. Show someone a coin that lands heads three times in a row, and they begin constructing narratives about biased coins. The same instinct infects analytics.
Common sample size failures:
- A/B tests with insufficient traffic -- A test showing a 20% conversion improvement with 50 visitors per variant is noise, not signal; the same test with 5,000 per variant can support reliable conclusions. A simulation after this list shows why.
- Customer surveys with low response rates -- A survey sent to 10,000 customers with 200 responses captures the opinions of people who respond to surveys, not customers generally.
- Quarterly business reviews based on small counts -- "Enterprise sales increased 50% quarter-over-quarter" sounds impressive until you realize it went from 4 deals to 6.
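The first failure mode is easy to demonstrate. The sketch below runs A/A comparisons (identical true conversion rates, so any observed lift is pure noise) and counts how often a 20%-or-better lift appears by chance; the rates and traffic figures are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

def big_lift_rate(n_per_variant, true_rate=0.10, n_sims=10_000):
    """How often an A/A test (identical variants) shows a >= 20% lift."""
    a = rng.binomial(n_per_variant, true_rate, n_sims) / n_per_variant
    b = rng.binomial(n_per_variant, true_rate, n_sims) / n_per_variant
    with np.errstate(divide="ignore", invalid="ignore"):
        lift = (b - a) / a
    return np.mean(lift >= 0.20)

for n in (50, 500, 5_000):
    print(f"n = {n:>5}: P(apparent lift >= 20% | no true effect) "
          f"= {big_lift_rate(n):.1%}")
# Large 'lifts' are routine at n = 50 and essentially vanish at n = 5,000.
```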
Statistical Power: What Most Analysts Ignore
Statistical power is the probability that a test correctly detects a real effect. Most business A/B tests are grossly underpowered.
To detect a 5% relative improvement in a 3% conversion rate with 80% power and 95% confidence, you need on the order of 200,000 visitors per variant. Most companies run tests with a fraction of that and celebrate results that are indistinguishable from random noise.
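That figure can be checked with statsmodels' power utilities. This is a sketch assuming a two-sided z-test of proportions; the exact number shifts slightly with the test convention, but it stays in the hundreds of thousands:

```python
# Sample size to detect a 5% relative lift on a 3% baseline conversion
# rate, at two-sided alpha = 0.05 and power = 0.80.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.03
lifted = baseline * 1.05                 # 5% relative improvement -> 3.15%

effect = proportion_effectsize(lifted, baseline)   # Cohen's h
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, ratio=1.0,
    alternative="two-sided",
)
print(f"~{n_per_variant:,.0f} visitors per variant")  # on the order of 200,000
```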
Evan Miller, creator of a widely-used sample size calculator, has argued that the standard practice of "peeking" at test results daily and stopping when significance appears inflates false positive rates from the nominal 5% to 30% or higher.
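Miller's point is straightforward to verify by simulation. The sketch below (traffic figures invented) runs A/A tests, where every significant result is by construction a false positive, and compares peeking daily against a single planned test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def aa_test_significant(daily_n=200, days=30, rate=0.05, peek=True):
    """One A/A test: both variants share the same true rate, so any
    'significant' result is a false positive."""
    a = b = n = 0
    for _ in range(days):
        a += rng.binomial(daily_n, rate)
        b += rng.binomial(daily_n, rate)
        n += daily_n
        if peek and a + b > 0:
            p = stats.chi2_contingency([[a, n - a], [b, n - b]])[1]
            if p < 0.05:
                return True            # stop as soon as it looks significant
    p = stats.chi2_contingency([[a, n - a], [b, n - b]])[1]
    return p < 0.05                    # single planned test at the end

sims = 1_000
for peek in (True, False):
    fp = sum(aa_test_significant(peek=peek) for _ in range(sims)) / sims
    label = "peek daily, stop at significance" if peek else "test once at the end"
    print(f"{label}: false positive rate ~ {fp:.1%}")
# Peeking typically runs several times the nominal 5% rate.
```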
The Minimum Viable Sample
As practical guidelines:
- Below 30 observations: statistical tests are unreliable
- 100-300 observations: basic pattern detection becomes possible
- 1,000+ observations: subgroup analysis starts being meaningful
- 10,000+ observations: small effect sizes become detectable---and this is where you must ask whether statistically significant effects are practically significant
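That last distinction is worth a concrete check. With illustrative counts and statsmodels' two-proportion z-test, a lift of five hundredths of a percentage point becomes highly significant at two million visitors per variant; whether it matters is a business judgment, not a statistical one:

```python
# Illustrative counts: 3.05% vs 3.00% conversion at 2M visitors each.
from statsmodels.stats.proportion import proportions_ztest

conversions = [61_000, 60_000]
visitors = [2_000_000, 2_000_000]

z, p = proportions_ztest(conversions, visitors)
lift = conversions[0] / visitors[0] - conversions[1] / visitors[1]
print(f"absolute lift = {lift:.4%}, z = {z:.2f}, p = {p:.4f}")
# Statistically significant (p well under 0.05), yet the lift is five
# hundredths of a percentage point.
```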
Survivorship Bias: Studying Only the Winners
During World War II, the US military examined bombers returning from missions to determine where to add armor. Bullet holes clustered on wings, fuselage, and tail. The obvious conclusion: reinforce those areas.
Mathematician Abraham Wald at Columbia University's Statistical Research Group saw the flaw immediately. The military was studying only planes that survived. The missing bullet holes---the areas without damage on returning planes---indicated where hits were fatal. Planes hit in engines and cockpits never returned. Wald recommended armoring the engines, not the wings.
Modern Survivorship Bias in Analytics
This error pervades business analysis:
- Startup success analysis -- Studying Y Combinator graduates who became unicorns ignores the thousands with identical strategies that failed. Paul Graham himself has noted that founder characteristics he associates with success are also found in many failed startups.
- Mutual fund performance -- Morningstar tracks active fund performance, but closed funds disappear from the data. The surviving funds show better average returns than the true average of all funds that ever existed.
- Customer satisfaction surveys -- Measuring satisfaction among current customers ignores customers who already churned. The satisfied customers remaining create an artificially positive picture.
- Employee engagement scores -- Disengaged employees often leave before surveys, inflating scores.
Combating Survivorship Bias
- Explicitly ask: "What am I not seeing because it did not survive?"
- Include failure data in every analysis
- Track cohorts from the beginning, not just the survivors
- Compare your sample to the full original population
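A toy simulation shows the mutual fund version of the problem (return distribution and survival rule invented): funds that blow up vanish from the database, and the survivors' average overstates the truth:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical fund universe: 1,000 funds, 10 years of annual returns.
# Funds that ever fall 40% below their starting value are closed and
# vanish from the database -- the survivorship filter.
n_funds, years = 1_000, 10
returns = rng.normal(0.05, 0.15, (n_funds, years))
growth = np.cumprod(1 + returns, axis=1)
survived = growth.min(axis=1) > 0.60

print(f"true average annual return:    {returns.mean():.1%}")
print(f"average among survivors only:  {returns[survived].mean():.1%}")
print(f"share of funds surviving:      {survived.mean():.0%}")
# The survivors' average overstates the true average of the full universe.
```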
Understanding how measurement methods affect results is essential for recognizing survivorship bias and other systematic distortions in data.
Cherry-Picking: The Art of Self-Deception
Cherry-picking is selecting data that supports a predetermined conclusion while ignoring contradictory evidence. It can be deliberate fraud or, more commonly, unconscious confirmation bias.
How Cherry-Picking Manifests
Selective time periods -- A marketing team reports that website traffic increased 40% in Q3. They neglect to mention that traffic dropped 50% in Q2 due to a site migration and the "increase" was partial recovery.
Subgroup hunting -- An A/B test shows no significant overall effect, so the analyst segments by age, gender, device, geography, browser, time of day, and new vs. returning users. After testing 20 subgroups, one shows significance at p < 0.05. By chance alone, you would expect one false positive in 20 tests.
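Subgroup hunting is easy to reproduce with pure noise. The sketch below slices an effect-free A/B test into 20 hypothetical subgroups and tests each one:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# An A/B test with NO true effect, sliced into 20 hypothetical subgroups.
n_subgroups, n_per_cell = 20, 500
false_positives = 0
for _ in range(n_subgroups):
    a = rng.binomial(n_per_cell, 0.10)   # control conversions
    b = rng.binomial(n_per_cell, 0.10)   # variant conversions (same rate)
    table = [[a, n_per_cell - a], [b, n_per_cell - b]]
    if stats.chi2_contingency(table)[1] < 0.05:
        false_positives += 1

print(f"'significant' subgroups out of {n_subgroups}: {false_positives}")
# Expect ~1 by chance; the odds of at least one are 1 - 0.95**20 ~ 64%.
```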
Selective reporting -- A product team runs seven different metrics for a feature launch. Six show no improvement. One shows improvement. The launch presentation highlights the single positive metric.
The Texas Sharpshooter Fallacy
Named after the joke about a Texan who fires bullets into a barn wall and then paints a target around the tightest cluster, this fallacy describes finding patterns in random data and constructing narratives to explain them.
In analytics, this manifests as post-hoc storytelling. A pattern appears in the data, and the analyst retroactively constructs a plausible-sounding explanation. The story feels compelling, but it was not a hypothesis tested---it was a coincidence explained.
Systematic Prevention
- Pre-register analysis plans before accessing data
- Report all metrics, not just favorable ones
- Adjust for multiple comparisons (Bonferroni correction or similar; see the sketch after this list)
- Use holdout validation -- Split data into exploration and confirmation sets
- Require peer review of all analyses before decision-making
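As an example of the third item, here is a minimal sketch using statsmodels' multipletests with made-up p-values; Bonferroni simply multiplies each p-value by the number of tests (capped at 1):

```python
# Adjusting 20 subgroup p-values for multiple comparisons.
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.040, 0.048, 0.120, 0.310] + [0.5] * 15   # 20 tests
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05,
                                         method="bonferroni")

for raw, adj, ok in zip(p_values[:5], p_adjusted[:5], reject[:5]):
    verdict = "still significant" if ok else "no longer significant"
    print(f"raw p = {raw:.3f} -> adjusted p = {adj:.3f} ({verdict})")
```

Only the strongest result survives the correction; the borderline p = 0.04 and p = 0.048 findings, the kind subgroup hunting produces, do not.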
P-Hacking: Manufacturing Statistical Significance
P-hacking (also called data dredging or significance fishing) is the practice of manipulating analysis---often unconsciously---until statistically significant results appear.
How P-Hacking Works
Researchers at the University of Pennsylvania demonstrated p-hacking with a study showing that listening to "When I'm Sixty-Four" by the Beatles made participants 1.5 years younger. The study was real, published, and intentionally p-hacked to expose the problem.
Their techniques:
- Measured multiple dependent variables, reported only the significant one
- Tested multiple conditions, reported only the significant comparison
- Added covariates (father's age, mother's age) until significance appeared
- Stopped data collection when significance first appeared
Each technique individually seems defensible. Combined, they produce "significant" results from noise.
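A compact Monte Carlo sketch, loosely in the spirit of that demonstration (sample sizes, number of looks, and number of measures all invented), shows how just two of those techniques compound:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

def p_hacked_false_positive_rate(n_sims=2_000, n_start=20, n_max=60,
                                 step=10, n_measures=3):
    """Null data plus two 'defensible' moves: test several outcome
    measures, and keep adding participants until something hits p < .05."""
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(0, 1, (n_max, n_measures))
        treatment = rng.normal(0, 1, (n_max, n_measures))  # no true effect
        for n in range(n_start, n_max + 1, step):          # optional stopping
            pvals = [stats.ttest_ind(control[:n, m], treatment[:n, m]).pvalue
                     for m in range(n_measures)]
            if min(pvals) < 0.05:   # report whichever measure 'worked'
                hits += 1
                break
    return hits / n_sims

print(f"false positive rate with two p-hacks combined: "
      f"{p_hacked_false_positive_rate():.1%}")   # far above the nominal 5%
```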
The Replication Crisis Connection
P-hacking is a primary driver of the replication crisis in science. In 2015, the Open Science Collaboration attempted to replicate 100 published psychology studies. Only 36% produced significant results on replication. In fields ranging from cancer biology to economics, similar replication failures have been documented.
Business analytics suffers the same problem but receives less scrutiny. Companies make product decisions based on "significant" A/B tests that were never replicated, never pre-registered, and often analyzed with multiple techniques until significance appeared.
Antidotes to P-Hacking
- Set alpha levels, sample sizes, and analysis methods before collecting data
- Report effect sizes and confidence intervals, not just p-values
- Require replication on independent data before major decisions
- Adopt Bayesian methods that are less susceptible to stopping-rule violations
- Create a culture where null results are valued, not punished
Analyzing Metrics in Isolation: Goodhart's Trap
Goodhart's Law, in its best-known phrasing (due to anthropologist Marilyn Strathern), states: "When a measure becomes a target, it ceases to be a good measure."
Charles Goodhart, a British economist, observed this principle in monetary policy. When central banks target specific monetary indicators, economic actors change behavior to optimize those indicators, destroying their usefulness as measures.
Business Examples of Goodhart's Law
Call center handle time: A telecommunications company measured call center performance by average handle time (AHT). Agents began rushing calls, transferring customers, and hanging up prematurely. Customer satisfaction plummeted. AHT looked great. The actual service was terrible.
Lines of code: IBM reportedly tracked programmer productivity by lines of code in the 1980s. Developers wrote verbose, redundant code. A function that should be 10 lines became 50. Productivity appeared to soar while actual output declined.
Social media engagement: Facebook optimized for engagement metrics, which its algorithms achieved by surfacing divisive, emotionally provocative content. Engagement was maximized. So were misinformation, polarization, and harm to users' mental health. Frances Haugen's 2021 whistleblower disclosures documented how Facebook's internal research showed these harms while the company continued optimizing for engagement.
Student test scores: Teaching to standardized tests improved scores while degrading actual learning. Campbell's Law (a close cousin of Goodhart's) predicts this: quantitative social indicators become corrupted when used for decision-making.
Avoiding the Trap
- Track complementary metrics -- Never optimize a single number. Pair engagement with satisfaction. Pair velocity with quality. Pair revenue with retention.
- Use input metrics alongside output metrics -- Outputs can be gamed; inputs are harder to fake
- Rotate metrics periodically -- Prevents gaming strategies from calcifying
- Maintain qualitative assessment -- Regular human evaluation alongside quantitative tracking
- Ask what behaviors your metrics incentivize -- If the easiest way to hit a target is to do something counterproductive, the metric will drive that behavior
Understanding how to interpret data within its full context prevents isolated metric analysis from leading to destructive optimization.
Simpson's Paradox: When Groups Reverse Trends
Simpson's Paradox occurs when a trend that appears in several groups of data reverses or disappears when the groups are combined.
The Classic Berkeley Admissions Case
In 1973, UC Berkeley was sued for gender bias in graduate admissions. Overall admission rates showed:
- Men: 44% admitted
- Women: 35% admitted
This looked like clear discrimination. But when examined department by department, women were admitted at higher rates than men in most departments. The paradox arose because women applied disproportionately to competitive departments with low acceptance rates, while men applied to less competitive departments with high acceptance rates.
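The reversal is easy to reproduce with two hypothetical departments:

```python
# Two hypothetical departments reproduce the Berkeley pattern: women are
# admitted at a higher rate in each department, yet a lower rate overall,
# because most women applied to the harder department.
applicants = {
    # dept: (men_applied, men_admitted, women_applied, women_admitted)
    "easy dept": (800, 480, 100, 70),   # 60% of men, 70% of women admitted
    "hard dept": (200, 30, 900, 180),   # 15% of men, 20% of women admitted
}

men_app = men_adm = women_app = women_adm = 0
for dept, (ma, mad, wa, wad) in applicants.items():
    print(f"{dept}: men {mad / ma:.0%}, women {wad / wa:.0%}")
    men_app += ma
    men_adm += mad
    women_app += wa
    women_adm += wad

print(f"overall:   men {men_adm / men_app:.0%}, "
      f"women {women_adm / women_app:.0%}")
# Women lead in both departments but trail 51% to 25% overall.
```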
Business Implications
Simpson's Paradox appears regularly in business analytics:
- An overall conversion rate decline might mask improvement in every individual channel---if traffic shifted toward lower-converting channels
- A drug might show positive effects in every subgroup but negative effects overall if the subgroups are unbalanced
- Average salary increases by gender might show men gaining more, while within every department women gain more---because more women were hired into lower-paying departments
Prevention
- Always examine data at multiple levels of aggregation
- Be suspicious when aggregated results tell a different story than disaggregated results
- Understand the composition of your groups before drawing conclusions
- Use causal reasoning, not just statistical patterns, to interpret results
Building Systematic Defenses Against Analytics Errors
Individual awareness is insufficient. Organizations need structural safeguards.
The Analytics Review Checklist
Before any analysis informs a decision, verify:
- Is the sample representative of the population we care about?
- Is the sample large enough to support the conclusions?
- Were analysis methods chosen before examining results?
- Have we looked for contradictory evidence?
- Are we confusing correlation with causation?
- Have we considered confounding variables?
- Are results practically significant, not just statistically?
- Would these findings replicate on new data?
- Are we analyzing metrics in appropriate context?
- Has someone outside the project reviewed the analysis?
Creating a Culture of Honest Analysis
Netflix has often been credited with building a data culture where teams are rewarded for finding that their hypothesis was wrong. This inverts the typical incentive: instead of seeking confirmation, analysts seek truth.
Practices that support honest analysis:
- Reward null results -- Proving something does not work saves time and money
- Peer review -- All consequential analyses reviewed by someone not invested in the outcome
- Pre-registration -- Document what you plan to test before seeing data
- Retrospectives -- Regularly revisit past decisions to check whether the analysis behind them held up against actual outcomes
- Statistical literacy -- Invest in training for everyone who consumes analytics, not just analysts
The goal is not to eliminate mistakes---that is impossible. The goal is to catch mistakes systematically before they drive decisions, and to learn from the mistakes that slip through.
References
- Vigen, Tyler. "Spurious Correlations." tylervigen.com. https://www.tylervigen.com/spurious-correlations
- Simmons, Joseph P., Nelson, Leif D., and Simonsohn, Uri. "False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant." Psychological Science, 2011. https://journals.sagepub.com/doi/10.1177/0956797611417632
- Open Science Collaboration. "Estimating the Reproducibility of Psychological Science." Science, 2015. https://www.science.org/doi/10.1126/science.aac4716
- Miller, Evan. "How Not to Run an A/B Test." EvanMiller.org. https://www.evanmiller.org/how-not-to-run-an-ab-test.html
- Wald, Abraham. "A Method of Estimating Plane Vulnerability Based on Damage of Survivors." Statistical Research Group, Columbia University, 1943. https://people.math.umass.edu/~lavine/Book/book.pdf
- Bickel, P.J., Hammel, E.A., O'Connell, J.W. "Sex Bias in Graduate Admissions: Data from Berkeley." Science, 1975. https://www.science.org/doi/10.1126/science.187.4175.398
- Goodhart, Charles. "Problems of Monetary Management: The UK Experience." Papers in Monetary Economics, Reserve Bank of Australia, 1975.
- Ioannidis, John P.A. "Why Most Published Research Findings Are False." PLOS Medicine, 2005. https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124
- Kahneman, Daniel. Thinking, Fast and Slow. Farrar, Straus and Giroux, 2011.
- Wheelan, Charles. Naked Statistics: Stripping the Dread from the Data. W.W. Norton, 2013.