Measurement and Metrics Problems

In the early 2000s, a major hospital system began measuring emergency room wait times and publishing the results publicly as part of a quality improvement initiative. Wait times dropped dramatically within months. The hospital administration celebrated. Patients were less enthusiastic. They were being triaged faster, but were spending longer waiting in treatment rooms after triage, because the system had optimized entirely for the measured variable -- time to first clinical contact -- while neglecting the actual goal: total time to completed treatment. The metric improved substantially. The patient experience did not improve, and by some measures worsened, as the bottleneck simply shifted to an unmeasured stage of the process.

This pattern is so consistent that it has its own law. Charles Goodhart, a British economist, observed in a 1975 paper that "any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes." This became Goodhart's Law, typically paraphrased: when a measure becomes a target, it ceases to be a good measure. Organizations measure relentlessly -- generating dashboards, reports, and KPIs -- yet routinely find that their metrics tell a story increasingly disconnected from their actual performance. The problem is not a lack of data. It is a fundamental misunderstanding of what measurement can and cannot accomplish.


The Measurement Trap: How Organizations Fall Into It

Most organizations fall into measurement dysfunction in a predictable sequence. First, they identify something they want to improve -- customer satisfaction, developer productivity, sales effectiveness. Then, they select a metric that seems to represent that thing. Then, they set targets and tie incentives to the metric. Then, people optimize for the metric. Then, the original thing they wanted to improve has not actually improved, and may have gotten worse, while the metric looks excellent.

This sequence is not caused by bad intentions. It is caused by the nature of measurement itself: any metric is a proxy for the thing you actually care about, and incentivizing optimization of the proxy drives a wedge between the proxy and the underlying reality. A proxy captures only some aspects of that reality; once people begin optimizing specifically for the proxy, the captured aspects improve while the uncaptured ones are left free to degrade.

"Not everything that counts can be counted, and not everything that can be counted counts." -- William Bruce Cameron

The hospital emergency room example is not an unusual case. The same pattern appears in software development (measuring lines of code or commits rather than value delivered), in sales (measuring bookings rather than retained, profitable customer relationships), in content marketing (measuring page views rather than audience behavior change), and in education (measuring test scores rather than genuine learning). In each case, the metric chosen to represent a complex outcome becomes the target, and the complex outcome becomes secondary.


Why Organizations Measure the Wrong Things

The Streetlight Effect

Organizations consistently measure what is easy to measure rather than what is important to measure. This bias -- the streetlight effect, named after the apocryphal joke about searching for lost keys under a streetlight because that is where the light is -- produces metrics that are precise and trackable but irrelevant to the organization's actual goals.

Lines of code are easy to count, so they become a proxy for developer productivity. But adding lines of code can reduce software quality, and the best engineering decisions often involve deleting code. Website page views are easy to track, so they become a proxy for content value. But page views reveal nothing about whether readers found the content useful, changed their understanding, or changed their behavior. Training hours completed are easy to log, so they become a proxy for employee capability development. But completing a training module does not indicate understanding, and understanding does not indicate behavioral change.

Easy to Measure (Often Chosen)        | Important but Hard to Measure (Often Neglected)
--------------------------------------|------------------------------------------------------
Lines of code written                 | Software quality, maintainability, and technical debt
Calls made by sales representatives   | Relationship quality and pipeline health
Training hours completed              | Actual skill development and behavior change
Features shipped                      | Customer problems actually solved
Employee satisfaction survey score    | Employee engagement and near-term retention risk
Revenue per quarter                   | Long-term customer lifetime value and margin
Meeting attendance                    | Decision quality and productive alignment
Social media followers                | Audience quality and content impact

The gap between these columns represents the measurement problem in miniature. Organizations know the right column matters more; they measure the left column because it is tractable, defensible, and produces numbers that can be put in a dashboard.

Vanity Metrics and the Psychology of Positive Reporting

Vanity metrics are numbers that increase over time, look impressive in presentations, and correlate with nothing that is actually actionable. Total registered users (rather than active users), cumulative revenue (rather than growth rate), social media followers (rather than engagement quality), and website traffic (rather than qualified traffic) are common examples. They provide a comforting sense of growth while obscuring whether the organization is actually healthy.

Vanity metrics persist because they serve important psychological and political functions that have nothing to do with informational value. They make teams feel good about their work. They make reports to leadership look positive and growth-oriented. They avoid the uncomfortable questions that arise when you measure things that might reveal problems. The organizational immune system protects vanity metrics precisely because replacing them with honest metrics would be politically threatening.

Example: Zynga, the social game company, was one of the most prominent examples of vanity metric culture. In its early growth phase, Zynga reported "Daily Active Users" aggressively in investor communications, and the metric grew impressively through 2012. Reported less prominently: the company was spending more to acquire those users -- through aggressive notification spam and viral mechanics -- than they were worth. When the underlying economics became visible after the company's 2011 IPO, the stock declined approximately 75% from its peak within 18 months. The impressive DAU metric had masked the underlying business problems it was supposed to represent.


The Mechanisms of Metric Dysfunction

Gaming: Rational Behavior in Irrational Systems

When metrics are tied to incentives -- compensation, performance reviews, team recognition -- people predictably find ways to hit the numbers without necessarily delivering the intended underlying outcome. Call center agents measured on average handle time rush customers off the phone before problems are resolved. Teachers in systems with high-stakes test-based accountability teach to the specific test rather than to the broader learning goals the test was intended to measure. Salespeople measured purely on new bookings may discount heavily to accelerate signing deals that are not profitable for the organization.

Gaming is not dishonesty in most cases. It is rational behavior in response to the incentive structure that exists. People do what the system rewards. When the system rewards the metric rather than the underlying outcome, they optimize for the metric. This is why Steven Kerr's foundational 1975 paper, "On the Folly of Rewarding A, While Hoping for B," describes a pervasive organizational pathology rather than an unusual failure: organizations routinely design measurement systems that reward the wrong behaviors while hoping for different ones.

The structural solution to gaming is paired metrics -- for every metric you optimize, identify what might degrade as a result and measure that too. If you measure development velocity, also measure defect rate and customer-reported bugs. If you measure sales bookings, also measure 90-day retention and contract renewal rates. Paired metrics create natural constraints that prevent runaway optimization of any single variable by making the tradeoffs visible.
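The paired-metric idea can be expressed as a small guardrail check: a primary metric "wins" only if none of its countervailing metrics has degraded past an agreed limit. This is a minimal sketch, not from the text; all metric names, values, and thresholds are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class PairedMetric:
    """A primary metric plus the countervailing metrics that constrain it."""
    name: str
    value: float
    target: float
    # guardrails: name -> (current value, worst acceptable value, higher_is_better)
    guardrails: dict = field(default_factory=dict)

    def healthy(self) -> bool:
        """Hitting the primary target only counts if no guardrail degraded."""
        if self.value < self.target:
            return False
        for _, (current, worst, higher_is_better) in self.guardrails.items():
            if higher_is_better and current < worst:
                return False
            if not higher_is_better and current > worst:
                return False
        return True

# Hypothetical example: velocity beats its target, but the defect-rate
# guardrail has degraded past its limit, so the velocity "win" does not count.
velocity = PairedMetric(
    name="story_points_per_sprint",
    value=48, target=40,
    guardrails={
        "defect_rate_per_release": (7.0, 5.0, False),  # lower is better
        "customer_reported_bugs": (3.0, 6.0, False),   # lower is better
    },
)
print(velocity.healthy())  # -> False: defect rate breached its limit
```

The design choice is that the tradeoff is structural: a team cannot report the primary metric as healthy without the guardrail data being present in the same object.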

Metric Proliferation: Dashboard Sprawl

Organizations add metrics over time and rarely remove them. Each new initiative, each new leader, each new strategic concern adds its own measurement. The result is dashboard sprawl -- dozens or hundreds of metrics that no one can meaningfully track simultaneously, where genuine signals are buried in noise.

When everything is measured, nothing gets attention. Teams spread their focus across too many targets, and metrics that actually matter receive the same weight as metrics that are irrelevant or redundant. The discipline of choosing a small number of metrics and accepting the tradeoffs of ignoring the rest is harder than it sounds but essential for effective measurement. Jeff Bezos's famous six-page narrative memo requirement at Amazon, which replaced slide decks, is a management tool for the same underlying problem: if you cannot explain the most important things about a situation in a constrained format, you do not have a clear view of what matters.

The practice of metric audits -- quarterly or annual reviews of every metric the organization tracks to ask which ones are driving decisions and which are reporting theater -- consistently reveals a large proportion of metrics that have outlived their usefulness, were never clearly tied to decisions in the first place, or have created perverse incentives that outweigh their informational value.

Lagging Without Leading: Measurement as Autopsy

Most organizational metrics are lagging indicators -- they measure what has already happened. Revenue, customer churn, employee turnover, customer satisfaction scores, and product quality defects are all lagging indicators. By the time they change, the underlying causes that produced the change have been operating for weeks or months. The measurement confirms what happened; it does not provide time to intervene.

Leading indicators -- metrics that predict future outcomes -- provide the time to intervene before the lagging outcomes materialize. Customer engagement patterns predict churn weeks or months before the customer cancels. Code review velocity and scope creep indicators predict delivery timeline risk before deadlines are missed. Employee sentiment surveys track morale in ways that predict turnover before resignations occur.

The challenge with leading indicators is that their connection to future outcomes must be established empirically, and that connection is typically less direct and less certain than the link between a lagging indicator and the past outcome it records. Organizations often default to lagging indicators because the relationship between metric and outcome is definitional rather than probabilistic: revenue is revenue, churn is churn. The relationship between a leading indicator and the outcome it is supposed to predict is a hypothesis that must be validated.


Output vs. Outcome: The Most Consequential Measurement Failure

One of the most persistently damaging measurement failures is confusing outputs with outcomes. Outputs are what you produce: features shipped, articles published, training sessions delivered, calls completed, reports generated. Outcomes are the results that matter: customer behavior change, business problems solved, value created for users, revenue generated, capability developed.

The confusion persists because outputs are easy to count and directly controllable, while outcomes are harder to measure and influenced by many factors beyond the producing team's control. Teams naturally gravitate toward output metrics because they feel fair -- you can control how many features you ship -- and achievable. But a content team that publishes 100 articles while solving zero audience problems has produced outputs without outcomes. An engineering team that ships 20 features while reducing active user engagement has produced activity without value.

The question that separates useful measurement from metric theater is: for each output metric, can you complete the sentence "When this metric moves in the desired direction, we know [specific outcome] has improved, because we have verified that relationship through [specific evidence]"? If the answer is "we assume the relationship," the output metric may be no more useful than counting lines of code.

This distinction has direct implications for how product development, content creation, and organizational development are measured. The shift from measuring activity to measuring impact consistently reveals that effort is far more evenly distributed than impact -- a small fraction of the work produces most of the outcomes, while the bulk of the work produces little -- and that the highest-leverage opportunity is identifying and expanding the small number of outputs that drive the large outcomes.


Designing Measurement Systems That Work

The Minimal Viable Metrics Set

The most effective measurement systems are small. Rather than attempting to track everything relevant, they identify the three to five metrics that are most directly connected to the outcomes that matter most, and track those rigorously. Everything else is noise that competes for attention with the signal.

Identifying the minimal viable metrics set requires answering a specific question: if this metric moved significantly in the wrong direction, would you change what you are doing? If the answer is yes, the metric belongs in the core set. If the answer is "probably not" or "we'd want to understand why first," the metric may be worth tracking as secondary information but not as a primary performance indicator.
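The decision question can be encoded directly: a metric enters the core set only if it names the action an adverse move would trigger. This is a toy sketch; every metric name and action below is invented for illustration.

```python
# The "would we change what we are doing?" filter. Each candidate metric
# declares the action an adverse move would trigger; metrics with no
# defined action are secondary information at best.
candidates = {
    "90_day_retention":        "shift roadmap toward onboarding fixes",
    "customer_reported_bugs":  "pause feature work, run a hardening sprint",
    "total_registered_users":  None,  # nobody would act on this alone
    "social_media_followers":  None,
}

core_set  = sorted(name for name, action in candidates.items() if action)
secondary = sorted(name for name, action in candidates.items() if not action)

print("core:", core_set)
print("secondary (reporting-theater risk):", secondary)
```

Writing the triggered action down next to the metric is the useful part: it forces the "if this moves, we do X" conversation at selection time rather than after a quarter of passive dashboard-watching.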

Starting from Outcomes

The most durable measurement systems start from outcomes rather than from available data. The sequence: identify the outcome that matters (customer problems solved, capability developed, revenue retained), identify leading indicators that predict that outcome (engagement patterns, skill assessment results, renewal signals), identify behaviors that drive those leading indicators, and finally identify the activities that drive those behaviors. This outcome-first sequence produces a measurement system where every metric connects visibly to an outcome that matters.

The alternative -- starting from available data and asking which metrics it enables -- produces the streetlight effect: measuring what is visible rather than what matters.

Metric Review Cycles

Measurement systems require maintenance. Metrics that were well-designed for last year's goals may be misaligned with this year's goals. Metrics that were accurate proxies before optimization pressure may have become gamed. Metrics that were predictive leading indicators may have lost their predictive validity as the underlying system changed.

Quarterly metric reviews should ask: Is this metric still connected to our current goals? Is there evidence that it is being gamed or that the relationship between metric and outcome has changed? Has it created any perverse incentives we have observed? Should it be retired, replaced, or supplemented with a countervailing metric?

This review discipline is as important as the initial metric selection because the conditions that made a metric appropriate change over time.


The Meta-Problem: Measuring Whether Measurement Works

Organizations rarely assess whether their measurement systems are themselves producing better decisions. They measure products, processes, and people, but they do not measure whether their measurements are changing organizational behavior in directions that improve outcomes.

The ultimate test of any metric is behavioral: when this metric moves, do relevant people change what they are doing in response? And when they change what they are doing in response, do outcomes improve? If a metric moves and no one changes their behavior, it is not a useful metric. If a metric changes behavior but those behavior changes do not improve outcomes, the metric is creating work without producing value.

This behavioral test is rarely applied. Organizations report metrics without asking whether the metrics are driving decisions. They produce dashboards without asking whether anyone is using the dashboards to change priorities. The absence of this feedback loop on the measurement system itself is why dysfunctional metrics can persist for years: there is no systematic process for evaluating whether measurement is working.

For frameworks that connect measurement to decision-making rigorously, the data-driven decision making literature provides the structural tools for closing this loop. The organizations that measure effectively are those that treat measurement as a decision-support tool rather than a reporting requirement -- designing measurement around decisions they need to make, rather than reporting metrics that might eventually influence decisions.


What Research Shows About Measurement and Metrics Problems

Jerry Muller at Catholic University of America published "The Tyranny of Metrics" in 2018, synthesizing historical and contemporary evidence for what he calls "metric fixation" -- the assumption that everything important can and should be measured, and that measured performance can substitute for professional judgment. Muller documented case studies across higher education, medicine, policing, military operations, and financial services, finding a consistent pattern: institutions that adopted high-stakes metric systems showed initial performance improvements on measured variables followed by degradation of unmeasured dimensions of quality. In his analysis of American hospital systems that adopted patient satisfaction scores as primary performance metrics in the 2000s, Muller found that hospitals that improved satisfaction scores showed concurrent increases in opioid prescription rates (prescribing pain medication was the fastest way to improve satisfaction scores) and that this pattern was visible in publicly available data by 2013. His analysis concluded that metric fixation is not a recent phenomenon but follows predictably from the structural incentive created whenever a proxy is made into a high-stakes target.

Donald Campbell at Northwestern University formulated what became known as "Campbell's Law" in a 1979 paper in Evaluation and Program Planning titled "Assessing the Impact of Planned Social Change." Campbell's principle states: "The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor." Campbell derived this principle from his analysis of federal social programs in the 1960s and 1970s that had used test scores, crime statistics, and health indicators as performance targets. In every case he examined, once the indicator became a high-stakes target tied to funding or institutional survival, measured performance on the indicator diverged from the underlying condition it was supposed to represent. Campbell's Law operates independently of individual intent -- it is a systemic property of any measurement system in which the indicator can be manipulated more easily than the underlying outcome it represents. The law was formalized before Goodhart's Law but is often cited together with it because both independently arrived at the same conclusion from different disciplinary starting points.

Robert Kaplan at Harvard Business School and David Norton published "The Balanced Scorecard: Translating Strategy into Action" in 1996, having originally introduced the Balanced Scorecard framework in a 1992 Harvard Business Review article. The framework emerged from Kaplan and Norton's observation that financial metrics alone were systematically leading organizations to underinvest in capabilities that would drive future performance -- specifically customer relationships, internal process quality, and employee learning and development. Their survey of 275 portfolio managers found that the ability to execute strategy was ranked more important than the quality of the strategy itself in 80% of responses, but that nearly all measurement systems tracked financial outputs rather than strategic execution inputs. The Balanced Scorecard proposed measuring performance across four perspectives simultaneously: financial, customer, internal business process, and learning and growth. Kaplan and Norton's 2001 follow-up research across companies that had implemented the framework found that BSC adopters showed 73% higher shareholder value growth over five years compared to matched non-adopters in the same industries, though critics have noted these comparisons suffer from selection bias since early BSC adopters were disproportionately high-performing organizations to begin with.

W. Edwards Deming, the American statistician and management consultant who played a central role in post-war Japanese manufacturing improvement, articulated a critique of numerical quota systems in his 1986 book "Out of the Crisis" that anticipated modern concerns about metric dysfunction. Deming's Point 11 of his 14 Points for Management explicitly called for eliminating numerical quotas for workers and management, arguing that quotas "guarantee inefficiency and high cost" because they set a ceiling on performance for workers who reach quota and cause workers who fall short to compromise quality to hit numerical targets. Deming's consulting work with Japanese manufacturers from the 1950s onward generated extensive documentation of quota elimination experiments: plants that removed piecework quotas and replaced them with process quality improvement systems showed both higher output and lower defect rates within 18 months in over a dozen documented cases. Deming's empirical case rested on the observation that numerical targets in manufacturing invariably led workers to optimize for the measured output at the expense of downstream quality -- precisely the pattern Goodhart's Law predicts.


Real-World Case Studies in Measurement and Metrics Problems

Wells Fargo's fraudulent account scandal, which became public in September 2016, represents one of the clearest documented cases of Goodhart's Law operating at corporate scale. Wells Fargo had implemented a cross-selling measurement system in the 2000s that tracked the average number of products held per customer, with a target of eight products per household and sales quotas enforced through termination for underperforming employees. The Consumer Financial Protection Bureau's 2016 enforcement action, resulting in a $185 million fine, found that bank employees had created approximately 1.5 million fraudulent deposit accounts and 565,000 fraudulent credit card accounts to meet quotas -- the measured metric had been hit through manipulation of the metric itself rather than through genuine customer relationship development. A subsequent independent review commissioned by the Wells Fargo board and published in 2017 found that the incentive structure was identified as a potential fraud risk by internal audit as early as 2004, but that the metric remained the primary performance indicator because it correlated well with short-term revenue. The scandal ultimately cost Wells Fargo over $3 billion in fines and settlements across multiple regulatory bodies.

New York City's "stop and frisk" policing program, at its peak involving approximately 700,000 stops per year in 2011, provided a well-documented case of policing metric dysfunction. The program used number of stops as a primary measure of officer productivity, with performance evaluations tied to stop counts. Academic analysis by Bernard Harcourt and Tracey Meares at the University of Chicago Law School, published in 2011, found that the stop-and-frisk metric created systematic distortions: officers who could not identify probable cause would conduct stops in neighborhoods where stop volume was highest (to hit targets) rather than where criminal activity was concentrated, producing a divergence between measured activity and crime prevention outcomes. The NYPD's own data, released following a 2013 federal court ruling that found stop and frisk unconstitutional in its implementation, showed that 88% of stops resulted in no arrest and no summons -- the activity metric was largely disconnected from the outcome it nominally served. Following a program reduction that began in 2013, violent crime in New York City did not increase as proponents had predicted; it continued its prior downward trend, suggesting the stops had provided minimal crime reduction return.

Google's adoption of Objectives and Key Results (OKRs) beginning in 1999, as documented by John Doerr in "Measure What Matters" (2018), provides a case study of metric system design that attempts to avoid Goodhart's Law through explicit separation of stretch goals from performance evaluation. Google's OKR implementation required employees to set quarterly objectives with measurable key results, but explicitly benchmarked success at 60-70% achievement of key results -- with 100% achievement treated as a signal that targets were set too conservatively. By decoupling OKRs from compensation and performance reviews, Google attempted to prevent the gaming behavior that emerges when measurement directly drives compensation. Doerr's analysis of Google's first decade of OKR usage found that teams using OKRs showed measurably higher goal attainment on company-level priorities (as assessed by quarterly board reviews) than comparable teams without structured OKRs, but that the benefit was concentrated in teams where the OKR review conversations were substantive rather than mechanical. Google's internal people analytics team found in a 2014 study that teams with managers who had substantive OKR discussions showed 40% higher employee satisfaction scores than teams where OKRs existed but reviews were perfunctory.

The UK National Health Service's experience with hospital wait time targets illustrates the speed with which metric gaming can emerge even in public sector organizations with ostensibly aligned incentives. Following the Labour government's 2000 NHS Plan, which introduced a target of maximum 4-hour emergency department wait times, NHS England tracked A&E performance against the target with public reporting and managerial accountability. A 2012 report by the National Audit Office found that several hospital trusts had responded to the target by creating internal "breach avoidance" systems: patients who were approaching the 4-hour threshold were "streamed" to new assessment queues that restarted their wait clock, transferred to different units that were not tracked under the same metric, or admitted to hospital beds to stop the emergency clock even when admission was clinically unnecessary. An academic analysis published in the British Medical Journal in 2015 found that breach rates across NHS trusts showed suspicious discontinuities precisely at the 4-hour mark, with a statistically improbable underrepresentation of patients waiting between 240 and 250 minutes -- a pattern consistent with systematic clock manipulation rather than genuine flow improvement.
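The discontinuity check described in that analysis can be sketched as a comparison of counts in narrow windows on either side of the threshold: on a smoothly varying wait-time distribution the two windows should hold similar counts, so a large imbalance just under the target is the signature of clock manipulation. A toy illustration on invented wait-time data, not the NHS dataset.

```python
def threshold_discontinuity(wait_minutes, threshold=240, window=10):
    """Ratio of cases just below a target threshold to cases just above it.

    On a smooth distribution this ratio should be close to 1; a large value
    suggests records are being pushed under the target.
    """
    just_below = sum(1 for w in wait_minutes if threshold - window <= w < threshold)
    just_above = sum(1 for w in wait_minutes if threshold <= w < threshold + window)
    return just_below / max(just_above, 1)

# Invented data mimicking the reported pattern: a pile-up at 230-239 minutes
# and a near-empty 240-249 band, on top of an ordinary spread of shorter waits.
waits = [233, 235, 236, 237, 238, 238, 239, 239, 239, 244] + list(range(180, 230, 5))
ratio = threshold_discontinuity(waits)
print(f"below/above ratio near 240 min: {ratio:.1f}")
```

A production version would compare the observed ratio against the distribution's overall shape (or a fitted smooth baseline) before calling it manipulation, but the core signal is this asymmetry at the target.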


Frequently Asked Questions

What are the most common measurement and metrics problems?

Measuring what's easy rather than what matters (the streetlight effect), Goodhart's Law (when a measure becomes a target, it ceases to be a good measure), vanity metrics disconnected from outcomes, too many metrics creating noise, lagging indicators without leading ones, and gaming of incentivized metrics.

Why do vanity metrics persist despite being unhelpful?

They're easy to move and report positive growth, provide a false sense of progress, require less critical thinking to interpret, look good in presentations, and avoid uncomfortable questions about actual value creation. A number going up feels good even when it is meaningless.

How does Goodhart's Law undermine performance measurement?

When a metric becomes a target, people optimize for the metric at the expense of the actual goal: teaching to the test rather than for learning, code quantity over quality, call volume over customer satisfaction, engagement over value. Measurement changes behavior in ways that defeat the purpose of measuring.

What's the difference between output metrics and outcome metrics?

Outputs are what you produce (features shipped, content created, calls made). Outcomes are the results that matter (user behavior change, business impact, problems solved). The problem: outputs are easier to measure, but outcomes are what actually matter.

Why do organizations measure too many things?

Fear of missing something important, inability to prioritize, different stakeholders demanding different metrics, easier to add metrics than remove them, and belief that more data equals better decisions. Result: noise obscures signal, nothing gets attention.

How do you know if you're measuring the wrong things?

Indicators: teams gaming metrics, metrics moving while business problems persist, no clear connection between metrics and strategy, metrics don't inform actual decisions, and people can't explain why metric matters or what action it should drive.

What makes a metric actually useful vs. misleading?

Useful metrics are clearly tied to a goal, actionable (they suggest what to do), resistant to gaming, balanced with countervailing metrics, understandable to those using them, and regularly reviewed for continued relevance. If you can't act on a metric, it's reporting theater.