Metrics That Destroyed Performance: Case Studies of How Measurement Backfired, Incentives Corrupted Behavior, and Numbers Replaced Judgment
In 2016, Wells Fargo, one of the largest and most respected banks in the United States, admitted that its employees had created roughly two million bank and credit card accounts without customers' knowledge or consent; the estimate was later revised upward to approximately 3.5 million. Employees had opened accounts, transferred funds, created fake email addresses, and even forged signatures, not because they were inherently dishonest but because they were measured, evaluated, rewarded, and punished based on a single metric: the number of accounts opened per customer.
Wells Fargo's leadership had set a target of eight products per customer--a figure captured in the internal slogan "Going for Gr-eight." Branch employees who met this target received bonuses. Employees who did not meet the target were reprimanded, placed on performance improvement plans, and ultimately fired. In this environment, the rational response for an employee who could not persuade customers to open more accounts was to create accounts without their knowledge. The metric demanded results; the metric got results. The results happened to be fraudulent.
The Wells Fargo scandal is the most dramatic recent example of a pattern that plays out in organizations of every size and sector: metrics, designed to improve performance, instead destroy the performance they are meant to measure. This destruction follows predictable paths--gaming, distortion, neglect of unmeasured dimensions, and corruption of the underlying activity--that have been documented across industries, countries, and centuries. Understanding these paths is essential for anyone who designs, implements, or is subject to performance measurement systems.
The Fundamental Problem: Goodhart's Law and Campbell's Law
Two closely related principles capture the fundamental problem with metrics that become targets:
Goodhart's Law, articulated by British economist Charles Goodhart in 1975 and later distilled by anthropologist Marilyn Strathern into its familiar form: "When a measure becomes a target, it ceases to be a good measure." Goodhart observed the phenomenon in the context of monetary policy: when central banks targeted specific monetary indicators, market participants changed their behavior in ways that made those indicators unreliable. The principle, however, applies universally.
Campbell's Law, formulated by social scientist Donald Campbell in 1979: "The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor." Campbell's formulation is more explicit about the mechanism: metrics do not simply become unreliable; they actively corrupt the processes they measure.
The two laws together describe a dynamic that operates in every measurement system:
- A metric is chosen to represent an important outcome
- The metric is used to evaluate and incentivize performance
- People optimize for the metric rather than for the underlying outcome
- The metric improves while the underlying outcome stagnates or deteriorates
- The gap between the metric and the reality it was supposed to represent widens until the metric becomes meaningless or actively misleading
This dynamic is not a failure of individual character. It is a structural feature of measurement systems that occurs whenever metrics are used as targets, regardless of the intentions of the people who design the system or the character of the people who operate within it.
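This drift can be made concrete with a toy simulation. The sketch below is not drawn from any of the case studies that follow; the effort shares and payoff coefficients are invented purely to illustrate the mechanism, in which gaming inflates the reported metric quickly while quietly eroding the real outcome.

```python
# Toy simulation of Goodhart's Law: as pressure to hit a proxy metric rises,
# effort shifts from genuine work to gaming, and the proxy keeps rising while
# the underlying outcome stalls or declines.
# All coefficients are illustrative assumptions, not empirical estimates.

def simulate(pressure: float, periods: int = 10):
    """Return a list of (proxy, outcome) values for a given level of target pressure (0..1)."""
    proxy, outcome = 0.0, 0.0
    history = []
    for _ in range(periods):
        gaming_effort = pressure          # higher pressure -> more effort spent gaming
        genuine_effort = 1.0 - pressure

        proxy += genuine_effort * 1.0 + gaming_effort * 2.0    # gaming inflates the proxy fast
        outcome += genuine_effort * 1.0 - gaming_effort * 0.5  # gaming quietly erodes the outcome
        history.append((round(proxy, 2), round(outcome, 2)))
    return history

for p in (0.0, 0.5, 0.9):
    final_proxy, final_outcome = simulate(p)[-1]
    print(f"pressure={p:.1f}  reported metric={final_proxy:6.2f}  real outcome={final_outcome:6.2f}")
```

At zero pressure the metric and the outcome move together; at high pressure the reported metric nearly doubles while the real outcome turns negative, which is the gap-widening dynamic in miniature.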
Wells Fargo: How Sales Metrics Drove Systemic Fraud
The Measurement System
Wells Fargo's cross-selling metric was a cornerstone of the company's retail banking strategy. The theory was straightforward: customers who had more Wells Fargo products (checking accounts, savings accounts, credit cards, mortgages, investment accounts) were more profitable, more loyal, and less likely to leave for a competitor. Cross-selling was a legitimate business strategy, and measuring it made strategic sense.
The problem was in the execution. The metric was not merely tracked; it was weaponized:
Aggressive targets. The "eight products per customer" target was ambitious relative to industry norms. Most banks averaged between three and four products per customer. The target created constant pressure on branch employees to sell products that customers did not need and had not requested.
High-stakes consequences. Employees who met targets received bonuses that constituted a significant portion of their compensation. Employees who did not meet targets were subjected to daily and weekly pressure from managers, placed on performance improvement plans, publicly shamed in team meetings, and ultimately fired. The pressure was relentless: managers tracked hourly progress toward daily goals and berated employees who fell behind.
No balancing metrics. The cross-selling metric was not balanced by complementary metrics that might have detected abuse--customer satisfaction scores, account closure rates, complaint rates, or metrics tracking whether opened accounts were actually used. The measurement system measured the quantity of accounts opened without measuring the quality, legitimacy, or customer impact of those openings.
How Wells Fargo Metrics Drove Fraud
The fraud at Wells Fargo was not a conspiracy hatched by senior leadership. It was the emergent result of a measurement system that created overwhelming incentives to game the metric:
Ghost accounts. Employees opened accounts without customer knowledge by using existing customer information on file. They created checking accounts, savings accounts, and credit cards that customers never requested and often never discovered until they received unexpected fees or noticed unfamiliar accounts on their statements.
Pinning. Employees created personal identification numbers (PINs) for debit cards issued on unauthorized accounts, a practice known as "pinning." This allowed the accounts to appear active and avoided the automatic closure triggers that would have flagged them.
Bundling. When customers came in for a single product, employees would open multiple products simultaneously, sometimes burying the additional products in a stack of paperwork that customers signed without reading carefully.
Fake email addresses. To meet requirements that online banking be set up for new accounts, employees created fake email addresses (like noname@wellsfargo.com) and attached them to unauthorized accounts.
The scale of the fraud--millions of fake accounts over a period of years--demonstrates that this was not isolated misconduct by a few bad employees. It was a systemic response to a metric system that made fraud the rational (if unethical) strategy for survival. Employees who refused to create fake accounts were fired for underperformance. The metric system selected for employees willing to commit fraud and eliminated those who were not.
The consequences were severe. Wells Fargo paid $3 billion in fines and settlements. CEO John Stumpf was forced to resign and was personally fined $17.5 million. The company's reputation suffered lasting damage. And the metric that was supposed to improve performance had instead produced criminal behavior at massive scale.
Soviet Nail Factory Metrics: The Classic Parable
When Measured by Weight, Factories Made Heavy, Useless Nails
The Soviet planned economy provides some of the most vivid examples of metric dysfunction in history. Under central planning, factories were assigned production quotas and evaluated based on their achievement of those quotas. The metrics were simple, quantitative, and high-stakes: factories that met their quotas were rewarded; those that did not were punished.
The problem was that the metrics, by necessity, simplified complex production goals into single quantitative measures--and the simplification created opportunities for gaming that systematically undermined the intended outcomes.
The nail factory parable, widely cited in economics and management literature, illustrates the dynamic:
When the nail factory's output was measured by weight (tons of nails produced), the factory produced the largest, heaviest nails possible--railroad spikes and structural bolts that were easy to produce in bulk and weighed a lot per unit. The quota was met. The economy needed small nails for construction, furniture, and general use, but the metric incentivized heavy nails.
When the nail factory's output was measured by quantity (number of nails produced), the factory produced the smallest, thinnest nails possible--tiny brads and pins that could be stamped out rapidly in enormous numbers. The quota was met. The economy needed a variety of nail sizes, but the metric incentivized the smallest, easiest-to-produce nails.
When the authorities tried to specify both weight and quantity targets, factories found other dimensions to optimize: using the cheapest, lowest-quality materials; skipping quality control steps; or producing nails to exact minimum specifications that technically met the target while being functionally inferior.
The Soviet nail factory demonstrates a fundamental principle: any single metric can be gamed by optimizing the measured dimension while degrading unmeasured dimensions. The more narrowly the metric is defined, the more room exists for optimization that satisfies the metric without satisfying the underlying need.
This principle is not limited to Soviet central planning. It operates in every context where metrics are used as targets: modern corporations, government agencies, healthcare systems, educational institutions, and technology companies all produce their own versions of the Soviet nail factory problem.
Teaching to the Test: How Education Metrics Narrowed Learning
How Teaching-to-Test Hurt Education
The No Child Left Behind Act (NCLB), signed into law in the United States in 2002, made standardized test scores the primary metric for evaluating schools, teachers, and educational progress. Schools whose students did not achieve "adequate yearly progress" on standardized tests faced escalating consequences: public identification as "failing," mandatory restructuring, staff replacement, and potential closure or takeover.
The consequences of making test scores the dominant educational metric were predictable and well-documented:
Curriculum narrowing. When test scores in mathematics and reading determine a school's fate, subjects not tested--science, social studies, art, music, physical education--receive less time and fewer resources. Research by the Center on Education Policy found that 44 percent of school districts reduced time spent on non-tested subjects after NCLB's implementation, with some districts cutting science, social studies, and arts instruction by up to 75 percent.
Teaching to the test. When tests measure specific types of knowledge in specific formats, instruction shifts toward those specific types and formats. Teachers spend time on test-taking strategies, practice tests, and drills on tested content rather than on deeper understanding, critical thinking, or creative exploration that tests do not measure. The instruction optimizes for the metric (test performance) rather than the outcome (genuine learning).
Focus on "bubble kids." Schools facing accountability pressure learned to focus resources on students near the proficiency threshold--"bubble kids" who could be pushed over the line from "not proficient" to "proficient" with targeted intervention. Students well above the threshold (who would pass anyway) and students well below the threshold (who would not pass even with intervention) received less attention. The metric incentivized maximizing the number of students above the cut-off, not maximizing learning for all students.
Testing scandals. In extreme cases, the pressure of test-based accountability produced outright cheating. The most prominent example was the Atlanta Public Schools cheating scandal, exposed by a 2011 state investigation that found 178 teachers and administrators across 44 schools had changed students' answers on standardized tests. Superintendent Beverly Hall, who had been named National Superintendent of the Year in 2009 on the strength of the district's fraudulent test score gains, was indicted on racketeering charges, and eleven educators were ultimately convicted of racketeering and related charges.
The Atlanta scandal was the most dramatic, but investigations found similar testing irregularities in Washington D.C., Philadelphia, El Paso, Columbus, and other districts. The metric created incentives powerful enough to drive educators--people who had entered the profession to help children--to commit fraud that harmed the students they were supposed to serve.
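To return to the "bubble kids" incentive described above: the sketch below uses invented test scores, an arbitrary proficiency cut-off, and an assumed tutoring gain to show why a pass-rate metric pays off only for students just below the line.

```python
# Invented test scores (0-100) with a proficiency cut-off at 65, and an assumed
# gain of 3 points per tutored student. Judged purely on pass rate, an hour of
# tutoring spent on a "bubble" student moves the metric; the same hour spent on a
# student far below the line, or already above it, moves it not at all.

scores = [30, 41, 55, 62, 63, 64, 70, 85, 92]
cutoff, gain = 65, 3

def metric_lift(score):
    """1 if tutoring flips this student from 'not proficient' to 'proficient', else 0."""
    return int(score < cutoff <= score + gain)

for score in sorted(scores):
    print(f"score {score:3d}: pass-rate payoff from tutoring = {metric_lift(score)}")
```

However the invented numbers are arranged, the metric's payoff is concentrated in the narrow band just under the cut-off, so that is where accountability pressure pushes the resources.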
The Cobra Effect: When Incentive Metrics Worsen the Problem
What Was the Cobra Effect in Colonial India?
The cobra effect is named after an apocryphal (but illustrative) story from British colonial India. The British government in Delhi, concerned about the number of venomous cobras in the city, offered a bounty for every dead cobra delivered to the authorities. The policy initially succeeded: people killed cobras and collected the bounty.
But enterprising individuals quickly realized that breeding cobras was more profitable than hunting them. Cobra farms were established to breed snakes specifically for the bounty. When the British government discovered the farms and canceled the bounty program, the cobra farmers released their now-worthless breeding stock into the wild. The net result: Delhi had more cobras after the bounty program than before.
The cobra effect illustrates a specific pattern of metric destruction: when a metric creates a financial incentive to produce the thing being measured, people may produce the measured thing artificially rather than achieving the underlying goal. The bounty measured dead cobras as a proxy for cobra reduction, but measuring dead cobras was not the same as measuring cobra reduction. The metric could be satisfied by increasing the cobra population, then harvesting it.
Similar dynamics have been documented elsewhere:
- In Hanoi under French colonial rule, a rat bounty program paid for rat tails (to prove kills without requiring disposal of carcasses). Enterprising residents caught rats, cut off their tails, and released the still-living rats to breed more bounty-eligible offspring.
- In some jurisdictions, towing companies paid per tow have been caught disabling vehicles to increase tow volumes.
- In healthcare systems that pay per procedure, providers may perform unnecessary procedures to increase revenue.
Hospital Wait Time Metrics: Gaming at the Expense of Patient Care
How Hospital Wait Time Metrics Backfired
The United Kingdom's National Health Service (NHS) implemented a target requiring that 98 percent of emergency department patients be seen, treated, and either admitted or discharged within four hours of arrival. The target was intended to reduce dangerously long wait times that had been documented in NHS emergency departments.
The metric produced measurable improvement in reported wait times--but the improvement was substantially achieved through gaming rather than genuine process improvement:
Ambulance queuing. When emergency departments were full and accepting new patients would start the four-hour clock on patients who could not be seen quickly, some hospitals held ambulances in parking lots rather than admitting patients. Patients remained in ambulances, technically not yet "arrived" at the emergency department, while they waited for capacity. The metric clock did not start, but patients were still waiting--in the back of an ambulance rather than in the waiting room.
Corridor patients. Some hospitals moved patients from the emergency department to hospital corridors or assessment units after four hours, counting them as "seen" even when they had not received meaningful treatment. The patients were no longer in the emergency department for metric purposes, but they were not receiving appropriate care either.
Reclassification. Patients who arrived at the emergency department but whose conditions could be reframed as not requiring emergency care were reclassified and redirected, removing them from the four-hour target population.
Preemptive discharge. Some patients were discharged prematurely to meet the four-hour target, only to return later with the same or worsened conditions, increasing overall emergency department workload and potentially harming patients.
Research by Bevan and Hood documented these gaming strategies and concluded that the four-hour target had created "a system in which targets have become more important than patients." The metric was supposed to serve the goal of better patient care; instead, the goal of better patient care became subordinate to the goal of meeting the metric.
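The mechanics of the clock are easy to illustrate. The sketch below uses a hypothetical patient record (none of the timestamps or field names come from an actual NHS data model): because the reported wait is measured from registration in the emergency department rather than from arrival at the hospital, two hours spent queued in an ambulance never enter the metric.

```python
from datetime import datetime, timedelta

# Hypothetical patient record: the metric clock starts at ED registration,
# not at the moment the patient actually reaches the hospital.
ambulance_arrival = datetime(2024, 1, 15, 9, 0)                       # reaches the hospital
ed_registration = ambulance_arrival + timedelta(hours=2)              # held in the ambulance queue
departure = ed_registration + timedelta(hours=3, minutes=50)          # admitted or discharged

reported_wait = departure - ed_registration    # what the four-hour target sees
actual_wait = departure - ambulance_arrival    # what the patient experiences

print(f"Reported wait: {reported_wait}  -> target met: {reported_wait <= timedelta(hours=4)}")
print(f"Actual wait:   {actual_wait}  -> target met: {actual_wait <= timedelta(hours=4)}")
```

The reported figure meets the four-hour target; the patient's experience does not.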
Lines of Code as a Productivity Metric: Software's Measurement Trap
What Happened with the Lines-of-Code Metric?
In the early decades of software development, a common productivity metric was lines of code (LOC): the number of lines of source code produced by a programmer in a given period. The appeal was understandable--lines of code is easy to measure, easy to compare across programmers, and seems to represent productive output.
The metric incentivized exactly the wrong behavior:
Verbose code over concise code. A programmer measured by lines of code has no incentive to write concise, elegant solutions. A function that accomplishes a task in 10 lines is worth less, by the metric, than a function that accomplishes the same task in 50 lines. The metric rewards verbosity and punishes the programming skill of expressing solutions concisely.
Copy-paste over abstraction. When the same code is needed in multiple places, good software engineering practice is to write it once and call it from multiple locations (abstraction). But abstraction reduces line count, while copying and pasting the same code repeatedly increases it. The metric incentivizes the practice that makes software harder to maintain.
Avoiding deletion. Experienced programmers know that deleting unnecessary code is often more valuable than writing new code--it reduces complexity, improves readability, and reduces the surface area for bugs. But deleting code reduces line count. A programmer measured by LOC is punished for improving the codebase through simplification.
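A minimal illustration of the incentive (both functions are invented for this example): the two implementations below have identical behavior, yet a naive line count of the kind early productivity metrics relied on scores the padded version several times higher than the concise one.

```python
import inspect

def total_even_concise(numbers):
    """Sum the even numbers in a list: the concise version."""
    return sum(n for n in numbers if n % 2 == 0)

def total_even_verbose(numbers):
    """Sum the even numbers in a list: same behavior, padded to inflate line count."""
    result = 0
    index = 0
    while index < len(numbers):
        value = numbers[index]
        remainder = value % 2
        if remainder == 0:
            result = result + value
        index = index + 1
    return result

def lines_of_code(func):
    """Naive LOC metric: count non-blank source lines, the way early productivity metrics did."""
    source = inspect.getsource(func)
    return sum(1 for line in source.splitlines() if line.strip())

for f in (total_even_concise, total_even_verbose):
    assert f([1, 2, 3, 4]) == 6   # identical behavior
    print(f"{f.__name__}: {lines_of_code(f)} lines")
```

By this measure, the programmer who replaces the verbose version with the concise one has just destroyed most of their "output."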
Bill Gates reportedly quipped that "measuring programming progress by lines of code is like measuring aircraft building progress by weight." The observation captures the absurdity: the metric measures a quantity (code volume) that has no reliable relationship to the quality (software that works well) it is supposed to represent.
The lines-of-code metric has been largely abandoned in modern software development, replaced by outcome-based metrics (features delivered, bugs fixed, customer problems solved) and process metrics (cycle time, deployment frequency, change failure rate). But the lesson it teaches is universal: measuring the easily quantifiable proxy rather than the actual outcome creates incentives to produce more of the proxy without producing more of the outcome.
Vanity Metrics in Startups: Looking Good While Failing
Why Do Vanity Metrics Harm Startups?
Vanity metrics--measurements that look impressive but do not drive decisions or reflect genuine value creation--are particularly dangerous for startups because they can create a false sense of progress that delays recognition of fundamental problems.
The term was popularized by Eric Ries in The Lean Startup to describe metrics that make founders and investors feel good without providing actionable information about whether the business is actually working.
Total registered users. A startup with 1 million registered users sounds successful. But if only 50,000 of those users have been active in the last month, and only 5,000 are paying customers, the impressive headline number masks a product that most users tried once and abandoned. The registration number goes up (it can only go up--nobody "unregisters"), creating an illusion of growth even as the business may be stagnating or declining.
Page views and downloads. A mobile app with 500,000 downloads sounds like a hit. But if the average session duration is 30 seconds and 90 percent of users never open the app a second time, the download number represents marketing effectiveness, not product value. Measuring downloads without measuring retention, engagement, and satisfaction creates a false picture of product-market fit.
Gross merchandise volume without unit economics. An e-commerce startup that reports $10 million in sales volume sounds impressive. But if the cost of acquiring each customer is $100, the average order value is $50, and the margin on each order is $5, the company is losing $95 on each customer. The gross volume metric looks like success while the underlying economics are catastrophic.
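The arithmetic behind that example is worth making explicit. Below is a minimal sketch using the same hypothetical figures (the single-order assumption is added for illustration):

```python
# Unit economics for the hypothetical e-commerce example above.
# Gross merchandise volume looks healthy; contribution per customer does not.

customer_acquisition_cost = 100.0   # spent to acquire one customer
average_order_value = 50.0          # revenue per order
margin_per_order = 5.0              # gross profit per order after costs
orders_per_customer = 1             # assumed: a single order unless retention proves otherwise

gross_revenue_per_customer = average_order_value * orders_per_customer
profit_per_customer = margin_per_order * orders_per_customer - customer_acquisition_cost

print(f"Gross revenue per customer: ${gross_revenue_per_customer:.0f}")   # feeds the GMV headline
print(f"Profit per customer:        ${profit_per_customer:.0f}")          # the number that matters
```

At a $5 margin it takes twenty orders to recover a $100 acquisition cost, which is why repeat purchase rate and customer lifetime value, not gross volume, are the numbers that determine whether the model works.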
The danger of vanity metrics is not that they are false--the numbers are typically accurate. The danger is that they create a narrative of progress that substitutes for genuine understanding of whether the business is working. Founders who focus on vanity metrics avoid the uncomfortable confrontation with leading indicators that might reveal fundamental problems requiring difficult strategic decisions.
| Metric | Why It Seems Good | What It Actually Measures | Better Alternative |
|---|---|---|---|
| Total registered users | Big number, always growing | Historical sign-ups, including abandoned accounts | Monthly active users, retention rate |
| Page views | High volume suggests popularity | Visits, including bounces and accidental clicks | Engagement rate, time on site, conversion |
| App downloads | Suggests widespread adoption | One-time installation decisions | Daily/weekly active users, session length |
| Social media followers | Large audience reach | Accumulated followers, including bots and inactive | Engagement rate, click-through rate |
| Total revenue | Financial growth narrative | Gross income without cost context | Unit economics, customer lifetime value, CAC |
| Lines of code | Suggests productivity | Code volume regardless of quality | Features shipped, bugs fixed, cycle time |
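As a contrast with the registration headline in the first row, here is a minimal sketch (using a tiny, invented usage log) of the "better alternative" column: active users and period-over-period retention computed from what users actually do rather than from the fact that they once signed up.

```python
from datetime import date, timedelta

# Invented usage log of (user_id, activity date). In practice this would come from
# an analytics store; it is hard-coded here to keep the sketch self-contained.
events = [
    ("u1", date(2024, 5, 2)), ("u1", date(2024, 6, 3)),
    ("u2", date(2024, 5, 10)),
    ("u3", date(2024, 6, 20)), ("u3", date(2024, 6, 25)),
]
registered_users = 1_000_000   # the vanity headline

today = date(2024, 6, 30)
window = timedelta(days=30)

monthly_active = {user for user, day in events if today - day <= window}
print(f"Registered users:     {registered_users:,}")
print(f"Monthly active users: {len(monthly_active)}")
print(f"Active share:         {len(monthly_active) / registered_users:.4%}")

# Simple month-over-month retention: share of May actives who were also active in June.
may_actives = {user for user, day in events if day.month == 5}
june_actives = {user for user, day in events if day.month == 6}
print(f"May-to-June retention: {len(may_actives & june_actives) / len(may_actives):.0%}")
```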
Policing Metrics: When Crime Statistics Shape Police Behavior
Policing provides some of the most consequential examples of metric destruction because the stakes involve public safety, civil rights, and criminal justice.
CompStat and crime statistics. New York City's CompStat system, introduced in 1994, used crime statistics to hold precinct commanders accountable for crime rates in their districts. The system was credited with contributing to dramatic crime reductions in New York in the 1990s and 2000s, and was widely adopted by police departments across the United States.
But investigations revealed that some precincts gamed the statistics:
- Downgrading crimes. Officers reclassified serious crimes as less serious offenses to improve crime statistics. A robbery might be recorded as a larceny; a burglary might be recorded as criminal mischief. The reported crime rate fell, but the actual crime rate did not.
- Refusing to take reports. Officers discouraged victims from filing reports, particularly for lower-level crimes, by telling them that nothing could be done or that the paperwork was not worth the effort. Unreported crimes do not appear in statistics.
- Stop-and-frisk quotas. Under pressure to demonstrate proactive policing, some departments implemented informal quotas for stops, summonses, and arrests. Officers conducted stops of civilians not because they had reasonable suspicion of criminal activity but because they needed to meet their numbers. This practice disproportionately affected Black and Latino communities and was found to be unconstitutional in the 2013 case Floyd v. City of New York.
The policing metric dynamic illustrates a particularly dangerous form of metric corruption: when the people being measured have the power to manipulate the data itself, the gap between the metric and reality can become enormous without external detection.
Healthcare: When Volume Metrics Override Patient Outcomes
In healthcare systems that pay providers based on the volume of services delivered (fee-for-service), the metric of services delivered creates incentives that can conflict with patient welfare:
Unnecessary procedures. When a hospital is paid per procedure, performing more procedures generates more revenue regardless of whether the procedures improve patient outcomes. Research by the Dartmouth Atlas of Health Care found enormous geographic variation in procedure rates for the same conditions, suggesting that local practice patterns and financial incentives rather than medical necessity drive procedure volumes.
Readmission penalties. When Medicare introduced penalties for high hospital readmission rates, some hospitals responded by keeping patients in observation status rather than formally admitting them, reclassifying them in ways that kept them out of the readmission metric. Other hospitals extended initial stays to reduce the probability of readmission within the measurement window, potentially keeping patients hospitalized longer than medically necessary.
Patient satisfaction scores. Hospitals measured on patient satisfaction scores face incentives that can conflict with good clinical practice. Patients who receive antibiotics they request (but do not need) are more satisfied than patients who are told antibiotics are inappropriate. Patients who receive opioid prescriptions for pain are more satisfied than patients offered non-pharmacological pain management. The satisfaction metric can incentivize clinically inappropriate care that makes patients happy in the short term while harming them in the long term.
How Can You Design Metrics That Don't Destroy?
The case studies above might suggest that metrics are inherently destructive. They are not. Metrics are powerful tools for understanding performance, directing attention, and creating accountability. The problem is not measurement itself but measurement systems that are poorly designed, that lack safeguards against gaming, and that substitute metric achievement for genuine understanding of performance.
Measure Outcomes, Not Just Outputs
The most robust defense against metric gaming is measuring outcomes (the results that matter) rather than outputs (the activities that are supposed to produce results). When Wells Fargo measured accounts opened (output) rather than customer satisfaction and relationship depth (outcomes), the metric could be gamed through fake accounts. When education systems measure test scores (output) rather than genuine learning, critical thinking, and skill development (outcomes), the metric can be gamed through teaching to the test.
Outcome metrics are harder to game because they are closer to the thing you actually care about. Genuine customer satisfaction and relationship depth cannot be manufactured by opening fake accounts; genuine mathematical understanding cannot be produced by test-prep drilling alone.
Consider Unintended Consequences Before Implementation
Before implementing any metric, ask: If people optimize for this metric at the expense of everything else, what would happen? This "pre-mortem for metrics" can identify predictable gaming strategies and allow for the design of complementary metrics that guard against them.
Balance Multiple Metrics
Single metrics are the most vulnerable to gaming because they create a single dimension of optimization. When metrics are balanced--measuring quantity and quality, speed and accuracy, efficiency and customer satisfaction--gaming one metric at the expense of another becomes visible through the decline of the complementary metric.
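One simple way to operationalize this, sketched below with invented metric names, values, and thresholds, is to review every quantity metric alongside a designated quality counterweight and treat a rising quantity paired with a falling counterweight as a warning rather than a success:

```python
# Paired metrics: each quantity measure is reviewed with a quality counterweight.
# A rising quantity alongside a falling counterweight is a warning, not a win.
# All names, values, and the 5% threshold are illustrative assumptions.

metric_pairs = [
    # (quantity metric, now, previous, quality counterweight, now, previous)
    ("accounts_opened", 1200, 900, "accounts_used_within_90_days_pct", 0.55, 0.80),
    ("tickets_closed", 340, 310, "customer_satisfaction", 4.4, 4.5),
]

for qty_name, qty_now, qty_prev, qual_name, qual_now, qual_prev in metric_pairs:
    quantity_up = qty_now > qty_prev
    quality_down = qual_now < qual_prev * 0.95   # more than a 5% drop in the counterweight
    if quantity_up and quality_down:
        print(f"WARNING: {qty_name} rose while {qual_name} fell -- investigate before rewarding")
    else:
        print(f"OK: {qty_name} and {qual_name} are moving together")
```

Had Wells Fargo paired accounts opened with the share of new accounts actually funded and used, the divergence between the two numbers might have surfaced the problem much earlier.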
Maintain Human Judgment
The most important safeguard against metric destruction is maintaining the role of human judgment in evaluating performance. Metrics should inform judgment, not replace it. When metrics become the sole basis for evaluation, the humans disappear from the system and the metrics become the reality. When metrics inform a human evaluator who also considers context, quality, relationships, and unmeasured dimensions, the evaluation is richer and more resistant to gaming.
Monitor for Gaming
Every metric system should include mechanisms for detecting gaming: anomalous patterns in the data, discrepancies between metrics and qualitative assessments, unexplained improvements that are not accompanied by visible changes in behavior or process. When gaming is detected, the response should be to fix the measurement system, not merely to punish the individuals who gamed it. If the system incentivizes gaming, the system is the problem.
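Detection does not require anything sophisticated. One common-sense check, sketched below with invented figures, is to compare each reporting unit's improvement against the typical (median) improvement across its peers and audit anything far outside that range before celebrating it:

```python
import statistics

# Invented example: year-over-year improvement (percentage points) reported by each unit.
improvements = {
    "branch_a": 2.1, "branch_b": 1.8, "branch_c": 2.5, "branch_d": 1.9,
    "branch_e": 2.2, "branch_f": 14.0,   # an implausible jump
}

# The median is robust to the outlier itself, unlike the mean.
typical = statistics.median(improvements.values())

for unit, delta in improvements.items():
    if delta > 3 * typical:   # the 3x threshold is an illustrative assumption
        print(f"{unit}: reported improvement {delta:.1f} vs. typical {typical:.1f} -- audit before celebrating")
```

The threshold is arbitrary; the point is that outliers trigger investigation of the measurement system rather than automatic reward.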
Separate Measurement from High-Stakes Consequences
The intensity of gaming is proportional to the stakes attached to the metric. When a metric determines whether you get fired, you will do whatever it takes to meet the metric. When a metric is used for learning, improvement, and informed discussion--without high-stakes consequences--the incentive to game is dramatically reduced. Google's explicit separation of OKRs from compensation is designed to exploit this principle: by reducing the stakes, the system reduces the incentive to game, allowing the metrics to function as genuine learning tools rather than as targets to be achieved by any means available.
Metrics are like fire: essential for civilization, catastrophic when uncontrolled. The organizations and institutions that use metrics effectively are not those that measure more or measure less, but those that measure wisely--understanding the limitations of measurement, anticipating the behavioral responses that measurement creates, and maintaining the human judgment that measurement is meant to inform, not replace.
References and Further Reading
Muller, J.Z. (2018). The Tyranny of Metrics. Princeton University Press. https://press.princeton.edu/books/hardcover/9780691174952/the-tyranny-of-metrics
Goodhart, C.A.E. (1984). "Problems of Monetary Management: The U.K. Experience." In Monetary Theory and Practice. Macmillan. https://en.wikipedia.org/wiki/Goodhart%27s_law
Campbell, D.T. (1979). "Assessing the Impact of Planned Social Change." Evaluation and Program Planning, 2(1), 67-90. https://doi.org/10.1016/0149-7189(79)90048-X
Stumpf, J. (2016). Testimony Before the US Senate Committee on Banking, Housing, and Urban Affairs. https://www.banking.senate.gov/hearings/an-examination-of-wells-fargo
Bevan, G. & Hood, C. (2006). "What's Measured Is What Matters: Targets and Gaming in the English Public Health Care System." Public Administration, 84(3), 517-538. https://doi.org/10.1111/j.1467-9299.2006.00600.x
Ries, E. (2011). The Lean Startup. Crown Business. https://theleanstartup.com/
Ravitch, D. (2010). The Death and Life of the Great American School System. Basic Books. https://en.wikipedia.org/wiki/The_Death_and_Life_of_the_Great_American_School_System
Center on Education Policy. (2008). "Instructional Time in Elementary Schools: A Closer Look at Changes for Specific Subjects." https://www.cep-dc.org/
O'Neil, C. (2016). Weapons of Math Destruction. Crown. https://en.wikipedia.org/wiki/Weapons_of_Math_Destruction
Wennberg, J.E. (2010). Tracking Medicine: A Researcher's Quest to Understand Health Care. Oxford University Press. https://www.dartmouthatlas.org/
Strathern, M. (1997). "'Improving Ratings': Audit in the British University System." European Review, 5(3), 305-321. https://doi.org/10.1002/(SICI)1234-981X(199707)5:3%3C305::AID-EURO184%3E3.0.CO;2-4
Eterno, J.A. & Silverman, E.B. (2012). The Crime Numbers Game: Management by Manipulation. CRC Press. https://www.routledge.com/The-Crime-Numbers-Game/Eterno-Silverman/p/book/9781439846964
Doerr, J. (2018). Measure What Matters. Portfolio. https://www.whatmatters.com/
Ariely, D. (2010). "You Are What You Measure." Harvard Business Review. https://hbr.org/2010/06/column-you-are-what-you-measure
Kerr, S. (1975). "On the Folly of Rewarding A, While Hoping for B." Academy of Management Journal, 18(4), 769-783. https://doi.org/10.5465/255378