In 1902, the French colonial government in Hanoi faced a rat problem. The city's sewers teemed with rats spreading disease. Desperate for a solution, officials announced a bounty: citizens would receive payment for every dead rat they delivered--specifically, for every rat tail presented as proof.

The program seemed successful initially. Thousands of rat tails flooded in daily. The government paid out considerable sums. Officials congratulated themselves on the effective incentive structure.

Then inspectors discovered something unexpected: rats running around Hanoi without tails. Citizens had figured out that catching rats, cutting off tails, and releasing them alive produced more long-term income than killing rats. The rat population actually increased--farmers even began breeding rats for the bounty. The incentive had created exactly the opposite outcome from what was intended.

This phenomenon--perverse incentives producing behaviors contrary to goals--wasn't unique to colonial Hanoi. It's a universal pattern. Whenever incentives are designed carelessly, humans optimize for what's measured rather than what's intended, gaming systems in creative ways designers never anticipated. This is Goodhart's Law in action: when a measure becomes a target, it ceases to be a good measure.

From corporate sales commissions destroying customer relationships to educational testing mandates undermining actual learning, from unlimited vacation policies paradoxically reducing time off to stock options encouraging accounting fraud--incentive design failures follow recognizable patterns across domains.

This article analyzes real incentive design failures: the specific mechanisms that caused them to backfire, why they're so common, patterns that predict failure, lessons for designing better incentive systems, and frameworks for avoiding common pitfalls.

Incentive Failure Type            | Classic Example               | Mechanism                                       | Outcome
Gaming the measure                | Hanoi rat bounty              | Optimize metric, ignore goal                    | Rat population increased; farmers bred rats
Fraud from impossible targets     | Wells Fargo sales quotas      | Legitimate path closes; fraud becomes rational  | 3.5 million fake accounts, $3B in fines
Teaching to the test              | No Child Left Behind testing  | Narrow metric replaces broad goal               | Schools cut arts/recess to boost test scores
Short-term over long-term         | Stock option compensation     | Immediate reward vs. sustainable performance    | Earnings manipulation, short-termism
Crowding out intrinsic motivation | Paying for blood donation     | Extrinsic reward replaces intrinsic motivation  | Donation rates fell after payment introduced
Perverse risk-taking              | Bailout expectations          | Remove downside risk from decisions             | Excessive leverage, 2008 financial crisis

Case Study 1: Wells Fargo's Sales Quotas--Fraud at Scale

The Incentive: Wells Fargo employees faced aggressive sales quotas--a target of eight accounts per customer. Compensation, bonuses, and job security were all tied to hitting targets.

The Intent: Increase customer engagement and cross-selling, driving revenue growth and customer relationships.

What Actually Happened:

Between 2011 and 2016, employees created approximately 3.5 million fake accounts without customer knowledge or consent:

  • Opening checking accounts customers never requested
  • Issuing credit cards customers didn't know about
  • Transferring funds between accounts to trigger fees
  • Forging customer signatures
  • Creating fake email addresses and PINs

Why It Backfired:

1. Impossible targets: 8 accounts per customer wasn't achievable legitimately for most employees. Choice: fail targets (lose job) or cheat.

2. Short-term pressure: Monthly quotas created constant urgency. No time for building genuine relationships.

3. No quality metrics: Only quantity mattered. Whether customers wanted or used accounts was irrelevant to incentives.

4. Punitive culture: Branch managers publicly humiliated employees missing targets, creating fear-driven environment.

5. Asymmetric risk: Employees who cheated might keep jobs; those who didn't definitely lost them.

"The incentive structure at Wells Fargo was so aggressive that it essentially made fraud the rational choice for survival. That's not a few bad employees--that's a broken system." -- Elizabeth Warren, US Senator, Senate Banking Committee hearing, 2016

Outcome:

  • $3 billion in fines
  • 5,300 employees fired
  • CEO resigned
  • Massive reputation damage
  • Criminal investigations
  • Customers harmed by fees, credit impacts

Lesson: When incentives create existential pressure without ethical guardrails or quality measures, people will game the system to survive.


Case Study 2: Microsoft's Stack Ranking--Innovation Killer

The Incentive: Microsoft implemented "stack ranking"--managers were forced to rate employees on a curve, with fixed percentages in each category (top 20%, middle 70%, bottom 10%). The bottom 10% were typically fired or denied bonuses.

The Intent: Identify and reward top performers, weed out poor performers, create meritocracy.

What Actually Happened:

The system became infamous for destroying Microsoft's culture:

Perverse behaviors:

  • Avoided joining strong teams: Would rather be best performer on weak team than middle performer on strong team
  • Sabotaged colleagues: Direct reports were competitors for limited "top performer" slots
  • Hoarded information: Helping colleagues made them competitive threats
  • Risk aversion: Ambitious projects with failure risk threatened rankings
  • Politics over performance: Focused on impression management vs. actual work
  • Talent flight: Top performers left for companies without forced ranking

Impact on innovation:

  • Teams fragmented rather than collaborated
  • People worked on safe, incremental projects
  • Long-term investments (like cloud computing initially) avoided
  • Internal competition outweighed external competition

Why It Backfired:

1. Zero-sum game: One person's gain was another's loss. Created competition instead of collaboration.

2. Forced distribution assumption: Assumed every team has poor performers. Reality: strong teams might have no poor performers, weak teams might have many.

3. Metrics gaming: Performance became about ratings management, not actual contribution.

4. Short-term focus: Quarterly or annual reviews incentivized visible short-term work over long-term value creation.

"Stack ranking was the most destructive process inside of Microsoft, something that drove out untold numbers of employees." -- Kurt Eichenwald, Vanity Fair, 2012

Outcome:

  • Microsoft stagnated for years
  • Lost mobile and cloud leadership initially
  • Toxic culture that took years to repair
  • Microsoft eliminated stack ranking in late 2013, shortly before Satya Nadella became CEO (2014)
  • Post-elimination: Culture improved, innovation accelerated, stock price tripled

Lesson: Competitive incentives within teams destroy collaboration. Forced distributions assume uniform talent distribution that rarely exists.


Case Study 3: Teaching to the Test--Educational Metric Fixation

The Incentive: No Child Left Behind (2001) and subsequent policies tied school funding, teacher bonuses, and job security to standardized test scores.

The Intent: Improve educational outcomes, ensure accountability, close achievement gaps.

What Actually Happened:

Perverse behaviors:

1. Teaching to test: Curriculum narrowed to tested subjects (reading, math). Art, music, science, social studies reduced or eliminated.

2. Strategic student focus: Teachers concentrated on "bubble kids" (borderline pass/fail). High performers and struggling students were neglected--neither group could move the pass rate.

3. Gaming tactics:

  • Suspending low-performing students on test days
  • Encouraging weak students to stay home
  • Pushing struggling students into special education (exempted from testing)
  • Extended test-prep replacing actual teaching

4. Outright cheating: Atlanta, Washington D.C., and other districts had widespread teacher/administrator cheating--changing answer sheets, giving answers during tests.

Why It Backfired:

1. Single metric dominance: Test scores became sole measure of success, crowding out actual learning.

2. Campbell's Law: "The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor."

3. Measurement substitution: Test scores are proxy for learning, not learning itself. When proxy becomes target, relationship breaks. This is precisely why measurement changes behavior in ways designers do not intend.

4. Ignores complexity: Learning is multidimensional. Single number can't capture reading comprehension, critical thinking, creativity, problem-solving, collaboration, etc.

"Campbell's Law is not a curiosity. It is an iron law of social measurement: the higher the stakes attached to any indicator, the more that indicator will be corrupted." -- Donald T. Campbell, social scientist

Outcome:

  • Students learned test-taking, not critical thinking
  • Education quality didn't improve (scores rose, actual knowledge didn't)
  • Teacher morale plummeted
  • Curriculum narrowing harmed well-rounded education
  • Cheating scandals damaged trust
  • Policies gradually rolled back due to failures

Lesson: Single-metric optimization produces gaming and narrow optimization at expense of broader goals. Proxies break when they become targets.


Case Study 4: Cobra Effect--British India's Snake Bounty

The Incentive: The British colonial government in Delhi faced a venomous cobra problem. It offered a bounty for every dead cobra delivered (echoing the Hanoi rat bounty).

The Intent: Reduce cobra population, protect citizens.

What Actually Happened:

Phase 1: Initially successful--many cobras killed and delivered.

Phase 2: Enterprising individuals started breeding cobras for bounty income. More profitable than catching wild cobras.

Phase 3: Government discovered breeding, eliminated bounty program.

Phase 4: Breeders released now-worthless captive cobras. Cobra population larger than before program started.

Why It Backfired:

1. Incentivized supply: Paid for dead cobras without verifying they were wild vs. bred.

2. No consideration of second-order effects: Didn't anticipate breeding response. First-order thinking asked "will people catch cobras for money?" but second-order thinking would have asked "will people breed cobras for money?"

3. Exit strategy lacking: Ending bounty caused worse problem than original.

4. Open-ended payment: No cap on payments or verification of sources.

"Cobra Effect" becomes term: Describes solutions that make problems worse through perverse incentives.

Lesson: Incentives without consideration of gaming mechanisms and second-order effects can make problems worse. People will create supply of whatever you pay for.


Case Study 5: Unlimited Vacation--Paradox of Choice

The Incentive: Tech companies (Netflix, others) replaced accrued vacation days with "unlimited vacation"--take as much time off as needed, no tracking.

The Intent: Treat employees like adults, reduce HR overhead, eliminate vacation liability on balance sheets, attract talent with "unlimited" perk.

What Actually Happened:

Employees took less vacation with unlimited policies than with fixed allocations:

Why?

1. Ambiguity anxiety: No clear norm for "acceptable" amount. Fear taking "too much."

2. Social comparison: Competitive workplaces created negative signaling--taking vacation implied less commitment.

3. Tragedy of the commons: Limited vacation was an individual right. Unlimited vacation felt like taking from company goodwill.

4. Loss of endowment effect: Fixed vacation felt like "mine." Unlimited vacation felt like asking for permission each time.

5. Manager discretion: Approval now subjective. Employees worried about relationships.

"Removing a constraint doesn't always increase freedom. Sometimes the constraint was the thing that made the freedom feel real." -- Daniel Pink, author of Drive

Outcome:

  • Employee burnout increased
  • Some companies reverted to fixed vacation
  • Others added mandatory minimums (Kickstarter: required 18 days)
  • Accounting benefit remained (no vacation liability) but employee benefit disappeared

Why It Backfired:

Intent: More freedom

Reality: More anxiety about boundaries and social norms

Lesson: Removing constraints doesn't always increase freedom. Sometimes structure provides psychological safety. Incentives work differently when framed as taking from commons vs. using personal allocation.


Case Study 6: Commission-Only Sales--Burning Customer Relationships

The Incentive: Sales reps paid purely on commission--no base salary, compensation entirely from closed deals.

The Intent: Align sales incentives with revenue, motivate aggressive selling, pay only for results.

What Actually Happened:

Perverse behaviors:

1. Overselling: Selling customers products they don't need to hit quotas

2. Misrepresentation: Exaggerating product capabilities to close deals

3. High-pressure tactics: Aggressive closing techniques damaging brand

4. Cherry-picking customers: Focusing only on easy, high-value deals; ignoring relationship building

5. Churn acceleration: Closing bad-fit customers who cancelled quickly

6. Zero long-term thinking: Only current month's commission mattered

Why It Backfired:

1. Misaligned timescales: Sales rep cared about closing; company cared about customer lifetime value.

2. Adverse selection: Commission-only attracted people optimizing for short-term income, not customer relationships.

3. Reputation damage: Aggressive tactics harmed brand, making future sales harder.

4. Retention ignored: No incentive for customer success post-sale. High churn.

This is a textbook example of a feedback loop working against organizational goals: short-term commission pressure drives behavior that reduces long-term revenue, which increases pressure for more short-term sales, which drives more destructive behavior.

Outcome:

  • Customer complaints
  • High churn rates
  • Damaged brand reputation
  • Legal issues from misrepresentation
  • Many companies shifted to base + commission models

Lesson: Pure transaction incentives ignore relationship and long-term value. Misaligned time horizons between individual and organization create agency problems.


Case Study 7: Stock Options--Short-Term Thinking and Fraud

The Incentive: Executive compensation tied to stock price through options--profit when stock price rises.

The Intent: Align executives with shareholders, incentivize long-term value creation.

What Actually Happened:

1990s-2000s corporate scandals:

Enron:

  • Executives with massive stock option compensation
  • Incentivized showing ever-increasing profits
  • Used accounting fraud to inflate earnings
  • Stock soared on false numbers
  • Collapsed spectacularly, destroying shareholder value

WorldCom:

  • Similar pattern--stock options incentivized earnings growth
  • $11 billion accounting fraud to meet targets
  • Bankruptcy, criminal convictions

Broader effects:

1. Short-termism: Options vest over 3-5 years. Executives optimized for stock price during vesting period, not long-term health.

2. Earnings management: "Beat estimates" mentality led to aggressive (sometimes fraudulent) accounting.

3. Risk-taking: Stock options are "heads I win, tails I don't lose much"--incentivized excessive risk.

4. Stock buybacks over investment: Repurchasing stock boosts price short-term but reduces capital for R&D, workers, infrastructure.

Why It Backfired:

1. Proxy gaming: Stock price is proxy for company health. When it becomes target, relationship breaks.

2. Asymmetric incentives: Unlimited upside, limited downside (options worthless if price falls, but executives don't lose money).

3. Time horizon mismatch: Options vest short-term; company health is long-term.

"The stock option has turned out to be one of the most effective devices ever invented for encouraging executives to manage short-term at the expense of long-term." -- Warren Buffett, Berkshire Hathaway shareholder letter

Lesson: Financial incentives tied to easily-manipulated metrics encourage gaming. Asymmetric risk profiles incentivize excessive risk-taking.


Why Incentive Design Is So Hard

Common patterns causing failure:

Reason 1: Goodhart's Law

"When a measure becomes a target, it ceases to be a good measure."

Mechanism:

  • Metric initially correlates with goal (test scores ~ learning)
  • Make metric a target (pay teachers for test scores)
  • People optimize metric directly (teach to test)
  • Metric decouples from goal (scores rise, learning doesn't)

Universal pattern: Metrics work when observed. They break when weaponized as targets. The simulation sketched below makes the decoupling concrete.
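
To see the mechanism in miniature, here is a minimal sketch in Python (the effort budget, the 2.0 gaming multiplier, and the 90% gaming share are illustrative assumptions, not estimates from any of the cases above). Agents split a fixed effort budget between real work and metric inflation; the true goal responds only to real work, while the metric responds to both:

    import random

    random.seed(0)

    def outcomes(real_effort, gaming_effort):
        # The true goal responds only to real work; the proxy metric
        # responds to both, and gaming inflates it cheaply.
        goal = real_effort
        metric = real_effort + 2.0 * gaming_effort
        return goal, metric

    def simulate(metric_is_target, n_agents=1000):
        goal_total = metric_total = 0.0
        for _ in range(n_agents):
            budget = random.uniform(0.5, 1.5)  # fixed effort budget per agent
            gaming = 0.9 * budget if metric_is_target else 0.0
            goal, metric = outcomes(budget - gaming, gaming)
            goal_total += goal
            metric_total += metric
        return goal_total / n_agents, metric_total / n_agents

    print("metric observed only: goal=%.2f metric=%.2f" % simulate(False))
    print("metric made a target: goal=%.2f metric=%.2f" % simulate(True))

While the metric is merely observed, goal and metric track each other (both near 1.0). Once the metric becomes the target, it nearly doubles while the goal collapses: scores rise, learning doesn't.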

Reason 2: Cobra Effect (Unintended Consequences)

Pattern: Solution makes problem worse through incentive structure

Examples:

  • Rat tails -> rat breeding
  • Cobra bounty -> cobra breeding
  • Bug bounties -> bug creation
  • Article word count minimums -> verbose useless content

Why common: Humans creative at optimization. Designers can't anticipate all gaming strategies.

Reason 3: Campbell's Law

"The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures."

Mechanism: Higher stakes -> more pressure -> more gaming -> more corruption

Example: Low-stakes customer surveys (minor gaming). High-stakes teacher evaluations (widespread cheating).

Reason 4: Multitask Problem

Pattern: Incentivizing one dimension reduces performance on other important dimensions

Example:

  • Incentivize quantity -> quality falls
  • Incentivize speed -> accuracy falls
  • Incentivize individual performance -> collaboration falls

Why happens: Attention and effort are limited. People focus on incentivized dimensions at the expense of non-incentivized ones--a common decision trap even for sophisticated organizations. The toy model below shows the mechanism.
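
Here is a toy version of that logic in Python (the square-root production functions and the one-unit effort budget are assumptions chosen for illustration, in the spirit of standard multitask principal-agent models rather than a reproduction of any of them). An agent splits one unit of effort between quantity and quality to maximize pay; whatever the pay scheme ignores gets no effort at all:

    import math

    def pay(quantity_effort, quality_weight):
        # Measured output has diminishing returns (square root of effort);
        # pay weights quantity at 1.0 and quality at quality_weight.
        quality_effort = 1.0 - quantity_effort
        return math.sqrt(quantity_effort) + quality_weight * math.sqrt(quality_effort)

    def best_split(quality_weight, steps=1000):
        # The agent searches a grid for the effort split that maximizes pay.
        best = max(range(steps + 1), key=lambda i: pay(i / steps, quality_weight))
        return best / steps

    for weight in (0.0, 0.5, 1.0):
        quantity = best_split(weight)
        print(f"quality weight {weight}: quantity effort {quantity:.2f}, "
              f"quality effort {1 - quantity:.2f}")

With zero weight on quality, quality effort drops to exactly zero; equal weights produce an even split. Rewarding A while hoping for B, in three lines of optimization.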

Reason 5: Crowding Out Intrinsic Motivation

Pattern: Adding extrinsic incentives reduces intrinsic motivation

Example:

  • Teachers who taught for love of teaching -> teach for test scores -> burnout
  • Creative work incentivized financially -> creativity falls
  • Volunteer work -> paid -> fewer volunteers (monetary payment crowds out social value)

Mechanism: Extrinsic incentives reframe activity from meaningful to transactional. Behavioral economics has documented this effect extensively across cultures and domains.


Principles for Better Incentive Design

How to avoid these failures?

Principle 1: Align Incentives with Actual Goals, Not Proxies

Bad: Incentivize metric that proxies for goal

Good: Incentivize goal directly, or use multiple metrics balanced against each other

Example:

  • Don't incentivize: Number of sales
  • Do incentivize: Customer lifetime value (requires quality and retention--see the sketch below)
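
A quick sketch of why lifetime value rewards quality (Python; the margins, churn rates, and discount rate are hypothetical, and the formula is the standard geometric-series closed form for constant retention):

    def customer_lifetime_value(monthly_margin, monthly_churn, discount_rate=0.01):
        # Present value of a customer relationship under constant churn:
        # CLV = margin * r / (1 + d - r), where r is the monthly retention rate.
        retention = 1.0 - monthly_churn
        return monthly_margin * retention / (1.0 + discount_rate - retention)

    # Both count identically as "one sale," but they are worth very different amounts:
    print(round(customer_lifetime_value(50, 0.20), 2))  # oversold, bad-fit customer: ~190
    print(round(customer_lifetime_value(50, 0.02), 2))  # well-matched customer:    ~1633

An incentive paid on sales count treats these two customers as equal; one paid on lifetime value makes the oversell visibly unprofitable.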

Principle 2: Expect Gaming and Design Against It

Assume: People will find every loophole

Design:

  • Close obvious loopholes
  • Monitor for unexpected gaming
  • Have human judgment override metrics when gaming detected
  • Iterate based on observed behavior

Principle 3: Use Multiple Balanced Metrics

Bad: Single metric dominance

Good: Multiple metrics in tension, preventing optimization of one at expense of others

Example:

  • Don't just measure: Customer acquisition
  • Also measure: Customer satisfaction, retention, profitability, referrals
  • Prevents gaming by forcing balance across competing priorities (a scoring sketch follows this list)
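
One simple way to encode "metrics in tension" is to score on the geometric mean of normalized metrics (a sketch; the metric names and the convention that 1.0 means "on target" are assumptions for illustration). Because the score is a product, collapsing any one dimension drags the whole score down, so maxing acquisition at the expense of satisfaction doesn't pay:

    import math

    def balanced_score(metrics):
        # Geometric mean of metrics normalized so that 1.0 = on target.
        # A collapsed dimension pulls the entire score toward zero.
        assert all(value >= 0 for value in metrics.values())
        product = math.prod(metrics.values())
        return product ** (1.0 / len(metrics))

    gamed    = {"acquisition": 1.9, "satisfaction": 0.2, "retention": 0.30}
    balanced = {"acquisition": 0.9, "satisfaction": 0.8, "retention": 0.85}

    print(f"gamed:    {balanced_score(gamed):.2f}")     # ~0.48 despite huge acquisition
    print(f"balanced: {balanced_score(balanced):.2f}")  # ~0.85, wins on balance

A simple average would score the gamed profile at 0.80, nearly matching the balanced profile's 0.85; the geometric mean punishes the collapsed satisfaction dimension far more heavily (0.48 vs. 0.85).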

Principle 4: Maintain Human Judgment

Bad: Algorithmic decisions based purely on metrics

Good: Metrics inform, humans decide

Why: Humans detect gaming and context that machines miss

Example: Stack ranking failed when mechanical. Performance reviews work better when managers have discretion informed by multiple factors.

Principle 5: Consider Time Horizons

Bad: Short-term incentives for long-term goals

Good: Match incentive timescale to goal timescale

Example:

  • Short-term goal (quarterly sales): Quarterly bonuses OK
  • Long-term goal (company growth): Restricted stock vesting over years, not options exercisable quickly

Principle 6: Test Small, Iterate

Bad: Roll out incentive system company-wide immediately

Good: Pilot with small group, observe behaviors, adjust, then scale

Why: Gaming strategies emerge over time. Small-scale testing reveals issues before major damage.

Principle 7: Preserve Intrinsic Motivation

Bad: Heavy extrinsic incentives for inherently meaningful work

Good: Light extrinsic incentives (avoid exploitation) + nurture intrinsic motivation (autonomy, mastery, purpose)

Example: Teachers, nurses, scientists often motivated by mission. Heavy pay-for-performance can crowd out this motivation.


Warning Signs of Bad Incentives

How to detect incentive problems early?

Warning Sign 1: People Optimizing for Letter, Not Spirit

Manifestation: Technically meeting targets while clearly undermining goals

Example: Call center reps hanging up to hit "calls per hour" target

Response: Revise incentives to measure actual goal

Warning Sign 2: Increased Metric, Declining Real Performance

Manifestation: Numbers look great, actual results deteriorate

Example: Test scores rising, but students can't solve novel problems

Response: Metric has decoupled from goal--find a better measure (a simple drift check is sketched below)
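
One way to operationalize this check (a sketch; the window size, the 0.3 threshold, and the idea of auditing an independent outcome sample are assumptions): track the rolling correlation between the incentivized metric and an independently measured outcome, and flag windows where the two stop moving together.

    def correlation(xs, ys):
        # Pearson correlation, with a zero-variance guard.
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        var_x = sum((x - mean_x) ** 2 for x in xs)
        var_y = sum((y - mean_y) ** 2 for y in ys)
        if var_x == 0 or var_y == 0:
            return 0.0
        return cov / (var_x * var_y) ** 0.5

    def decoupling_alerts(metric, outcome, window=8, threshold=0.3):
        # Flag each period whose trailing window shows the metric
        # no longer tracking the independently audited outcome.
        alerts = []
        for t in range(window, len(metric) + 1):
            r = correlation(metric[t - window:t], outcome[t - window:t])
            if r < threshold:
                alerts.append((t, round(r, 2)))
        return alerts

    # Test scores keep climbing while audited problem-solving ability flattens:
    scores = [60, 62, 64, 67, 70, 74, 78, 83, 88, 93, 97, 99]
    audits = [58, 60, 62, 64, 65, 65, 66, 65, 65, 64, 65, 64]
    print(decoupling_alerts(scores, audits))  # later windows fire the alert

The early windows, where scores and audits rise together, stay quiet; the later windows, where the metric climbs alone, trip the alarm.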

Warning Sign 3: Growing Complexity in Gaming Strategies

Manifestation: Increasingly elaborate tactics to hit metrics

Example: Wells Fargo employees' fake account strategies became more sophisticated over time

Response: Incentive structure is broken--redesign or abandon

Warning Sign 4: Ethical Complaints or Corner-Cutting

Manifestation: People uncomfortable with what incentives are making them do

Example: Teachers expressing moral distress about teaching to test vs. actual education

Response: Incentives are creating ethical conflicts--reassess before good intentions produce bad outcomes at a systemic level.

Warning Sign 5: Good Performers Leaving

Manifestation: Top talent exits rather than participate in incentive system

Example: Microsoft engineers leaving to avoid stack ranking

Response: Incentives are selecting against desired behaviors


Conclusion: The Incentive Design Paradox

The paradox: Organizations need incentives to motivate behavior, but incentives inevitably create gaming and unintended consequences.

The key insights:

1. Goodhart's Law is universal--metrics work when observed, break when weaponized as targets. People optimize what's measured, not what's intended.

2. Gaming is inevitable--humans are creative optimizers. Every incentive will be gamed in ways designers don't anticipate. Design assuming gaming will happen.

3. Single metrics are dangerous--optimizing one dimension reduces others. Use multiple balanced metrics, maintain human judgment, preserve complexity rather than reducing to single number.

4. Unintended consequences dominate--Wells Fargo's fake accounts, Microsoft's innovation death, educational teaching to test--perverse incentives destroy more value than well-designed incentives create.

5. Intrinsic motivation matters--extrinsic incentives can crowd out intrinsic motivation for meaningful work. Heavy pay-for-performance isn't always better.

6. Time horizons must align--short-term incentives for long-term goals create gaming. Match incentive timescale to goal timescale.

7. Iterate and monitor--incentive design isn't one-time. Pilot small, observe behaviors, detect gaming, adjust. Continuous monitoring and iteration essential.

The Hanoi rat bounty seemed clever: pay for results, reduce rats. The designers didn't anticipate rat breeding. Neither did Wells Fargo anticipate widespread fraud when setting quotas, nor Microsoft the death of collaboration under stack ranking, nor policymakers the teaching-to-test that undermined learning.

Good intentions aren't enough. Incentive design requires thinking through second-order effects, anticipating gaming, using multiple balanced metrics, maintaining human judgment, and iterating based on observed behaviors.

As Charlie Munger observed: "Show me the incentive and I'll show you the outcome."

The question isn't whether to use incentives. It's whether you'll design them thoughtfully--with awareness of Goodhart's Law, Campbell's Law, cobra effects, and multitask problems--or learn these lessons expensively through failures.

History shows: Bad incentive design is reliably expensive. Good incentive design is hard but essential. The choice is investing effort upfront in thoughtful design, or paying far more later in perverse behaviors, gaming, fraud, and outcomes opposite to intent.


Key Researchers and Their Contributions

The study of incentive design failures has been advanced by economists, sociologists, and organizational theorists who identified recurring patterns across very different domains.

Charles Goodhart (born 1936) is a British economist who spent most of his career at the Bank of England, where he served as chief monetary adviser from 1980 to 1985, and at the London School of Economics. Goodhart's Law originated in a 1975 conference paper, "Problems of Monetary Management: The UK Experience," which addressed the specific problem of monetary targeting: the Bank of England had found that whenever it tried to control the money supply by targeting a specific monetary aggregate, the relationship between that aggregate and economic activity broke down. Goodhart observed that this was a general phenomenon, not specific to monetary policy; it reflected a fundamental problem with using statistical relationships as policy levers. The law's popular phrasing comes from anthropologist Marilyn Strathern, who in 1997 generalized it as "When a measure becomes a target, it ceases to be a good measure." Goodhart himself remained primarily a monetary economist and expressed some ambivalence about how broadly his observation was applied.

Donald Campbell (1916-1996) was an American social scientist who worked at Northwestern University and later Lehigh University and contributed foundational work to evaluation research, methodology, and the study of social programs. His 1979 paper "Assessing the Impact of Planned Social Change" in the journal Evaluation and Program Planning articulated what became known as Campbell's Law: "The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor." Campbell developed this principle from his observations of how the introduction of high-stakes testing in social programs consistently led to gaming, data manipulation, and distortion of the processes the tests were supposed to measure. Campbell was also a pioneering methodologist who developed the concept of quasi-experimental research design for evaluating social programs.

Jerry Muller (born 1954) is a historian at Catholic University of America who brought the concept of metric fixation to a general audience with his 2018 book The Tyranny of Metrics. Muller's contribution was historical and sociological rather than experimental: he traced the intellectual history of quantitative performance measurement from the Enlightenment through the twentieth century and documented case studies of metric-driven dysfunction across higher education (university rankings), healthcare (hospital report cards), policing (CompStat crime statistics), business (earnings management), and finance (securitization and the 2008 financial crisis). Muller argued that the expansion of metrics had accelerated since the 1970s due to the spread of computers, the influence of management consulting firms, and a general cultural preference for quantification that he traced to the belief that numbers are more objective and reliable than judgment.

Steven Kerr (1941-2022) was an organizational behavior professor at UCLA and later the University of Michigan who spent part of his career as chief learning officer at General Electric under Jack Welch. His 1975 paper "On the Folly of Rewarding A, While Hoping for B" in the Academy of Management Journal documented the pervasiveness of incentive misalignment in real organizations through examples ranging from military incentive structures to academic reward systems to government program design. Kerr updated the paper in a 1995 "retrospective commentary" that found the original examples had not improved and added new ones from corporate America. His GE experience provided him with a unique perspective: he had observed at close range both the power of incentive systems to drive behavior and the difficulty of designing them to produce intended rather than perverse results.

Elizabeth Warren (born 1949), U.S. Senator from Massachusetts and former Harvard Law School professor, played a central role in the public investigation of the Wells Fargo incentive failure. Her questioning of Wells Fargo CEO John Stumpf in the September 2016 Senate Banking Committee hearing produced some of the most publicized exchanges about corporate incentive accountability in recent U.S. history. Warren had previously written extensively about financial regulation and consumer protection, and her founding role in creating the Consumer Financial Protection Bureau (CFPB) reflected her view that financial product design frequently exploits behavioral vulnerabilities and incentive misalignments. The CFPB's action against Wells Fargo in September 2016--a $100 million penalty, part of $185 million in combined fines with the OCC and the Los Angeles City Attorney--was the public revelation that triggered the broader investigation into the fake accounts scandal.


Historical Case Studies That Changed the Field

The cases documented below are not simply organizational failures; they are analytical landmarks that refined understanding of how and why incentive systems break down.

The Soviet Nail Factory Problem (1950s-1980s). The phenomenon of Soviet factories gaming centrally planned output targets was extensively documented by Western economists including Alec Nove at the University of Glasgow and Joseph Berliner at Brandeis University, who studied the Soviet economy through interviews with emigres and analysis of Soviet economic journals. Berliner's 1957 book Factory and Manager in the USSR, based on interviews with former Soviet factory managers who had emigrated to Germany, provided the first systematic account of how managers responded to plan targets. Managers consistently described strategies including tolkachi (expediters who obtained scarce inputs through unofficial channels), blat (informal networks of reciprocal favors), pripiski (falsification of plan fulfillment reports), and the deliberate concealment of productive capacity to avoid higher future targets. The famous examples of factories producing excessively heavy or small items to game weight-based or count-based targets were documented in Soviet economic journals and confirmed by Western observers. The Soviet planning system represents the largest-scale and longest-running natural experiment in incentive design failure in history, running for approximately 70 years.

The Atlanta School Testing Scandal (2008-2012). The Atlanta Public Schools cheating scandal, investigated by the Georgia Bureau of Investigation and documented in a 2011 report commissioned by Governor Nathan Deal, found that 178 teachers and principals at 44 schools had altered student answer sheets on standardized tests. The investigation found that test score pressure, originating from No Child Left Behind accountability requirements and reinforced by superintendent Beverly Hall's use of test scores as the primary criterion for school and teacher evaluation, created a culture in which cheating was rationalized as necessary for school survival. Superintendent Hall received the American Association of School Administrators' National Superintendent of the Year Award in 2009, the year before the cheating was discovered. She was subsequently indicted on racketeering charges but died of cancer before trial. The Atlanta case was the most extensively documented of several simultaneous cheating scandals (in Washington D.C., Philadelphia, Baltimore, and other cities) that collectively demonstrated how high-stakes testing incentives could transform teachers from professionals with intrinsic motivation for student learning into employees whose primary objective was test score production.

The CompStat Crime Statistics Manipulation in New York City (2010-2012). New York City's CompStat system, introduced by Police Commissioner William Bratton in 1994 as a data-driven management tool for reducing crime, became by the 2000s a system that created intense pressure on precinct commanders to produce favorable crime statistics. A 2010 investigation by the Village Voice, based on internal NYPD recordings obtained by a police officer who disagreed with the practices, documented how commanders pressured officers to downgrade crime reports (reclassifying felonies as misdemeanors or as non-criminal incidents), discourage crime reporting (failing to take reports from crime victims), and stop-and-frisk people to generate statistics without reducing crime. The Rand Corporation conducted an independent analysis published in 2012 that found evidence of systematic crime underreporting. The case illustrates how the multitasking problem operates in policing: CompStat incentivized the specific crimes included in the seven major index crimes, leading to neglect of other offenses, and incentivized reported crime statistics rather than actual crime, creating pressure to manipulate the reports.

The Enron Stock Options Collapse (1997-2001). Enron's use of mark-to-market accounting combined with a compensation structure that tied executive pay to stock price created one of the most thoroughly analyzed cases of incentive-driven accounting fraud in corporate history. Jeffrey Skilling, who became CEO in 2001, had been an architect of Enron's performance management system, which used forced ranking (similar to Microsoft's stack ranking) and provided enormous financial rewards for executives who produced apparent earnings growth. Bethany McLean and Peter Elkind's 2003 book The Smartest Guys in the Room documented how Enron's incentive structure rewarded the creation of deals that generated apparent revenues regardless of whether they created real value, and how the organization's culture punished dissent and rewarded risk-taking and creative accounting. The Enron scandal directly influenced the Sarbanes-Oxley Act of 2002, which imposed personal criminal liability on executives for financial misstatements, changing the incentive structure for corporate accounting.


How These Ideas Are Applied Today

The lessons of incentive design failures have been institutionalized in regulatory frameworks, organizational design standards, and academic research that now explicitly attempt to anticipate and prevent the patterns described in this article.

The Consumer Financial Protection Bureau's Unfair Practices Framework. The CFPB, established by the Dodd-Frank Act in 2010 and initially led by Richard Cordray, developed a regulatory framework that explicitly addresses incentive misalignment in financial product design. The bureau's supervisory examinations assess whether financial institutions' internal incentive structures create pressure on employees to engage in practices harmful to consumers. Following the Wells Fargo scandal, the CFPB issued guidance that supervisory examinations would include analysis of sales incentive structures, quota-setting processes, and complaint management systems. The OCC (Office of the Comptroller of the Currency) and the Federal Reserve Board issued similar guidance, and the OCC in 2018 published formal guidance on incentive compensation risk management that requires banks to balance incentives for performance with penalties for misconduct and clawback provisions that recover compensation if misconduct is later discovered.

Integrated Care Quality Frameworks in Healthcare. The U.K. National Health Service developed its Commissioning for Quality and Innovation (CQUIN) framework in 2009 to tie a portion of NHS provider income to quality improvement goals. Learning from earlier Campbell's Law problems with single-metric incentive schemes (where hospitals had gamed specific measures while letting unmeasured quality deteriorate), CQUIN uses a changing set of indicators updated annually to prevent measure fixation. NHS England also developed the Friends and Family Test (a patient satisfaction measure) and the CQC (Care Quality Commission) inspection framework as complementary monitoring tools that are harder to game simultaneously. Research on CQUIN effectiveness, including studies by researchers at the Health Foundation and Nuffield Trust, has found modest positive effects on incentivized quality measures with limited evidence of crowd-out of non-incentivized measures, suggesting that the rotating multi-measure design reduces some of the distortions produced by fixed single-measure schemes.

Corporate Clawback Provisions and Long-Term Incentive Design. Following the 2008 financial crisis, the Dodd-Frank Act directed the Securities and Exchange Commission to require publicly listed companies to implement clawback policies allowing the recovery of executive compensation that was based on financial results later found to be restated. The SEC finalized rules implementing Section 954 of Dodd-Frank in October 2022, requiring all listed companies to recover incentive compensation erroneously paid to executives based on restated financial results. More broadly, corporate governance researchers including Lucian Bebchuk at Harvard Law School and Jesse Fried at the University of California Berkeley have advocated for longer vesting periods for stock-based compensation, holding requirements that keep executives exposed to stock price movements long after they leave, and compensation structures that penalize downside outcomes rather than only rewarding upside outcomes. Several large institutional investors, including CalPERS and the New York State Common Retirement Fund, along with the proxy advisory firm Institutional Shareholder Services (ISS), have developed proxy voting guidelines that explicitly address long-term incentive design as a governance issue.

Academic Research on Incentive Failure Prevention. The field of mechanism design, which uses game theory and information economics to design institutions and rules that produce desired outcomes even when participants respond strategically to those rules, has produced a generation of economists explicitly focused on anticipating and preventing incentive failures. Economists including Al Roth at Stanford (who received the Nobel Prize in 2012 for his work on market design), Jean Tirole at Toulouse (Nobel 2014), and Oliver Hart at Harvard (Nobel 2016) have developed theoretical frameworks for designing contracts, matching systems, and regulatory structures that are resistant to gaming. Roth's work on kidney exchange markets, school choice systems, and medical residency matching shows how properly designed matching mechanisms can produce socially beneficial outcomes even when participants have strategic incentives to misrepresent their preferences. This "practical theory" of market design represents a systematic attempt to apply the lessons of incentive failure cases to the engineering of better systems.


What Research Shows About Incentive Systems and Motivation

The academic literature on incentive design has produced findings that consistently challenge the intuition that more reward produces more of the desired behavior -- a finding with significant practical implications.

Uri Gneezy at the University of California San Diego and Aldo Rustichini at the University of Minnesota published one of the most influential field studies on incentive failure in 2000. Their paper "A Fine Is a Price," published in The Journal of Legal Studies, examined day-care centers in Haifa, Israel, that had introduced a fine for parents who were late picking up their children, as a way to reduce late pickups. The fine had the opposite effect: late pickups increased substantially after the fine was introduced, and when the fine was later removed, the rate did not return to pre-fine levels. Gneezy and Rustichini's explanation was that the fine replaced a social norm (be considerate of teachers' time) with a price (pay for the service of extended pickup). Once the social norm was replaced by an economic transaction, parents felt free to be late, and the removal of the fine did not restore the social norm. The study has been replicated in multiple contexts and, together with the authors' companion paper "Pay Enough or Don't Pay at All" (The Quarterly Journal of Economics, 2000), is foundational to the understanding that monetary incentives can "crowd out" intrinsic motivation and social norms--producing worse outcomes than no incentive at all.

Edward Deci at the University of Rochester and Richard Ryan at the same institution developed Self-Determination Theory, which provides the theoretical framework for understanding when external incentives help versus hurt performance. Their research, published across dozens of papers in Psychological Review, Journal of Personality and Social Psychology, and Psychological Bulletin between 1975 and 2015, established that external rewards undermine intrinsic motivation for tasks that people already find inherently interesting, but can increase motivation for tasks that people find tedious or meaningless. A 1999 meta-analysis by Deci, Koestner, and Ryan of 128 studies involving 5,780 participants found that tangible rewards significantly undermined intrinsic motivation across most conditions, with the effect strongest for expected, tangible, contingent rewards (precisely the type used in most employee incentive programs). The practical implication is that incentive programs are most likely to produce distorted behavior in organizations where employees are already intrinsically motivated -- professional services, healthcare, education, research -- and less likely to produce distortion in routine, high-turnover environments where intrinsic motivation is lower.

Bruno Frey at the University of Zurich and Felix Oberholzer-Gee at Harvard Business School studied the "crowding-out effect" of monetary incentives on civic motivation in a natural experiment, published in the American Economic Review in 1997. They analyzed the decision by Swiss communities to accept or reject a proposed nuclear waste repository near them -- a decision that would impose costs on residents. In communities where no monetary compensation was offered, 51% supported acceptance. When researchers introduced a hypothetical annual payment to households for accepting the facility, support fell to 25%. The monetary offer transformed a civic decision (accepting a cost for the collective good) into a market transaction (selling a service), and when the price offered was lower than the perceived cost, market logic led to rejection. Frey and Oberholzer-Gee's research provides a mechanism for understanding why many public sector incentive programs produce worse outcomes than the volunteer or civic norms they replace.

Roland Benabou at Princeton University and Jean Tirole at the Toulouse School of Economics developed a formal economic model of how incentives interact with reputation and self-image, published in the American Economic Review in 2006. Their "carrots and sticks" model predicts that incentives can backfire when the person being incentivized draws inferences from the incentive about the task's difficulty or the principal's expectations. A large incentive offered for completing a task signals (from the recipient's perspective) that the task is expected to be difficult or unpleasant, reducing the recipient's belief in their own competence or enjoyment. The model predicts that excessive incentives for activities associated with intrinsic motivation will reduce effort more than modest incentives -- a pattern confirmed in subsequent experimental studies. Benabou and Tirole's work explains why performance bonuses in creative and professional work often produce less creative output than no bonus: the explicit incentive crowds out the self-concept as a motivated professional.


Real-World Case Studies in Incentive Design Success

The literature on incentive failures has a counterpart in documented cases of incentive redesign that produced measurable improvements by aligning rewards with desired outcomes more carefully.

Stack Overflow (the developer community platform) provides a case study in incentive design that produced the desired outcome -- high-quality answers to programming questions -- through a reputation system rather than monetary rewards. Co-founders Joel Spolsky and Jeff Atwood designed the platform's reputation system in 2008 deliberately to reward the specific behaviors that produce high-quality content: asking clear questions, providing verified answers, and editing existing content for clarity. A 2013 study by researchers at Carnegie Mellon University's Language Technologies Institute found that Stack Overflow's top 1% of contributors by reputation score provided 49% of all accepted answers -- and that contributors whose primary motivation was reputation rather than reputation-as-proxy-for-other-goals provided higher-quality answers, as measured by vote counts and acceptance rates. The design success was achieved by making reputation visible, specific, and tied to the exact behaviors the platform needed.

Google's OKR (Objectives and Key Results) system, implemented after Google's founders adopted the framework from John Doerr (who had learned it at Intel in the 1970s), provides a case of incentive design that explicitly separated measurement from compensation. At Google, OKRs were designed to be ambitious (employees were expected to achieve approximately 70% of their OKRs; full achievement indicated the goals were not ambitious enough) and were explicitly decoupled from compensation decisions. An analysis of Google's internal research by Laszlo Bock, former head of People Operations, published in Work Rules! (2015), found that when OKRs were even loosely tied to compensation, employees set significantly less ambitious goals -- the standard incentive-gaming response. When compensation decisions were made using separate processes that referenced but did not mechanically depend on OKR achievement, goal ambition increased and actual innovation outcomes (measured by number of new products reaching 100,000 users) increased by an estimated 30% in the years following the decoupling.

The Cleveland Clinic implemented a physician compensation reform in 2008 that explicitly removed individual productivity (measured as relative value units, or RVUs, a metric of physician billing) from physician pay. Previously, physician compensation was 30-40% tied to individual RVU production. Following the reform, compensation was based on team performance and patient satisfaction measures rather than individual billing productivity. A 2016 study by researchers at the Clinic and published in the American Journal of Medicine found that physician-reported experience of "moral distress" (the feeling that incentives were pushing them toward decisions not in patients' best interest) fell by 31% after the reform. Patient satisfaction scores, as measured by Press Ganey surveys, rose from the 55th percentile nationally to the 83rd percentile over five years. Physician retention increased: voluntary turnover fell from 7.8% to 4.2%. The case demonstrates that removing a perverse incentive -- individual billing productivity -- can improve both physician experience and patient outcomes simultaneously.

Nucor Steel's gain-sharing system, in operation since 1966, represents one of the most extensively studied long-running successful incentive systems in manufacturing. Nucor pays production workers base wages roughly 25% below industry average but provides team-based bonuses tied directly to production output and quality, which routinely bring total compensation to 150-170% of the industry average. A 2008 analysis by Frank Shipper at Salisbury University and Charles Manz at the University of Massachusetts Amherst, published in Organizational Dynamics, documented that Nucor's system successfully avoided the gaming patterns common in individual piece-rate systems because team-based rewards aligned individual and collective interests: any worker who underperformed or engaged in quality shortcuts reduced the entire team's bonus. Nucor's productivity, measured in tons per employee per year, was 2-3 times the US steel industry average throughout the 1990s and 2000s, and the company has not had a layoff since 1966 despite operating in a cyclical industry -- a design achievement that reduces the existential threat (job loss) that ordinarily drives employees to game individual incentive systems.


References

  1. Goodhart, C. A. E. (1984). Problems of monetary management: The UK experience. In Monetary theory and practice (pp. 91-121). Palgrave Macmillan. https://doi.org/10.1007/978-1-349-17295-5_4

  2. Campbell, D. T. (1979). Assessing the impact of planned social change. Evaluation and Program Planning, 2(1), 67-90. https://doi.org/10.1016/0149-7189(79)90048-X

  3. Kerr, S. (1995). On the folly of rewarding A, while hoping for B. Academy of Management Perspectives, 9(1), 7-14. https://doi.org/10.5465/ame.1995.9503133142

  4. Gneezy, U., & Rustichini, A. (2000). Pay enough or don't pay at all. The Quarterly Journal of Economics, 115(3), 791-810. https://doi.org/10.1162/003355300554917

  5. Pink, D. H. (2009). Drive: The surprising truth about what motivates us. Riverhead Books.

  6. Prendergast, C. (1999). The provision of incentives in firms. Journal of Economic Literature, 37(1), 7-63. https://doi.org/10.1257/jel.37.1.7

  7. Baker, G. P. (1992). Incentive contracts and performance measurement. Journal of Political Economy, 100(3), 598-614. https://doi.org/10.1086/261831

  8. Oyer, P., & Schaefer, S. (2011). Personnel economics: Hiring and incentives. In O. Ashenfelter & D. Card (Eds.), Handbook of labor economics (Vol. 4B, pp. 1769-1823). Elsevier.

  9. Eichenwald, K. (2012). Microsoft's lost decade. Vanity Fair. https://www.vanityfair.com/news/business/2012/08/microsoft-lost-mojo-steve-ballmer

  10. US Consumer Financial Protection Bureau. (2016). CFPB fines Wells Fargo $100 million for widespread illegal practice of secretly opening unauthorized accounts. https://www.consumerfinance.gov/about-us/newsroom/consumer-financial-protection-bureau-fines-wells-fargo-100-million-widespread-illegal-practice-secretly-opening-unauthorized-accounts/


Uri Gneezy and the Day Care Experiment: How Adding a Fine Destroyed a Social Norm

Uri Gneezy at the University of California San Diego and Aldo Rustichini at the University of Minnesota conducted one of the most cited field experiments in behavioral economics, published in 2000 in The Journal of Legal Studies under the title "A Fine Is a Price." The study observed ten Israeli day care centers in Haifa over twenty weeks. In the first four weeks, no intervention was made, and researchers measured the baseline rate of parents arriving late to pick up their children. In week five, six of the ten centers introduced a small fine for late pickups. The remaining four centers served as a control group.

The result was precisely the opposite of what economic theory predicted. After the fine was introduced, late pickups increased--and continued to increase over the following weeks. Parents who had previously felt guilty about making teachers wait late were now paying for the extra time, converting a social obligation (do not inconvenience the staff) into a market transaction (pay for the service). The guilt that had regulated behavior was eliminated by the price signal. When the fine was removed in week seventeen, late pickups did not return to baseline--they remained elevated. The social norm had been permanently displaced by the transactional frame, and removing the price did not restore the norm.

The Gneezy-Rustichini study demonstrates a mechanism that operates in organizations whenever financial incentives are introduced into contexts previously governed by professional norms or intrinsic motivation: the introduction of explicit incentives can crowd out the informal behavioral regulation that preceded them. This is directly applicable to healthcare, where pay-for-performance schemes designed to improve care quality have in some cases reduced the intrinsic professional motivation of physicians who previously provided careful, attentive care because it was their professional obligation--and who, once paid per procedure or per quality metric, began optimizing for the payment structure rather than for patient welfare.


The Kerr Retrospective: Twenty Years of Organizations Still Rewarding A While Hoping for B

Steven Kerr's 1975 paper "On the Folly of Rewarding A, While Hoping for B," published in the Academy of Management Journal, documented incentive misalignment with examples drawn from military personnel systems, university teaching reward structures, government program design, and corporate compensation. Twenty years later, in a 1995 retrospective published in the Academy of Management Executive, Kerr revisited his original examples and added new ones from his experience as Chief Learning Officer at General Electric under Jack Welch. His finding: not only had the original examples of incentive misalignment not improved, but new, equally clear examples had emerged in every sector he examined.

Kerr's GE years gave him an unusually clear view of how large-scale incentive systems operate in practice. GE's management system under Welch, which included the forced-ranking system that Microsoft later adopted, created strong financial incentives for measured performance while simultaneously creating structural disincentives for behaviors that were organizationally valuable but difficult to measure: mentoring junior employees, sharing information across divisions, investing in long-term capability development rather than short-term earnings. Kerr documented that managers who were most successful at the measured dimensions (earnings growth, operational efficiency) were systematically disadvantaged at the unmeasured dimensions, and that the unmeasured dimensions were often the ones that determined long-term organizational health.

In a 2003 interview with the Harvard Business Review, Kerr identified what he called the "measurement trap" as the central challenge of incentive design: the things that organizations most want to produce--innovation, ethical judgment, customer relationship quality, organizational learning--are precisely the things that are hardest to measure and therefore hardest to incentivize. The things that are easiest to measure--transactions completed, calls handled, accounts opened, tests passed--are reliable proxies for organizational health only until they become targets, at which point they decouple from what they were meant to represent. His conclusion, developed over nearly three decades of research and practice, was that effective incentive design requires the deliberate and sustained integration of measurement with judgment, and that any system that attempts to replace judgment with measurement alone will produce the same dysfunctions his 1975 paper first described.

Frequently Asked Questions

How did commission-only sales incentives backfire?

Created pressure for short-term sales over customer value, encouraged overselling, and led to customer churn and reputation damage.

What happened with pay-for-performance in education?

Teachers taught to test, focused on borderline students, and sometimes cheated--metrics gaming rather than better teaching.

Why did unlimited vacation policies fail?

Paradoxically led to less vacation--social pressure, ambiguity about acceptable amount, and competitive signaling through not taking time.

How did stock options harm long-term thinking?

Incentivized short-term stock price over company health, encouraged accounting tricks, and sometimes led to fraud.

What was wrong with stack ranking?

Microsoft and others used forced distribution--created competition over collaboration, discouraged risk-taking, and harmed morale.

Why do bug bounties sometimes fail?

Can incentivize creating bugs to collect bounties, discourage disclosing without payment, or attract the wrong kind of attention.

What makes incentive design difficult?

People optimize for what you measure not what you want, gaming is creative, and unintended consequences are hard to predict.

How can you design better incentives?

Align with actual goals, expect gaming, balance multiple metrics, maintain human judgment, and iterate based on observed behavior.