In colonial India, the British administration in Delhi grew concerned about the city's cobra population. Their solution was straightforward: offer a bounty for every dead cobra. Citizens would collect the snakes, kill them, and claim payment. Simple incentives, measurable outcomes, problem solved.

Except it wasn't. Enterprising locals began breeding cobras specifically to collect the bounty. When the colonial government discovered this and cancelled the program, the cobra farmers released their now-worthless snakes, and the population surged beyond its original level.

This story — possibly apocryphal, but instructive regardless — gave its name to a phenomenon that plagues every organization that measures performance: the Cobra Effect. When you create an incentive for a metric, rational actors respond to the incentive rather than the underlying goal. The measure improves while the reality it was supposed to represent gets worse.

Understanding how and why metric gaming happens, what forms it takes in modern organizations, and how to design measurement systems that resist it is one of the most practically valuable things a manager or team lead can learn.


Goodhart's Law: The Theory Behind the Problem

The formal version of this problem was stated by British economist Charles Goodhart in a 1975 paper on monetary policy. Goodhart's Law, as it came to be known, states:

"Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes."

The simplified version, widely attributed to anthropologist Marilyn Strathern, captures it more memorably: "When a measure becomes a target, it ceases to be a good measure."

Goodhart's original context was narrow — he was observing how targeting monetary aggregates in UK policy caused those aggregates to lose their predictive value. But the principle generalizes widely. Every metric is a proxy for something we actually care about. The metric works as a proxy only as long as no one has a strong incentive to manipulate it. Once it becomes a target, the relationship between the metric and the underlying reality breaks down, because people can optimize for the number without changing (or while actively degrading) the underlying reality.

This insight sits at the intersection of economics, psychology, and organizational theory. It appears in different forms across multiple disciplines: as the Lucas Critique in macroeconomics (Robert Lucas, 1976), which observes that economic models based on historical behavior break down once agents become aware the model is being used to make policy; as Campbell's Law in social science (Donald Campbell, 1979), which states that "the more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures"; and as the observer effect in management, where the act of measuring changes what is measured.

These parallel discoveries across independent fields are not coincidental. They reflect a fundamental property of any system in which intelligent agents adapt their behavior to measurement regimes: the agents will optimize for what is measured, and what is measured will cease to predict what it used to predict.


Why Metric Gaming Is Rational

Before examining examples, it is important to establish that metric gaming is not primarily a moral failure. It is the rational response to misaligned incentives.

Employees face two kinds of demands simultaneously:

  1. Do the actual work that achieves the organization's goals.
  2. Perform well on whatever metrics are used to evaluate and reward them.

When these two demands are aligned, metric gaming is unnecessary — doing good work naturally produces good metrics. But when they diverge, employees face a choice between doing their job well and appearing to do their job well. In most organizational environments, appearance wins, because:

  • Visibility: Metrics are visible to management; underlying quality is often not.
  • Speed: Metric gains are immediate; real-world outcomes may lag by months.
  • Reward: Compensation, promotion, and recognition are tied to metrics.
  • Safety: Missing metrics has consequences; subtly degrading quality often does not.

This is not cynicism. It is the predictable output of any system that rewards proxy measures rather than real outcomes. As W. Edwards Deming argued repeatedly in his work on quality management, the vast majority of performance problems are system problems, not individual failures. Deming's Point 11 from his famous 14 Points for Management explicitly warns against management by objectives and numerical quotas, noting that they destroy pride of workmanship by encouraging people to meet the number rather than do the job.

Alfie Kohn (1999) in Punished by Rewards synthesized decades of research on incentive systems and concluded that extrinsic rewards — including performance metrics tied to pay — consistently reduce intrinsic motivation, creativity, and long-term performance. The more important a metric becomes in evaluation, the more it distorts the behavior it was designed to measure.


How Metric Gaming Manifests in Practice

Call Center Metrics Gaming

The call center is the canonical example of metric gaming because its measurement systems are so explicit and the consequences so visible.

Common call center metrics include:

  • Average Handle Time (AHT): time spent per call
  • First Call Resolution (FCR): percentage of issues resolved on first contact
  • Customer Satisfaction Score (CSAT): post-call survey ratings
  • Calls per Hour: volume throughput

When AHT is targeted, agents rush customers off the phone. Calls get shorter. AHT improves. FCR plummets as issues are not actually resolved. Customer satisfaction falls. When FCR is subsequently targeted, agents may put callers on hold indefinitely rather than escalate, claim resolution on calls that are not resolved, or instruct customers not to call back.

When CSAT becomes the key metric, agents game surveys directly — coaching customers to give high ratings, suggesting at the end of calls "Is there anything I could have done to deserve a 10 today?", or excluding difficult customers from survey invitations where possible.

A 2018 study by Forman and Garg published in the Journal of Service Research found that frontline service workers in measured environments reported significantly higher rates of survey coaching behavior when CSAT was directly tied to individual bonuses compared to when it was tracked at the team level. The individual incentive, designed to improve customer satisfaction, reduced it by producing survey results that no longer reflected actual customer experience.

The individual metrics each look fine. The customer experience deteriorates.

Agile Velocity Inflation

In agile software development, velocity measures the number of story points a team completes per sprint. It is designed as a planning tool — to help teams estimate how much work they can take on — not as a performance metric.

When management begins using velocity as a measure of team productivity, the gaming begins almost immediately.

Teams facing velocity pressure typically respond by:

  Gaming Behavior                          How It Looks                   What It Hides
  Inflating story point estimates          Higher points per story        No actual complexity increase
  Breaking stories into many small tasks   More completions per sprint    Artificial fragmentation of work
  Prioritizing easy "low-hanging fruit"    Velocity improves              Technical debt and complex work accumulate
  Marking stories complete prematurely     Sprint burndown looks clean    Rework hidden in future sprints
  Avoiding hard-to-estimate work           Predictable velocity           Important, complex work delayed

The result is a team whose velocity numbers climb while actual output — features users value, technical quality, system reliability — stagnates or declines. Velocity, intended as a team planning tool, has been converted into a performance theater metric.

This dynamic was described prophetically in Robert D. Austin's 1996 book Measuring and Managing Performance in Organizations, which remains one of the most rigorous treatments of why measurement dysfunction occurs. Austin distinguishes between motivational measurement (used to evaluate and reward) and informational measurement (used to learn and improve), arguing that the two functions are fundamentally incompatible: measurement used for reward will always tend toward gaming, while measurement used for learning can remain honest.

Martin Fowler, one of the original authors of the Agile Manifesto, has written extensively on velocity gaming, describing it as one of the most common failures in agile adoption. He notes that the solution is not better measurement of velocity but abandoning velocity as a cross-team comparative metric and returning it to its intended use as a single-team planning tool.
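Used that way, velocity is simply an input to a forecast. A minimal sketch of the intended use, with an assumed function name and illustrative numbers:

```python
# Minimal sketch of velocity as a single-team planning input, not a
# performance score. Function name and numbers are illustrative.
from statistics import mean

def sprints_remaining(backlog_points: float, recent_velocities: list[float]) -> float:
    """Forecast sprints left from a rolling average of recent velocity."""
    avg_velocity = mean(recent_velocities[-3:])  # last 3 sprints smooth out noise
    return backlog_points / avg_velocity

# 120 points of backlog; recent sprints completed 25, 28, 32, 30 points.
print(sprints_remaining(120, [25, 28, 32, 30]))  # 120 / mean(28, 32, 30) = 4.0
```

Because the number is consumed only by the team's own forecast, inflating it gains nothing: the forecast simply becomes wrong.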

Sales Metrics Gaming

Sales organizations face similar dynamics. When call volume is measured, salespeople make shallow calls that do not advance relationships. When demos booked is the target, salespeople book unqualified demos that waste engineering time. When pipeline value is tracked, they inflate deal sizes or include early-stage prospects in committed pipeline.

When annual quota achievement is the primary metric and has cliff effects (nothing for 99%, a bonus at 100%+), salespeople engage in sandbagging — deliberately pushing deals into the next period once the current quota is either already met or out of reach — creating artificial lumpiness in revenue that makes forecasting impossible.

The consequences extend beyond forecasting. A study by Larkin, Pierce, and Gino (2012) published in the Academy of Management Journal found that highly competitive sales environments with strong individual metric incentives produced significantly higher rates of misrepresentation to customers — salespeople overstated product capabilities, understated pricing complexities, and made commitments the product could not keep — compared to environments with more balanced incentive structures.

Healthcare Metrics Gaming

The UK's National Health Service provides a well-documented case. After waiting time targets were introduced in emergency departments (patients must be seen within four hours), hospitals found creative ways to meet the metric without reducing actual waits:

  • Patients were assessed briefly by a nurse to formally "start the clock," then placed back in the waiting area.
  • Ambulance handovers were delayed at hospital entrances so the four-hour clock did not start.
  • Admissions were reclassified as "day cases" to avoid the target's scope.

The metric improved. The experience of patients waiting in ambulances or hallways did not.

Bevan and Hood (2006) documented this dynamic in a rigorous study published in Public Administration, examining the NHS targets regime across multiple metrics. Their research found that each new target introduced generated new gaming behaviors within weeks of implementation, and that the gaming often created worse outcomes than the original problem the target was designed to address.

Similar dynamics have been documented in education. The No Child Left Behind Act in the United States, which tied school funding to standardized test scores, produced extensive evidence of teaching to the test, score manipulation, and in some documented cases, outright cheating by administrators under pressure. An investigation in Atlanta, Georgia, which began in 2009, eventually resulted in criminal convictions of school administrators who had altered student test answers to meet performance targets.

Software Bug Count Gaming

Software quality metrics provide a particularly instructive example because the gaming is technically sophisticated and invisible to anyone not familiar with development practices.

When development teams are measured on bug count (number of open defects), they respond predictably: bugs are closed prematurely, reclassified as "features" rather than defects, combined into composite issues that count as single bugs, or simply not logged in the first place. A development team whose bug count metric looks clean may be producing software that is actively getting worse — the bugs are simply no longer being recorded honestly.

When teams are measured on zero-defect sprints (sprints with no production bugs reported), they respond by making production monitoring less sensitive, routing bug reports to different queues, or defining "production" more narrowly. The metric improves while the product quality deteriorates.

Gene Kim, Patrick Debois, John Willis, and Jez Humble (2016) in The DevOps Handbook describe this problem extensively and recommend measuring deployment frequency, lead time for changes, mean time to restore, and change failure rate as a portfolio — arguing that these four metrics, taken together, are significantly harder to game simultaneously than any single metric in isolation.


The Deeper Problem: Measurement Changes What Is Measured

There is a subtler effect beyond gaming. The act of measuring something changes how people think about and perform the underlying activity. This is sometimes called the observer effect in management.

When teachers are evaluated on test scores, they teach to the test. This is rational, but it changes the nature of education — more time drilling tested skills, less time on untested creative thinking, collaborative projects, or content outside the exam's scope. Over time, what is tested tends to become what is taught, even among teachers who believe in broader educational goals.

This effect operates even without explicit gaming. The measurement creates salience that reshapes attention and effort allocation. Things that are measured feel important; things that are not measured can feel like they do not count, even when everyone knows intellectually that they do.

Alison Davis-Blake, former dean of the University of Michigan's Ross School of Business, observed this in academic research evaluation: when citation counts became a dominant measure of research impact, academic writing subtly shifted toward citation-maximizing strategies — citing authors likely to reciprocate, splitting work into multiple short papers that cite one another, and targeting high-impact journals regardless of whether they were the best fit for the work.

Kahneman (2011) in Thinking, Fast and Slow describes the related phenomenon of what you see is all there is: human cognition systematically overweights visible, measurable information and underweights invisible, immeasurable information. In organizations, this means metrics naturally crowd out judgment — not because anyone decides they should, but because measurable things feel more real than unmeasurable things. The result is that unmeasured dimensions of performance — quality of relationships, depth of learning, long-term strategic thinking — gradually receive less attention simply because they do not appear on the dashboard.


Designing Metrics That Are Harder to Game

No metric is entirely ungameable. The goal is not perfection but making gaming harder, more visible, and less rewarding. Several design principles help.

Measure Outcomes, Not Activities

Activity metrics (calls made, features shipped, hours worked) are the easiest to game because they measure effort that can be performed without producing results. Outcome metrics (revenue generated, bugs in production, customer retention rate) are harder to fake because they require the underlying reality to change.

The tradeoff is that outcome metrics lag. You learn about revenue this quarter, not what caused it. This is why effective measurement systems combine both.

Use Portfolios of Correlated Metrics

If gaming one metric always shows up negatively in another, gaming becomes costly. A call center that measures both AHT and FCR simultaneously creates a natural tension: rushing customers to reduce AHT tends to reduce FCR. Salespeople who inflate pipeline value face scrutiny when close rates fall. Software teams that inflate velocity face consequences when deploy-to-production rates and bug rates are also tracked.

The key is choosing metrics that are genuinely correlated with the outcome and that pull in different directions if gamed.

The DORA metrics (Deployment Frequency, Lead Time for Changes, Mean Time to Restore, and Change Failure Rate) from the DevOps Research and Assessment team provide a well-validated example of a correlated portfolio. Research by Forsgren, Humble, and Kim (2018) in Accelerate demonstrated that high performance on all four metrics simultaneously is strongly associated with organizational performance, and that gaming any one metric in isolation inevitably degrades others — making the portfolio substantially harder to game than any single metric would be.
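One way to operationalize such a portfolio is a check that flags any period in which one metric improves while a correlated metric degrades — the signature of single-metric gaming. The metric names below follow the DORA set; the data structures, threshold logic, and function names are hypothetical:

```python
# Hypothetical portfolio check over DORA-style metrics. For each metric we
# record whether a higher value is better, then flag improvements that
# coincide with degradation elsewhere in the portfolio.
HIGHER_IS_BETTER = {
    "deployment_frequency": True,
    "lead_time_for_changes": False,  # shorter is better
    "mean_time_to_restore": False,   # shorter is better
    "change_failure_rate": False,    # lower is better
}

def improved(metric: str, before: float, after: float) -> bool:
    return (after > before) == HIGHER_IS_BETTER[metric]

def gaming_signals(before: dict, after: dict) -> list[str]:
    """Pair every improved metric with every degraded one for review."""
    gains = [m for m in before if improved(m, before[m], after[m])]
    losses = [m for m in before if not improved(m, before[m], after[m])]
    return [f"{g} improved while {l} degraded" for g in gains for l in losses]

# Deployment frequency doubles, but the failure rate rises with it:
before = {"deployment_frequency": 5, "change_failure_rate": 0.10}
after = {"deployment_frequency": 10, "change_failure_rate": 0.25}
print(gaming_signals(before, after))
# ['deployment_frequency improved while change_failure_rate degraded']
```

A flagged pairing is not proof of gaming, only a prompt for the kind of "why did it improve?" conversation discussed later.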

Build in Qualitative Checks

Quantitative metrics should be accompanied by qualitative assessment that cannot be easily gamed: customer interviews, peer reviews, manager observation, random audits. These are harder to scale but they catch systematic gaming that aggregated numbers miss.

Net Promoter Score (NPS), one of the most widely used customer satisfaction metrics, is notoriously easy to game through survey timing and question framing. Organizations that supplement NPS with regular customer advisory panels, unfiltered support ticket analysis, and customer loss interviews often discover that their headline NPS figures diverge sharply from what their actual retention rates show.

Rotate and Refresh Metrics

Once employees learn to game a metric, the gaming tends to persist indefinitely. Periodically replacing or substantially revising metrics disrupts established gaming strategies and forces renewed attention to underlying performance. This is operationally disruptive but often worth it in contexts where gaming is severe.

Measure at Multiple Levels

Gaming that looks rational at the individual level often becomes irrational at the team or organizational level. Measurement systems that aggregate upward and make team-level patterns visible can expose individual gaming that is invisible in individual data.
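As a hypothetical illustration using the call-center metrics from earlier: each agent's claimed first-call-resolution rate can look fine on its own, while the repeat-contact rate, visible only in team-level aggregation, contradicts the claims. All names, numbers, and the tolerance threshold are invented:

```python
# Hypothetical team-level aggregation exposing individual gaming: agents
# claim high first-call resolution, but aggregated repeat contacts disagree.
agents = {
    "agent_1": {"claimed_fcr": 0.95, "repeat_contacts": 40, "contacts": 100},
    "agent_2": {"claimed_fcr": 0.92, "repeat_contacts": 35, "contacts": 100},
}

team_claimed_fcr = sum(a["claimed_fcr"] for a in agents.values()) / len(agents)
team_repeat_rate = (sum(a["repeat_contacts"] for a in agents.values())
                    / sum(a["contacts"] for a in agents.values()))

# A ~94% claimed resolution rate cannot coexist with a 37.5% repeat-contact
# rate; the 1.10 tolerance is an illustrative threshold, not a standard.
if team_claimed_fcr + team_repeat_rate > 1.10:
    print("claimed FCR is inconsistent with the team's repeat-contact rate")
```

Neither agent's individual numbers would trigger scrutiny; only the aggregate comparison does.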

Separate Incentive and Learning Metrics

One powerful principle from Austin's work: separate the metrics used for performance management (and tied to pay and promotion) from the metrics used for learning and improvement. When a metric is tied to rewards, gaming becomes inevitable. Diagnostic metrics used only by teams for their own improvement are less subject to gaming because the incentive to distort them is weaker.

This principle is one reason why blameless post-mortems — incident reviews in which the goal is learning rather than accountability — produce more honest and useful information than reviews where individuals face consequences for what is discovered. The same information exists in both cases; the incentive to conceal it differs.


Balancing Lag and Lead Indicators

The distinction between lag indicators (what happened) and lead indicators (what predicts what will happen) is fundamental to designing measurement systems that drive improvement rather than game-playing.

  Indicator Type   Examples                                                     Strength                          Weakness
  Lag              Revenue, customer churn, product defects                     Measures actual outcomes          Cannot be acted on in time
  Lead             Prospect meetings, feature cycle time, employee engagement   Enables early course correction   Proxy for the outcome; gameable

Organizations that measure only lag indicators are flying blind until it is too late. Organizations that measure only lead indicators are optimizing proxies and may be gaming their way to metric success while the underlying business deteriorates.

The most effective measurement frameworks, like the Balanced Scorecard (Kaplan and Norton, 1992) and OKRs (Objectives and Key Results, popularized by John Doerr's Measure What Matters, 2018), explicitly combine both: ambitious outcome goals (what we want to achieve) supported by leading activity metrics (how we are tracking toward those goals), with regular review cycles that check whether the lead metrics are actually predicting the outcomes.

Kaplan and Norton (1992), writing in the Harvard Business Review when they introduced the Balanced Scorecard, explicitly addressed the gaming problem: "What you measure is what you get. Senior executives understand that their organization's measurement system strongly affects the behavior of managers and employees." Their framework was designed as a direct response to the observation that single-metric systems, particularly financial metrics alone, consistently produce gaming that degrades the long-term health of the organization while improving short-term numbers.


The Metrics Conversation: What Good Looks Like

Organizations that manage metrics well share several observable characteristics:

Leaders treat metrics as questions, not answers. When a metric improves, the first question is "why did it improve, and does the improvement reflect genuine progress?" rather than "good, the number is up." This habit of interrogating metric movements — both positive and negative — is what keeps measurement systems honest over time.

Teams own their measurement. When teams select, track, and interpret their own metrics, the incentive to game them is weaker because the gaming hurts the team's own learning. When metrics are imposed from above and used primarily for external reporting, the audience for the number is management — and the incentive is to make management happy, not to reflect reality.

The measurement system evolves. Static measurement systems accumulate gaming over time. Organizations that regularly audit whether their metrics are still predicting what they were designed to predict — and that retire or replace metrics that have drifted from their purpose — maintain the integrity of their measurement over the long term.

Qualitative data supplements quantitative data. Numbers compress reality. The most important information about an organization's health is frequently in conversations, interviews, and observations that cannot be easily quantified. Organizations that maintain rich qualitative feedback mechanisms alongside their metrics catch early warning signs that the numbers cannot see.


Conclusion: The Measurement Paradox

Organizations face an inescapable paradox: without measurement, there is no visibility into performance, no way to learn, and no accountability. With measurement, rational actors optimize for measures rather than underlying goals, and the measures gradually lose their validity.

The solution is not to stop measuring. It is to:

  1. Treat all metrics as provisional and imperfect proxies, not as reality itself.
  2. Design measurement portfolios with correlated metrics that make gaming costly.
  3. Separate learning metrics from incentive metrics where possible.
  4. Build in qualitative checks and audits.
  5. Revisit and rotate metrics regularly.
  6. Measure outcomes, not just activities.
  7. Fix the system before blaming the people.

The Cobra Effect is not a sign of employees behaving badly. It is a sign of organizations designing incentive systems without thinking through their second-order effects. Cobras bred for bounty money are as predictable as story points inflated for velocity dashboards. The question is whether leadership will recognize the pattern before the snakes get released.

The deeper lesson is epistemological: every number an organization uses to understand itself is a model of reality, not reality itself. Models are useful precisely because they compress complexity — and dangerous for exactly the same reason. The organization that forgets the map is not the territory, and treats its metrics as ground truth, will find those metrics gradually drifting from any territory that matters.

Frequently Asked Questions

What is Goodhart's Law?

Goodhart's Law states that "when a measure becomes a target, it ceases to be a good measure." Originally formulated by economist Charles Goodhart in 1975 regarding monetary policy, it describes the general phenomenon where optimizing for a metric distorts the underlying reality the metric was designed to track.

What is the Cobra Effect?

The Cobra Effect refers to unintended consequences where a solution worsens the problem it was meant to solve. The term comes from a story about colonial India, where the British government offered bounties for dead cobras to reduce the snake population. Enterprising locals began breeding cobras to collect the bounty, increasing the population. When the program was cancelled, the bred cobras were released, making the problem worse.

How do agile development teams game velocity metrics?

Agile velocity measures story points completed per sprint. When velocity becomes a management target, teams inflate point estimates for new stories, break work into smaller chunks to show more completions, and prioritize easily completed tasks over important but complex ones. The result is that velocity numbers rise while actual output and quality stagnate or decline.

What is the difference between lead and lag indicators?

Lag indicators measure outcomes after they have occurred, such as quarterly revenue or customer churn rate. Lead indicators measure activities that predict future outcomes, such as number of prospect meetings or feature completion rate. Effective measurement systems use both: lag indicators to confirm outcomes and lead indicators to enable course correction before it is too late.

How can organizations design metrics that are harder to game?

Harder-to-game metrics tend to measure outcomes rather than activities, use multiple correlated indicators so gaming one shows up in others, are close to the ultimate goal rather than a proxy, and are periodically rotated or replaced before optimization distorts them. Qualitative measures, customer surveys, and random audits also help catch gaming that quantitative metrics miss.