Metrics Design Checklist: A Comprehensive Guide to Creating Measurements That Actually Work, Resist Gaming, and Drive the Right Behavior

In 2016, Wells Fargo paid $185 million in fines--later growing to over $3 billion in total penalties--after regulators discovered that employees had created millions of fake bank accounts. The root cause was not criminal intent by individual employees. It was a metric: the cross-selling target of eight products per customer, enforced through aggressive performance management that rewarded achievement and punished failure. The metric was well-intentioned (customers with more products are more valuable), easy to measure (count the accounts), and seemingly aligned with business goals. It was also catastrophically designed, creating incentives for fraud at scale.

Wells Fargo's metric failure was not unique. Organizations across every sector--healthcare, education, technology, government, manufacturing--have experienced the destructive effects of poorly designed metrics: metrics that are gamed, metrics that drive perverse behavior, metrics that measure the wrong things, and metrics that destroy the performance they were meant to improve. The common factor in these failures is not bad people; it is bad metric design.

Good metric design is a skill--a learnable, practicable discipline that combines strategic thinking, behavioral psychology, systems understanding, and iterative refinement. This checklist provides a systematic framework for designing metrics that serve their intended purpose, resist gaming, and drive genuinely productive behavior. It is organized as a comprehensive guide that explains not just what to check but why each check matters and how to apply it in practice.


Phase 1: Foundation -- Aligning Metrics with Goals

The most fundamental metric design principle is that metrics should be derived from goals, not from data availability. The most common metric design failure is measuring what is easy to measure rather than what matters. This phase ensures that your metrics are rooted in genuine strategic objectives.

Check 1: Is the Metric Aligned with a Specific Goal?

Every metric should be traceable to a specific, articulated goal. If you cannot answer the question "What goal does this metric serve?" with a clear, specific answer, the metric should not exist.

Why this matters: Metrics that are not aligned with goals become zombie metrics--measurements that consume organizational attention and resources without informing decisions or driving improvement. Organizations commonly accumulate dozens or hundreds of metrics over time, many of which were created for reasons that no longer apply, measure conditions that no longer exist, or serve goals that have since been superseded.

How to apply it: For each proposed metric, complete this sentence: "We are measuring [metric] because it helps us understand progress toward [specific goal]." If you cannot complete the sentence convincingly, the metric is not aligned.

Example: "We are measuring customer Net Promoter Score because it helps us understand progress toward our goal of increasing customer loyalty and reducing churn." This alignment is clear and defensible. By contrast, "We are measuring page views because... we always have" fails the alignment test.

Check 2: Does the Metric Measure Outcomes or Just Outputs?

Outputs are what you produce: reports written, calls made, features shipped, emails sent. Outcomes are the results your outputs achieve: problems solved, customers satisfied, revenue generated, skills developed. Metrics should prioritize outcomes over outputs because outputs without outcomes can mask failure.

A customer support team that resolves 500 tickets per day (output) while customer satisfaction steadily declines (outcome) is producing output without achieving the desired outcome. A marketing team that publishes 20 blog posts per month (output) while organic traffic and lead generation remain flat (outcome) is busy without being effective.

Why this matters: Output metrics are easier to game because they measure activity rather than impact. An employee measured on calls made can make more calls by making shorter, lower-quality calls. An employee measured on customer problems resolved must actually solve problems--a harder metric to game because it requires genuine performance.

How to apply it: For each metric, ask: "If this metric improved but nothing else changed, would we be satisfied?" If the answer is no, the metric measures an output that is not sufficient to indicate success. Look for the outcome metric that the output is supposed to produce, and measure that instead--or at minimum, alongside the output.

Check 3: Is the Metric Measuring the Right Level?

Metrics operate at different levels of abstraction:

  • Activity metrics measure behavior (hours worked, meetings attended, emails sent)
  • Output metrics measure production (features shipped, reports delivered, calls handled)
  • Outcome metrics measure results (customer satisfaction, revenue growth, error reduction)
  • Impact metrics measure long-term effects (market share, employee retention, brand value)

Each level has appropriate uses. Activity metrics are useful for diagnosing process problems. Output metrics are useful for tracking production capacity. Outcome metrics are useful for evaluating effectiveness. Impact metrics are useful for strategic assessment.

The danger is using lower-level metrics (activity, output) as proxies for higher-level goals (outcomes, impact) without verifying that the proxy relationship holds. Measuring hours worked as a proxy for productivity assumes that more hours produce more productive output--an assumption that research consistently contradicts for knowledge work beyond approximately 50 hours per week.


Phase 2: Behavioral Design -- Anticipating Human Responses

The most overlooked aspect of metric design is behavioral: metrics change behavior, and the behavioral change may not be what you intended. This phase requires you to think like a behavioral economist, anticipating how rational actors will respond to the incentives your metrics create.

Check 4: If People Optimize Solely for This Metric, What Would Happen?

This is the Goodhart's Law check. Goodhart's Law, in its widely quoted formulation, states: "When a measure becomes a target, it ceases to be a good measure." The mechanism is that people optimize for the metric rather than for the underlying goal the metric represents, and their optimization strategies may diverge from--or actively undermine--that goal.

How to apply it: Conduct a "metric pre-mortem." Imagine that you have implemented this metric with high stakes (bonuses, promotions, terminations tied to it). Now imagine the most creative, self-interested, lazy, or desperate person in your organization. How would they game this metric? What behavior would achieve the metric's target while failing to achieve the underlying goal?

If the gaming strategy is obvious and easy to execute, the metric is vulnerable. Design complementary metrics or safeguards to close the gaming pathway.

Examples of gaming pathways:

  • Lines of code written: gamed by writing verbose, redundant code; consequence: code quality deteriorates
  • Customer calls handled per hour: gamed by rushing calls without resolving issues; consequence: customer satisfaction drops
  • Defects found per tester: gamed by reporting trivial issues as defects; consequence: real defects are missed amid the noise
  • Time to close support tickets: gamed by closing tickets before resolution; consequence: customers reopen tickets and total handling time increases
  • New accounts opened per employee: gamed by opening accounts without customer consent; consequence: fraud at scale (Wells Fargo)

Check 5: Are There Complementary Metrics That Prevent Gaming?

Single metrics are maximally vulnerable to gaming because they create a single dimension of optimization with no counterbalance. Complementary metrics provide checks and balances by measuring dimensions that gaming would sacrifice.

Why this matters: When you measure sales revenue alone, salespeople may discount aggressively to close deals (sacrificing margin), overpromise to customers (sacrificing satisfaction), or focus on easy wins (sacrificing strategic accounts). Adding complementary metrics--gross margin, customer satisfaction, deal size--creates a multi-dimensional evaluation that makes gaming one metric at the expense of others visible and costly.

How to apply it: For each primary metric, identify at least one complementary metric that measures a dimension likely to be sacrificed if the primary metric is gamed:

  • If measuring quantity, add a quality complement
  • If measuring speed, add an accuracy complement
  • If measuring individual performance, add a team contribution complement
  • If measuring short-term results, add a sustainability complement

The goal is not to create so many metrics that nothing is prioritized. It is to create a balanced small set (typically three to five metrics) where gaming any single metric produces a visible decline in its complement.
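A complementary pair can also be monitored mechanically: flag any period in which the primary metric improves while its complement declines beyond some tolerance. The sketch below assumes simple period-over-period fractional changes; the threshold and the example numbers are illustrative assumptions, not recommendations.

```python
def gaming_flag(primary_change: float, complement_change: float,
                tolerance: float = -0.05) -> bool:
    """Flag a period where the primary metric improves while its
    complement falls by more than the tolerance (default 5 percent)."""
    return primary_change > 0 and complement_change < tolerance

# Tickets closed per day up 20 percent while customer satisfaction is down
# 12 percent: the quantity metric is probably being gamed at quality's expense.
print(gaming_flag(primary_change=0.20, complement_change=-0.12))  # True

# Both moving up together is the pattern a balanced set should show.
print(gaming_flag(primary_change=0.20, complement_change=0.03))   # False
```

The flag does not prove gaming; it surfaces the divergence that a balanced metric set is designed to make visible, so that a human can investigate.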

Check 6: Will This Metric Encourage or Discourage Collaboration?

Metrics that measure individual performance in competitive contexts create incentives to compete with colleagues rather than collaborate. Microsoft's stack-ranking system, which required managers to rate a fixed percentage of employees as underperformers regardless of team performance, famously destroyed collaboration: employees avoided helping colleagues because a colleague's success could come at the expense of their own ranking.

How to apply it: Ask: "Does this metric incentivize helping teammates or competing with them?" If the metric creates zero-sum dynamics where one person's gain is another's loss, consider team-based or collaborative metrics that reward collective achievement.


Phase 3: Technical Robustness -- Ensuring Measurement Quality

A metric can be perfectly aligned with goals and well-designed for behavioral incentives but still fail if the underlying measurement is unreliable, inconsistent, or untrustworthy. This phase addresses the technical quality of the measurement itself.

Check 7: Is the Data Reliable and Consistent?

Reliability means the metric produces consistent results under consistent conditions. If the same performance produces different metric values depending on who measures it, when it is measured, or how it is measured, the metric is unreliable.

Consistency means the metric is measured the same way across the organization over time. If different teams use different definitions, different data sources, or different calculation methods, cross-team comparisons are meaningless and trend analysis is misleading.

How to apply it: Define the metric operationally--specify exactly what data is collected, from what source, using what calculation, at what frequency, and by whom. Document this definition and ensure it is shared and understood by everyone who produces or consumes the metric.
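One way to keep the operational definition shared and consistent is to store it as a single structured record that every producer and consumer references. The sketch below is a minimal illustration; the field names, table, and example metric are assumptions, not a standard schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OperationalDefinition:
    metric: str
    data_source: str   # exactly where the raw data comes from
    calculation: str   # exactly how the value is computed
    frequency: str     # how often it is measured
    owner: str         # who is accountable for producing it

first_response_time = OperationalDefinition(
    metric="median first response time",
    data_source="support_tickets table: created_at and first_reply_at columns",
    calculation="median of (first_reply_at - created_at) over tickets opened in the period",
    frequency="weekly",
    owner="support operations team",
)
```

Two teams reporting "first response time" from this shared definition are measuring the same thing; two teams each writing their own query almost certainly are not.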

Check 8: Is the Metric Simple Enough to Understand?

A metric that is too complex to be understood by the people who are supposed to act on it is useless regardless of its technical sophistication. If a front-line manager cannot explain what the metric measures and why it matters, it will not influence behavior.

Why this matters: The behavioral power of metrics depends on their interpretability. A metric that is clear, intuitive, and actionable directs attention and effort effectively. A metric that requires a statistics degree to interpret generates confusion, distrust, and disengagement.

How to apply it: Test the metric with its intended audience. Can they explain what it measures? Can they identify what actions would improve it? Can they explain why it matters? If the answer to any of these questions is no, the metric needs simplification or better communication.

Check 9: Is the Metric Leading or Lagging?

Lagging indicators measure outcomes that have already occurred--revenue, customer churn, project completion. They tell you what happened but cannot help you change it.

Leading indicators measure activities or conditions that predict future outcomes--sales pipeline, employee engagement, customer complaints. They tell you what is likely to happen and provide opportunity for intervention.

Both types are valuable, but overreliance on lagging indicators is a common design failure. By the time you observe a decline in quarterly revenue (lagging), the decisions that caused the decline were made months ago. Leading indicators provide earlier warning and greater opportunity for course correction.

How to apply it: For each lagging outcome you care about, identify at least one leading indicator that predicts it. Monitor leading indicators frequently (weekly or daily) and lagging indicators less frequently (monthly or quarterly).
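Making the pairing explicit, outcome by outcome, is a useful exercise in itself. The sketch below records each lagging outcome together with a leading indicator and its review cadence; the specific pairings are illustrative examples, not prescriptions.

```python
# Each lagging outcome is paired with a leading indicator and a review cadence.
indicator_pairs = [
    {"lagging": "quarterly revenue",  "leading": "qualified sales pipeline", "cadence": "weekly"},
    {"lagging": "customer churn",     "leading": "support complaint volume", "cadence": "weekly"},
    {"lagging": "employee attrition", "leading": "engagement survey score",  "cadence": "monthly"},
]

for pair in indicator_pairs:
    print(f"Review {pair['leading']} ({pair['cadence']}) "
          f"for early warning on {pair['lagging']}.")
```

Any lagging outcome that cannot be paired with at least one plausible leading indicator is an outcome you will only ever learn about after the fact.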


Phase 4: Implementation -- Deploying Metrics Effectively

Well-designed metrics can still fail if they are implemented poorly. This phase addresses the organizational and practical aspects of metric deployment.

Check 10: How Many Metrics Are You Tracking?

The optimal number of metrics is the smallest number that provides a complete-enough picture. For most contexts, this means three to five key metrics, supplemented by a larger number of diagnostic metrics that are examined only when the key metrics indicate a problem.

Research on cognitive capacity and organizational focus consistently demonstrates that tracking too many metrics dilutes attention, fragments focus, and makes nothing truly actionable. When everything is measured with equal emphasis, nothing is prioritized.

How to apply it: If your dashboard has more than seven metrics displayed with equal prominence, you have too many. Identify the three to five that are most critical to your most important goals. Make these visually prominent and review them frequently. Move remaining metrics to secondary views that are consulted when diagnostic investigation is needed.
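The key/diagnostic split can be written directly into the dashboard configuration so that the limit is enforced rather than merely recommended. The sketch below is a minimal illustration; the metric names are made up, and the validation function is an assumption about how a team might wire the three-to-five guideline into its tooling.

```python
KEY_METRICS = ["net revenue retention", "customer satisfaction", "deployment frequency"]

DIAGNOSTIC_METRICS = [
    "page views", "tickets opened", "average call duration",
    "blog posts published", "code review turnaround",
]

MAX_KEY_METRICS = 5  # the three-to-five guideline from Check 10

def validate_dashboard(key_metrics: list[str]) -> None:
    """Refuse configurations that promote too many metrics to key status."""
    if len(key_metrics) > MAX_KEY_METRICS:
        raise ValueError(
            f"{len(key_metrics)} key metrics configured; the limit is "
            f"{MAX_KEY_METRICS} so that attention stays focused."
        )

validate_dashboard(KEY_METRICS)
```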

Check 11: Are Stakes Proportional to Metric Quality?

The intensity of gaming and distortion is proportional to the stakes attached to the metric. Low-stakes metrics (used for learning and discussion) generate minimal gaming. High-stakes metrics (tied to compensation, promotion, or termination) generate maximal gaming.

Principle: The stakes attached to a metric should be proportional to your confidence in the metric's accuracy, completeness, and resistance to gaming. High stakes should be reserved for metrics that are well-validated, robust against gaming, and measured with high reliability. Newer, less-validated metrics should be used for learning and discussion, not for high-stakes evaluation.

How to apply it: Before tying a metric to compensation or performance evaluation, ask: "How confident are we that this metric accurately represents the performance we care about? How easily can it be gamed? How reliable is the measurement?" If confidence is low, keep stakes low until the metric is validated.

Check 12: Is There a Review and Retirement Process?

Metrics should be living instruments, not permanent fixtures. Conditions change, goals evolve, and metrics that were valuable yesterday may be irrelevant or counterproductive today. Without a review and retirement process, organizations accumulate metric debt--outdated measurements that consume attention and resources without providing value.

When should you retire a metric? When it is being gamed to the point that it no longer reflects genuine performance. When the goal it was designed to serve has been achieved or superseded. When it causes perverse incentives that outweigh its benefits. When better alternatives have been identified.

How to apply it: Schedule periodic metric reviews (quarterly or semi-annually) that examine each metric against the criteria in this checklist. Is it still aligned with current goals? Is it being gamed? Is it producing the intended behavioral effects? Is it worth the cost of collection and reporting? Metrics that fail review should be retired, modified, or replaced.


Phase 5: The Complete Checklist

This consolidated checklist distills the framework into a practical tool that can be used when designing, reviewing, or evaluating any metric:

Foundation

  • Is this metric aligned with a specific, articulated goal?
  • Does it measure outcomes, not just outputs?
  • Is it measuring at the right level (activity/output/outcome/impact)?
  • Can we complete: "We measure this because it tells us about progress toward [specific goal]"?

Behavioral Design

  • Have we conducted a gaming pre-mortem? (If people optimized solely for this, what bad things could happen?)
  • Are there complementary metrics that counterbalance potential gaming?
  • Does this metric encourage collaboration or create zero-sum competition?
  • Have we considered unintended behavioral consequences?

Technical Robustness

  • Is the data source reliable and consistent across the organization?
  • Is the metric simple enough for its intended audience to understand and act on?
  • Is this a leading indicator (actionable) or lagging indicator (informational)?
  • Is the measurement methodology documented and standardized?

Implementation

  • Are we tracking a manageable number of key metrics (3-5)?
  • Are the stakes attached proportional to our confidence in the metric?
  • Is human judgment maintained alongside the metric?
  • Is there a scheduled review and retirement process?

Common Metric Design Anti-Patterns

Understanding what not to do is as valuable as understanding what to do. These anti-patterns recur across organizations and industries:

The Streetlight Effect

Named after the joke about searching for lost keys under a streetlight "because the light is better here," the streetlight effect is measuring what is easy to measure rather than what matters. Digital analytics make it easy to measure page views, clicks, and time-on-site. Whether these metrics reflect genuine user value, satisfaction, or business impact is a separate question that is often not asked.

The Dashboard Syndrome

Dashboard syndrome is the organizational compulsion to display ever-more metrics on ever-larger dashboards, creating the appearance of data-driven management while actually fragmenting attention and diluting focus. A dashboard with 50 metrics is not more informative than one with 5; it is less informative because no single metric receives enough attention to drive action.

The Ratchet Effect

The ratchet effect occurs when targets based on previous performance create ever-increasing expectations that eventually become unachievable without gaming. If this year's target is last year's actual plus 10 percent, and next year's target will be this year's actual plus 10 percent, the targets grow exponentially while performance capacity grows linearly (if at all). The gap between achievable performance and expected performance widens until gaming becomes the only way to meet targets.
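The arithmetic is easy to see with a short calculation: compound the target by 10 percent a year (each year's target is last year's reported actual plus 10 percent, and the reported actual keeps pace on paper), while genuine capacity improves by a small fixed amount. The numbers below are illustrative.

```python
target = 100.0    # this year's target, in arbitrary units
capacity = 100.0  # what the team can genuinely deliver this year

for year in range(1, 6):
    target *= 1.10    # next year's target: this year's (reported) actual plus 10 percent
    capacity += 3.0   # genuine improvement tends to grow roughly linearly, if at all
    print(f"Year {year}: target {target:.1f}, capacity {capacity:.1f}, "
          f"gap {target - capacity:.1f}")
```

After five years the target is about 161 while capacity is 115, and the gap has widened every year; the only way left to close it is to game the number.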

The Proxy Trap

The proxy trap is mistaking a proxy metric for the thing it is supposed to represent. Student test scores are a proxy for learning, not learning itself. Customer satisfaction surveys are a proxy for customer satisfaction, not satisfaction itself. The proxy is always an imperfect representation of the underlying phenomenon, and treating the proxy as if it were the phenomenon leads to optimizing the proxy at the expense of the phenomenon.


Applying the Checklist: A Worked Example

Consider a software development team that wants to measure developer productivity. Here is how the checklist applies:

Proposed metric: Lines of code per developer per day.

Checklist application:

  1. Goal alignment? The goal is understanding developer productivity. Lines of code measures code volume, not productivity. A developer who writes 500 lines of elegant, bug-free code that solves a customer problem is more productive than one who writes 2,000 lines of verbose, buggy code. FAIL.

  2. Outcomes or outputs? Lines of code is an output metric. The outcome we care about is customer value delivered, which could be measured by features shipped, bugs fixed, or customer problems resolved. FAIL.

  3. Gaming pre-mortem? Developers would write verbose code, avoid refactoring, copy-paste instead of abstracting, and never delete unnecessary code. All of these gaming strategies make the codebase worse. FAIL.

  4. Complementary metrics? Could pair with code quality metrics (bug rate, code review scores), but the fundamental metric is so flawed that complementary metrics cannot save it. FAIL.

  5. Understandable? Yes, lines of code is simple to understand. This is the metric's only strength. PASS.

Verdict: Lines of code fails the checklist on multiple dimensions and should not be used as a productivity metric. Better alternatives: features delivered per sprint, cycle time (time from start to completion), customer problems resolved, or deployment frequency.

This worked example demonstrates the checklist's value: it provides a systematic way to identify metric design flaws before the metric is deployed, avoiding the organizational damage that poorly designed metrics produce.
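The same pre-deployment review can be run as a small structured exercise: list the checklist questions, record a pass or fail for each, and refuse to deploy a metric with any failures. The sketch below re-runs the lines-of-code example this way; the questions are paraphrased from the checklist, and the block-on-any-failure rule is an illustrative convention rather than something the checklist mandates.

```python
def review_metric(name: str, checks: dict[str, bool]) -> str:
    """Summarize a checklist review: any failed check blocks deployment."""
    failed = [question for question, passed in checks.items() if not passed]
    if failed:
        return (f"{name}: do not deploy. Failed {len(failed)} check(s): "
                + "; ".join(failed))
    return f"{name}: passes the checklist; deploy, and keep it under periodic review."

verdict = review_metric(
    "lines of code per developer per day",
    {
        "aligned with a specific goal": False,
        "measures outcomes, not just outputs": False,
        "survives a gaming pre-mortem": False,
        "has workable complementary metrics": False,
        "simple enough to understand": True,
    },
)
print(verdict)
```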


The Fundamental Principle

The checklist, extensive as it is, ultimately serves a single principle: metrics are tools for informing human judgment, not replacements for it. The purpose of a metric is to provide visibility into a complex system so that humans can make better decisions. When metrics replace human judgment--when the number becomes the truth rather than a signal about the truth--the metric has exceeded its useful function and become a distortion.

The best metric systems are those where humans look at the metrics, ask "What does this tell us?" and "What doesn't this tell us?", and then make decisions that incorporate the metric alongside other sources of information: qualitative observations, contextual knowledge, ethical considerations, and the unquantifiable dimensions of performance that metrics cannot capture.

Designing metrics well is not about finding the perfect metric. No perfect metric exists. It is about designing metrics that are good enough to be useful, robust enough to resist gaming, simple enough to be understood, and embedded in a human decision-making process that compensates for their inevitable limitations.


References and Further Reading

  1. Muller, J.Z. (2018). The Tyranny of Metrics. Princeton University Press. https://press.princeton.edu/books/hardcover/9780691174952/the-tyranny-of-metrics

  2. Goodhart, C.A.E. (1984). "Problems of Monetary Management: The U.K. Experience." In Monetary Theory and Practice. Macmillan. https://en.wikipedia.org/wiki/Goodhart%27s_law

  3. Doerr, J. (2018). Measure What Matters: How Google, Bono, and the Gates Foundation Rock the World with OKRs. Portfolio. https://www.whatmatters.com/

  4. Kaplan, R.S. & Norton, D.P. (1996). The Balanced Scorecard: Translating Strategy into Action. Harvard Business Review Press. https://hbr.org/1992/01/the-balanced-scorecard-measures-that-drive-performance-2

  5. Forsgren, N., Humble, J. & Kim, G. (2018). Accelerate: The Science of Lean Software and DevOps. IT Revolution Press. https://itrevolution.com/product/accelerate/

  6. Ries, E. (2011). The Lean Startup. Crown Business. https://theleanstartup.com/

  7. Kerr, S. (1975). "On the Folly of Rewarding A, While Hoping for B." Academy of Management Journal, 18(4), 769-783. https://doi.org/10.5465/255378

  8. Bevan, G. & Hood, C. (2006). "What's Measured Is What Matters." Public Administration, 84(3), 517-538. https://doi.org/10.1111/j.1467-9299.2006.00600.x

  9. Ariely, D. (2010). "You Are What You Measure." Harvard Business Review. https://hbr.org/2010/06/column-you-are-what-you-measure

  10. Campbell, D.T. (1979). "Assessing the Impact of Planned Social Change." Evaluation and Program Planning, 2(1), 67-90. https://doi.org/10.1016/0149-7189(79)90048-X

  11. O'Neil, C. (2016). Weapons of Math Destruction. Crown. https://en.wikipedia.org/wiki/Weapons_of_Math_Destruction

  12. Gawande, A. (2009). The Checklist Manifesto. Metropolitan Books. https://en.wikipedia.org/wiki/The_Checklist_Manifesto