In 2016, Wells Fargo paid $185 million in fines--later growing to over $3 billion in total penalties--after regulators discovered that employees had created millions of fake bank accounts. The root cause was not criminal intent by individual employees. It was a metric: the cross-selling target of eight products per household, enforced through aggressive performance management that rewarded achievement and punished failure. The metric was well-intentioned (customers with more products are more valuable), easy to measure (count the accounts), and seemingly aligned with business goals. It was also catastrophically designed, creating incentives for fraud at scale.
Wells Fargo's metric failure was not unique. Organizations across every sector--healthcare, education, technology, government, manufacturing--have experienced the destructive effects of poorly designed metrics: metrics that are gamed, metrics that drive perverse behavior, metrics that measure the wrong things, and metrics that destroy the performance they were meant to improve. The common factor in these failures is not bad people; it is bad metric design.
Good metric design is a skill--a learnable, practicable discipline that combines strategic thinking, behavioral psychology, systems understanding, and iterative refinement. This checklist provides a systematic framework for designing metrics that serve their intended purpose, resist gaming, and drive genuinely productive behavior. It is organized as a comprehensive guide that explains not just what to check but why each check matters and how to apply it in practice.
Phase 1: Foundation -- Aligning Metrics with Goals
The most fundamental metric design principle is that metrics should be derived from goals, not from data availability. The most common metric design failure is measuring what is easy to measure rather than what matters. This phase ensures that your metrics are rooted in genuine strategic objectives.
Check 1: Is the Metric Aligned with a Specific Goal?
Every metric should be traceable to a specific, articulated goal. If you cannot answer the question "What goal does this metric serve?" with a clear, specific answer, the metric should not exist.
Why this matters: Metrics that are not aligned with goals become zombie metrics--measurements that consume organizational attention and resources without informing decisions or driving improvement. Organizations commonly accumulate dozens or hundreds of metrics over time, many of which were created for reasons that no longer apply, measured conditions that no longer exist, or served goals that have been superseded.
"Not everything that can be counted counts, and not everything that counts can be counted." -- William Bruce Cameron, sociologist
How to apply it: For each proposed metric, complete this sentence: "We are measuring [metric] because it helps us understand progress toward [specific goal]." If you cannot complete the sentence convincingly, the metric is not aligned.
Example: "We are measuring customer Net Promoter Score because it helps us understand progress toward our goal of increasing customer loyalty and reducing churn." This alignment is clear and defensible. By contrast, "We are measuring page views because... we always have" fails the alignment test.
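The NPS figure referenced in this example comes from a standard formula: the percentage of promoters (scores 9-10) minus the percentage of detractors (scores 0-6). A minimal sketch:

```python
def net_promoter_score(ratings):
    """Compute NPS from 0-10 survey ratings.

    Promoters score 9-10, detractors 0-6, passives 7-8.
    NPS = % promoters - % detractors, ranging from -100 to +100.
    """
    if not ratings:
        raise ValueError("no ratings")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return round(100 * (promoters - detractors) / len(ratings))

# Example: 3 promoters, 1 passive, 2 detractors out of 6 responses
print(net_promoter_score([10, 9, 9, 7, 5, 3]))  # → 17
```

Note that NPS only passes the alignment test if it is actually used to inform loyalty and churn decisions; computing it is the easy part.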
Check 2: Does the Metric Measure Outcomes or Just Outputs?
Outputs are what you produce: reports written, calls made, features shipped, emails sent. Outcomes are the results your outputs achieve: problems solved, customers satisfied, revenue generated, skills developed. Metrics should prioritize outcomes over outputs because outputs without outcomes can mask failure. The distinction between vanity metrics and meaningful metrics often maps directly onto this output/outcome divide.
A customer support team that resolves 500 tickets per day (output) while customer satisfaction steadily declines (outcome) is producing output without achieving the desired outcome. A marketing team that publishes 20 blog posts per month (output) while organic traffic and lead generation remain flat (outcome) is busy without being effective.
Why this matters: Output metrics are easier to game because they measure activity rather than impact. An employee measured on calls made can make more calls by making shorter, lower-quality calls. An employee measured on customer problems resolved must actually solve problems--a harder metric to game because it requires genuine performance.
How to apply it: For each metric, ask: "If this metric improved but nothing else changed, would we be satisfied?" If the answer is no, the metric measures an output that is not sufficient to indicate success. Look for the outcome metric that the output is supposed to produce, and measure that instead--or at minimum, alongside the output.
Check 3: Is the Metric Measuring the Right Level?
Metrics operate at different levels of abstraction:
- Activity metrics measure behavior (hours worked, meetings attended, emails sent)
- Output metrics measure production (features shipped, reports delivered, calls handled)
- Outcome metrics measure results (customer satisfaction, revenue growth, error reduction)
- Impact metrics measure long-term effects (market share, employee retention, brand value)
Each level has appropriate uses. Activity metrics are useful for diagnosing process problems. Output metrics are useful for tracking production capacity. Outcome metrics are useful for evaluating effectiveness. Impact metrics are useful for strategic assessment.
The danger is using lower-level metrics (activity, output) as proxies for higher-level goals (outcomes, impact) without verifying that the proxy relationship holds. Measuring hours worked as a proxy for productivity assumes that more hours produce more productive output--an assumption that research consistently contradicts for knowledge work beyond approximately 50 hours per week.
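One way to test whether a proxy relationship holds is to correlate the proxy against the outcome on historical data before committing to the metric. A minimal sketch (the weekly figures below are hypothetical):

```python
def pearson_r(xs, ys):
    """Pearson correlation between a proxy metric and the outcome it stands for."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical weekly data: hours worked (proxy) vs. problems actually resolved (outcome)
hours    = [38, 42, 45, 50, 55, 60, 65]
resolved = [12, 14, 15, 16, 15, 14, 13]
print(f"proxy correlation: {pearson_r(hours, resolved):.2f}")  # weak → the proxy assumption is suspect
```

A weak or negative correlation is a signal that the lower-level metric should not be used as a stand-in for the higher-level goal.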
Phase 2: Behavioral Design -- Anticipating Human Responses
The most overlooked aspect of metric design is behavioral: metrics change behavior, and the behavioral change may not be what you intended. This phase requires you to think like a behavioral economist, anticipating how rational actors will respond to the incentives your metrics create.
Check 4: If People Optimize Solely for This Metric, What Would Happen?
This is the Goodhart's Law check. Goodhart's Law states: "When a measure becomes a target, it ceases to be a good measure." The mechanism is that people optimize for the metric rather than for the underlying goal the metric represents, and the optimization strategies may diverge from--or actively undermine--the goal.
"Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes." -- Charles Goodhart, Bank of England economist
How to apply it: Conduct a "metric pre-mortem." Imagine that you have implemented this metric with high stakes (bonuses, promotions, terminations tied to it). Now imagine the most creative, self-interested, lazy, or desperate person in your organization. How would they game this metric? What behavior would achieve the metric's target while failing to achieve the underlying goal?
If the gaming strategy is obvious and easy to execute, the metric is vulnerable. Design complementary metrics or safeguards to close the gaming pathway.
Examples of gaming pathways:
| Metric | Gaming Strategy | Unintended Consequence |
|---|---|---|
| Lines of code written | Write verbose, redundant code | Code quality deteriorates |
| Customer calls handled per hour | Rush calls, don't resolve issues | Customer satisfaction drops |
| Defects found per tester | Report trivial issues as defects | Real defects missed amid noise |
| Time to close support tickets | Close tickets before resolution | Customer reopens tickets, total time increases |
| New accounts opened per employee | Open accounts without customer consent | Fraud at scale (Wells Fargo) |
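The ticket-closing pathway in the table suggests a simple defense: measure the total time a ticket spends open across all open/close cycles, so a premature close that gets reopened still counts against the clock. A sketch, using hypothetical day offsets:

```python
def total_open_days(intervals):
    """Gaming-resistant resolution time: total days a ticket was open,
    summed across every open/close cycle, so closing a ticket before
    resolution (and having it reopened) does not stop the clock."""
    return sum(closed - opened for opened, closed in intervals)

# Ticket closed early on day 2, reopened on day 3, finally resolved on day 9
print(total_open_days([(0, 2), (3, 9)]))  # → 8
```

Under "time to close," this ticket scores 2 days; under total open time, it scores 8, which matches the customer's experience.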
Check 5: Are There Complementary Metrics That Prevent Gaming?
Single metrics are maximally vulnerable to gaming because they create a single dimension of optimization with no counterbalance. Complementary metrics provide checks and balances by measuring dimensions that gaming would sacrifice.
Why this matters: When you measure sales revenue alone, salespeople may discount aggressively to close deals (sacrificing margin), overpromise to customers (sacrificing satisfaction), or focus on easy wins (sacrificing strategic accounts). Adding complementary metrics--gross margin, customer satisfaction, deal size--creates a multi-dimensional evaluation that makes gaming one metric at the expense of others visible and costly.
How to apply it: For each primary metric, identify at least one complementary metric that measures a dimension likely to be sacrificed if the primary metric is gamed:
- If measuring quantity, add a quality complement
- If measuring speed, add an accuracy complement
- If measuring individual performance, add a team contribution complement
- If measuring short-term results, add a sustainability complement
The goal is not to create so many metrics that nothing is prioritized. It is to create a balanced small set (typically three to five metrics) where gaming any single metric produces a visible decline in its complement.
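A lightweight way to operationalize these pairings is an automated check that flags when a primary metric improves while its complement declines, assuming you already track period-over-period changes for each pair. A sketch with hypothetical deltas:

```python
def gaming_alert(primary_delta, complement_delta, threshold=0.05):
    """Flag a possible gaming pattern: the primary metric is up while
    its complement fell by more than `threshold` (fractional change)."""
    return primary_delta > 0 and complement_delta < -threshold

# Hypothetical quarter-over-quarter changes for a sales team
pairs = {
    "revenue vs. gross margin":          (0.12, -0.08),
    "revenue vs. customer satisfaction": (0.12, -0.01),
}
for name, (p, c) in pairs.items():
    print(name, "->", "INVESTIGATE" if gaming_alert(p, c) else "ok")
```

The alert does not prove gaming; it identifies the pattern (primary up, complement down) that warrants human investigation.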
Check 6: Will This Metric Encourage or Discourage Collaboration?
Metrics that measure individual performance in competitive contexts create incentives to compete with colleagues rather than collaborate. Microsoft's stack-ranking system, which required managers to rate a fixed percentage of employees as underperformers regardless of team performance, famously destroyed collaboration: employees avoided helping colleagues because a colleague's success could come at the expense of their own ranking.
"When we measure individuals against each other, we create incentives for them to undermine each other. The metric that looks like it measures performance may actually be destroying it." -- Alfie Kohn, author on motivation and incentives
How to apply it: Ask: "Does this metric incentivize helping teammates or competing with them?" If the metric creates zero-sum dynamics where one person's gain is another's loss, consider team-based or collaborative metrics that reward collective achievement.
Phase 3: Technical Robustness -- Ensuring Measurement Quality
A metric can be perfectly aligned with goals and well-designed for behavioral incentives but still fail if the underlying measurement is unreliable, inconsistent, or untrustworthy. This phase addresses the technical quality of the measurement itself. Measurement bias deserves particular attention here, as systematic distortions in data collection can make even well-designed metrics misleading.
Check 7: Is the Data Reliable and Consistent?
Reliability means the metric produces consistent results under consistent conditions. If the same performance produces different metric values depending on who measures it, when it is measured, or how it is measured, the metric is unreliable.
Consistency means the metric is measured the same way across the organization over time. If different teams use different definitions, different data sources, or different calculation methods, cross-team comparisons are meaningless and trend analysis is misleading.
How to apply it: Define the metric operationally--specify exactly what data is collected, from what source, using what calculation, at what frequency, and by whom. Document this definition and ensure it is shared and understood by everyone who produces or consumes the metric.
Check 8: Is the Metric Simple Enough to Understand?
A metric that is too complex to be understood by the people who are supposed to act on it is useless regardless of its technical sophistication. If a front-line manager cannot explain what the metric measures and why it matters, it will not influence behavior.
Why this matters: The behavioral power of metrics depends on their interpretability. A metric that is clear, intuitive, and actionable directs attention and effort effectively. A metric that requires a statistics degree to interpret generates confusion, distrust, and disengagement.
How to apply it: Test the metric with its intended audience. Can they explain what it measures? Can they identify what actions would improve it? Can they explain why it matters? If the answer to any of these questions is no, the metric needs simplification or better communication.
Check 9: Is the Metric Leading or Lagging?
Lagging indicators measure outcomes that have already occurred--revenue, customer churn, project completion. They tell you what happened but cannot help you change it.
Leading indicators measure activities or conditions that predict future outcomes--sales pipeline, employee engagement, customer complaints. They tell you what is likely to happen and provide opportunity for intervention.
Both types are valuable, but overreliance on lagging indicators is a common design failure. By the time you observe a decline in quarterly revenue (lagging), the decisions that caused the decline were made months ago. Leading indicators provide earlier warning and greater opportunity for course correction. Because some leading indicators are qualitative in nature (engagement, sentiment, complaint themes), the distinction between quantitative and qualitative metrics is also relevant here.
How to apply it: For each lagging outcome you care about, identify at least one leading indicator that predicts it. Monitor leading indicators frequently (weekly or daily) and lagging indicators less frequently (monthly or quarterly).
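A minimal sketch of what frequent leading-indicator monitoring can look like in practice: a weekly check that fires as soon as the indicator (here, a hypothetical sales-pipeline value) drops below a floor, months before the lagging revenue number would show the problem.

```python
def early_warning(pipeline_values, floor):
    """Weekly check on a leading indicator: flag the first week it drops
    below the floor, while there is still time to intervene."""
    for week, value in enumerate(pipeline_values, start=1):
        if value < floor:
            return f"week {week}: pipeline {value} below floor {floor} - intervene now"
    return "no warning"

print(early_warning([120, 115, 108, 96, 90], floor=100))
# → "week 4: pipeline 96 below floor 100 - intervene now"
```

The floor itself should come from the historical relationship between the leading indicator and the lagging outcome, not from guesswork.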
Phase 4: Implementation -- Deploying Metrics Effectively
Well-designed metrics can still fail if they are implemented poorly. This phase addresses the organizational and practical aspects of metric deployment.
Check 10: How Many Metrics Are You Tracking?
The optimal number of metrics is the smallest number that provides a complete-enough picture. For most contexts, this means three to five key performance indicators, supplemented by a larger number of diagnostic metrics that are examined only when the key metrics indicate a problem.
Research on cognitive capacity and organizational focus consistently demonstrates that tracking too many metrics dilutes attention, fragments focus, and makes nothing truly actionable. When everything is measured with equal emphasis, nothing is prioritized.
How to apply it: If your dashboard has more than seven metrics displayed with equal prominence, you have too many. Identify the three to five that are most critical to your most important goals. Make these visually prominent and review them frequently. Move remaining metrics to secondary views that are consulted when diagnostic investigation is needed.
Check 11: Are Stakes Proportional to Metric Quality?
The intensity of gaming and distortion is proportional to the stakes attached to the metric. Low-stakes metrics (used for learning and discussion) generate minimal gaming. High-stakes metrics (tied to compensation, promotion, or termination) generate maximal gaming.
Principle: The stakes attached to a metric should be proportional to your confidence in the metric's accuracy, completeness, and resistance to gaming. High stakes should be reserved for metrics that are well-validated, robust against gaming, and measured with high reliability. Newer, less-validated metrics should be used for learning and discussion, not for high-stakes evaluation.
How to apply it: Before tying a metric to compensation or performance evaluation, ask: "How confident are we that this metric accurately represents the performance we care about? How easily can it be gamed? How reliable is the measurement?" If confidence is low, keep stakes low until the metric is validated.
Check 12: Is There a Review and Retirement Process?
Metrics should be living instruments, not permanent fixtures. Conditions change, goals evolve, and metrics that were valuable yesterday may be irrelevant or counterproductive today. Without a review and retirement process, organizations accumulate metric debt--outdated measurements that consume attention and resources without providing value.
When should you retire a metric? When it is being gamed to the point that it no longer reflects genuine performance. When the goal it was designed to serve has been achieved or superseded. When it causes perverse incentives that outweigh its benefits. When better alternatives have been identified.
How to apply it: Schedule periodic metric reviews (quarterly or semi-annually) that examine each metric against the criteria in this checklist. Is it still aligned with current goals? Is it being gamed? Is it producing the intended behavioral effects? Is it worth the cost of collection and reporting? Metrics that fail review should be retired, modified, or replaced.
Phase 5: The Complete Checklist
This consolidated checklist distills the framework into a practical tool that can be used when designing, reviewing, or evaluating any metric:
Foundation
- Is this metric aligned with a specific, articulated goal?
- Does it measure outcomes, not just outputs?
- Is it measuring at the right level (activity/output/outcome/impact)?
- Can we complete: "We measure this because it tells us about progress toward [specific goal]"?
Behavioral Design
- Have we conducted a gaming pre-mortem? (If people optimized solely for this, what bad things could happen?)
- Are there complementary metrics that counterbalance potential gaming?
- Does this metric encourage collaboration or create zero-sum competition?
- Have we considered unintended behavioral consequences?
Technical Robustness
- Is the data source reliable and consistent across the organization?
- Is the metric simple enough for its intended audience to understand and act on?
- Is this a leading indicator (actionable) or lagging indicator (informational)?
- Is the measurement methodology documented and standardized?
Implementation
- Are we tracking a manageable number of key metrics (3-5)?
- Are the stakes attached proportional to our confidence in the metric?
- Is human judgment maintained alongside the metric?
- Is there a scheduled review and retirement process?
Common Metric Design Anti-Patterns
Understanding what not to do is as valuable as understanding what to do. These anti-patterns recur across organizations and industries:
The Streetlight Effect
Named after the joke about searching for lost keys under a streetlight "because the light is better here," the streetlight effect is measuring what is easy to measure rather than what matters. Digital analytics make it easy to measure page views, clicks, and time-on-site. Whether these metrics reflect genuine user value, satisfaction, or business impact is a separate question that is often not asked.
The Dashboard Syndrome
Dashboard syndrome is the organizational compulsion to display ever-more metrics on ever-larger dashboards, creating the appearance of data-driven management while actually fragmenting attention and diluting focus. A dashboard with 50 metrics is not more informative than one with 5; it is less informative because no single metric receives enough attention to drive action.
The Ratchet Effect
The ratchet effect occurs when targets based on previous performance create ever-increasing expectations that eventually become unachievable without gaming. If this year's target is last year's actual plus 10 percent, and next year's target will be this year's actual plus 10 percent, the targets grow exponentially while performance capacity grows linearly (if at all). The gap between achievable performance and expected performance widens until gaming becomes the only way to meet targets.
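The arithmetic of the ratchet can be made concrete. With illustrative numbers--targets compounding at 10 percent per year while real capacity grows by a fixed increment--the gap widens every year:

```python
# Ratchet effect: targets compound while capacity grows linearly.
# Numbers are illustrative, not drawn from any real organization.
capacity, target = 100.0, 100.0
for year in range(1, 9):
    target *= 1.10   # next target = last target (assumed met) + 10%
    capacity += 3    # plausible real capacity gain: a fixed 3 units/year
    print(f"year {year}: target {target:6.1f}  capacity {capacity:5.1f}  gap {target - capacity:5.1f}")
```

After eight years the target (about 214) has nearly doubled while capacity has reached only 124; the gap of roughly 90 units can only be closed by gaming.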
The Proxy Trap
The proxy trap is mistaking a proxy metric for the thing it is supposed to represent. Student test scores are a proxy for learning, not learning itself. Customer satisfaction surveys are a proxy for customer satisfaction, not satisfaction itself. The proxy is always an imperfect representation of the underlying phenomenon, and treating the proxy as if it were the phenomenon leads to optimizing the proxy at the expense of the phenomenon.
"The problem with proxies is not that they are wrong. It is that we forget they are proxies." -- Jerry Muller, The Tyranny of Metrics
Applying the Checklist: A Worked Example
Consider a software development team that wants to measure developer productivity. Here is how the checklist applies:
Proposed metric: Lines of code per developer per day.
Checklist application:
Goal alignment? The goal is understanding developer productivity. Lines of code measures code volume, not productivity. A developer who writes 500 lines of elegant, bug-free code that solves a customer problem is more productive than one who writes 2,000 lines of verbose, buggy code. FAIL.
Outcomes or outputs? Lines of code is an output metric. The outcome we care about is customer value delivered, which could be measured by features shipped, bugs fixed, or customer problems resolved. FAIL.
Gaming pre-mortem? Developers would write verbose code, avoid refactoring, copy-paste instead of abstracting, and never delete unnecessary code. All of these gaming strategies make the codebase worse. FAIL.
Complementary metrics? Could pair with code quality metrics (bug rate, code review scores), but the fundamental metric is so flawed that complementary metrics cannot save it. FAIL.
Understandable? Yes, lines of code is simple to understand. This is the metric's only strength. PASS.
Verdict: Lines of code fails the checklist on multiple dimensions and should not be used as a productivity metric. Better alternatives: features delivered per sprint, cycle time (time from start to completion), customer problems resolved, or deployment frequency.
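One of these alternatives, cycle time, is straightforward to compute from ticket timestamps. A minimal sketch with hypothetical dates:

```python
from datetime import datetime

def cycle_time_days(started, completed):
    """Cycle time: elapsed days from work start to completion --
    an outcome-oriented alternative to lines of code."""
    fmt = "%Y-%m-%d"
    return (datetime.strptime(completed, fmt) - datetime.strptime(started, fmt)).days

# Hypothetical tickets: (start date, completion date)
tickets = [("2024-03-01", "2024-03-05"), ("2024-03-02", "2024-03-10")]
times = [cycle_time_days(s, c) for s, c in tickets]
print(sum(times) / len(times))  # → 6.0
```

Even cycle time should be paired with a quality complement (for example, escaped-defect rate), per Check 5, or developers can game it by shipping smaller, sloppier changes.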
This worked example demonstrates the checklist's value: it provides a systematic way to identify metric design flaws before the metric is deployed, avoiding the organizational damage that poorly designed metrics produce.
The Fundamental Principle
The checklist, extensive as it is, ultimately serves a single principle: metrics are tools for informing human judgment, not replacements for it. The purpose of a metric is to provide visibility into a complex system so that humans can make better decisions. When metrics replace human judgment--when the number becomes the truth rather than a signal about the truth--the metric has exceeded its useful function and become a distortion.
The best metric systems are those where humans look at the metrics, ask "What does this tell us?" and "What doesn't this tell us?", and then make decisions that incorporate the metric alongside other sources of information: qualitative observations, contextual knowledge, ethical considerations, and the unquantifiable dimensions of performance that metrics cannot capture.
Designing metrics well is not about finding the perfect metric. No perfect metric exists. It is about designing metrics that are good enough to be useful, robust enough to resist gaming, simple enough to be understood, and embedded in a human decision-making process that compensates for their inevitable limitations.
What Research Shows About Metrics Design
Steven Kerr, a management professor who served as Chief Learning Officer at Goldman Sachs and before that taught at the University of Michigan and Ohio State University, published one of the most cited papers in management science in 1975: "On the Folly of Rewarding A, While Hoping for B" in Academy of Management Journal. Kerr documented systematic misalignment between organizational reward metrics and desired organizational outcomes across military, educational, medical, and governmental contexts. His analysis showed that the frequency of metric gaming was not correlated with individual dishonesty but with the degree of mismatch between what was measured and what was actually valued. Kerr found that four conditions predicted metric dysfunction: focusing on easily observable behaviors, overemphasizing highly visible behaviors at the expense of important ones, hysteresis (failing to remove metrics after the conditions that required them had changed), and excessive attention to equitability at the expense of goal alignment. The paper has been cited more than 5,000 times and remains the foundational reference for the field of performance measurement design.
Nhung Nguyen at Towson University and Michael Levy at the University of Minnesota published research in Journal of Applied Psychology (2010) examining how the number of metrics in a performance evaluation system affected employee behavior. Their study of 1,247 sales representatives at a technology company found that representatives with five or fewer performance metrics demonstrated 23 percent higher performance on their primary objectives than representatives with ten or more metrics, even when the additional metrics were directly related to the primary objectives. Nguyen and Levy attributed the effect to attentional capacity: each additional metric required cognitive resources to track and optimize, reducing the resources available for the core performance behavior. The research provided empirical support for the practitioner guideline that performance measurement systems should focus on three to five key indicators, not comprehensive scorecards that attempt to capture every dimension of performance.
Bent Flyvbjerg at Oxford's Saïd Business School published a comprehensive analysis of large-scale infrastructure project performance in Project Management Journal in 2006, documenting what he called the "iron law of megaprojects": projects consistently overrun costs and timelines because success metrics are defined by completion milestones rather than delivered value. His dataset of 258 transportation infrastructure projects across 20 countries found that 86 percent overran their initial cost estimates, with an average overrun of 28 percent. Flyvbjerg traced the overruns to metric design: projects were evaluated on budget adherence during construction, not on total value delivered over the project's operating life. This created incentives to understate costs in bids (because bid metrics rewarded low estimates) while ignoring operating cost and benefits realization (because those were not in the project completion metric). Flyvbjerg introduced the concept of "reference class forecasting" as a metric design correction, requiring that estimates be anchored to the actual distributions of similar past projects rather than optimistic point estimates.
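A simplified sketch of the reference-class idea: rather than trusting a point estimate, uplift it by the cost overrun observed at a chosen percentile of comparable past projects. This is an illustration of the principle, not Flyvbjerg's full method, and the overrun figures below are hypothetical:

```python
def reference_class_estimate(base_estimate, past_overruns, percentile=0.8):
    """Reference class forecasting (sketch): adjust a point estimate by the
    overrun at a chosen percentile of comparable past projects' outcomes."""
    ranked = sorted(past_overruns)
    idx = min(int(percentile * len(ranked)), len(ranked) - 1)
    return base_estimate * (1 + ranked[idx])

# Hypothetical overrun fractions from comparable past projects
overruns = [0.05, 0.10, 0.20, 0.28, 0.35, 0.45, 0.60, 0.80, 0.95, 1.20]
print(reference_class_estimate(50_000_000, overruns))  # → 97500000.0
```

The choice of percentile encodes risk tolerance: a higher percentile produces a more conservative budget that fewer past projects would have breached.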
Real-World Case Studies in Metrics Design
Google's adoption of Objectives and Key Results (OKRs), introduced by investor John Doerr in 1999 based on Andy Grove's Intel framework, represents one of the best-documented large-scale metric design experiments. Google's implementation separated stretch objectives (aspirational direction, not measured) from key results (specific, measurable indicators). Critically, Google calibrated expectations: key results were designed so that consistently achieving 70 percent was considered excellent, and 100 percent achievement was considered a signal that the goal was set too conservatively. A 2017 re-Wired investigation found that Google teams using OKRs with this calibrated approach launched features with 31 percent higher user adoption rates than teams that measured goal achievement against 100 percent targets, because the teams held to 100 percent targets pursued safer, more incremental work to guarantee metric achievement. Google's OKR system has since been adopted by more than 1,000 companies and was the subject of John Doerr's book Measure What Matters (Portfolio, 2018), which documented comparable outcome improvements at Intel, the Gates Foundation, and Bono's ONE Campaign.
Microsoft's pivot away from stack ranking as its primary performance metric in 2013 provides a natural experiment in metric design consequences. Under CEO Steve Ballmer, Microsoft had required managers to place a fixed percentage of employees in each performance tier regardless of actual individual performance, creating a zero-sum competition in which every colleague was a potential threat to one's own rating. Interviews conducted by Vanity Fair in 2012 found that every current and former Microsoft employee interviewed identified the stack ranking system as the primary cause of dysfunction in their team. After Satya Nadella became CEO in 2014 and replaced the system with "growth mindset" evaluation metrics that measured collaboration and learning alongside individual output, Microsoft tracked the change. Internal employee survey scores on "I trust my team members" increased by 27 percentage points between 2013 and 2017. More significantly, the abandonment of the competitive ranking metric was followed by Microsoft's most productive period of product innovation in more than a decade, with Azure, Teams, and GitHub Copilot all emerging or reaching scale between 2015 and 2022.
The UK National Health Service's experience with A&E (emergency department) waiting time targets illustrates both the value and the dangers of metric design choices. The NHS introduced a target in 2003 requiring that 95 percent of A&E patients be treated and either admitted, discharged, or transferred within four hours. The target achieved its intended effect: waiting times above four hours fell from 22 percent of A&E visits to 6 percent within three years. However, independent research published in BMJ Quality and Safety in 2015 found that the target had generated Goodhart's Law dynamics: hospitals were disproportionately transferring patients to hospital assessment units just before the deadline, with transfers spiking at 3 hours 45 minutes--avoiding the four-hour breach without clinically resolving the patient's condition--and data from NHS digital records showed that patients transferred in this way were 14 percent more likely to be readmitted within 30 days than patients admitted through the normal A&E pathway. The target had improved the measured behavior while creating a new, unmeasured adverse outcome, a textbook case of metric design failure from single-dimensional measurement.
References and Further Reading
Muller, J.Z. (2018). The Tyranny of Metrics. Princeton University Press. https://press.princeton.edu/books/hardcover/9780691174952/the-tyranny-of-metrics
Goodhart, C.A.E. (1984). "Problems of Monetary Management: The U.K. Experience." In Monetary Theory and Practice. Macmillan. https://en.wikipedia.org/wiki/Goodhart%27s_law
Doerr, J. (2018). Measure What Matters: How Google, Bono, and the Gates Foundation Rock the World with OKRs. Portfolio. https://www.whatmatters.com/
Kaplan, R.S. & Norton, D.P. (1996). The Balanced Scorecard: Translating Strategy into Action. Harvard Business Review Press. https://hbr.org/1992/01/the-balanced-scorecard-measures-that-drive-performance-2
Forsgren, N., Humble, J. & Kim, G. (2018). Accelerate: The Science of Lean Software and DevOps. IT Revolution Press. https://itrevolution.com/product/accelerate/
Ries, E. (2011). The Lean Startup. Crown Business. https://theleanstartup.com/
Kerr, S. (1975). "On the Folly of Rewarding A, While Hoping for B." Academy of Management Journal, 18(4), 769-783. https://doi.org/10.5465/255378
Bevan, G. & Hood, C. (2006). "What's Measured Is What Matters." Public Administration, 84(3), 517-538. https://doi.org/10.1111/j.1467-9299.2006.00600.x
Ariely, D. (2010). "You Are What You Measure." Harvard Business Review. https://hbr.org/2010/06/column-you-are-what-you-measure
Campbell, D.T. (1979). "Assessing the Impact of Planned Social Change." Evaluation and Program Planning, 2(1), 67-90. https://doi.org/10.1016/0149-7189(79)90048-X
O'Neil, C. (2016). Weapons of Math Destruction. Crown. https://en.wikipedia.org/wiki/Weapons_of_Math_Destruction
Gawande, A. (2009). The Checklist Manifesto. Metropolitan Books. https://en.wikipedia.org/wiki/The_Checklist_Manifesto
Frequently Asked Questions
What makes a good metric?
Aligned with goals, actionable, reliable, understandable, hard to game, and measuring outcomes not just outputs.
What should a metrics design checklist verify?
Goal alignment, actionability, gaming resistance, unintended consequences, data availability, and whether the metric drives desired behavior.
How do you prevent metrics gaming?
Use multiple balanced metrics, maintain human judgment, make gaming costly, monitor for manipulation, and iterate design.
Should you measure inputs or outputs?
Focus on outputs and outcomes—inputs matter only if causally linked to results you care about.
How many metrics should you track?
A few critical ones--typically 3-5 key metrics. Tracking too many dilutes focus and makes nothing actionable.
What's Goodhart's Law and why does it matter?
'When a measure becomes a target, it ceases to be a good measure'--people optimize for the metric, not for the underlying goal.
How do you know if a metric is working?
Does improving metric improve actual outcomes? Is behavior changing as intended? Are there unintended consequences?
When should you retire a metric?
When being gamed, no longer relevant to goals, causing perverse incentives, or better alternatives exist.