
Mar 26, 2026 · Ragan McGill · Better
The Measurement Trap: Why Most Engineering Metrics Make Things Worse

There is a seductive logic to measurement: if you can see it, you can manage it. If you can manage it, you can improve it. Metrics are therefore good. More metrics are better. Comprehensive dashboards are best.

This logic is wrong - not occasionally, but systematically. And engineering organisations that follow it uncritically end up with measurement systems that actively damage the things they're trying to improve.

Here's what's actually happening, and what to do instead.


Goodhart's Law Is Running Your Engineering Organisation

Charles Goodhart, a British economist, observed in the 1970s that any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes. Marilyn Strathern later sharpened this into the form most people know: when a measure becomes a target, it ceases to be a good measure.

This is not a theoretical concern. It is the operating condition of most engineering measurement programmes.

Velocity targets produce inflated estimates. Code coverage targets produce tests that execute code without asserting anything meaningful. Deployment frequency targets produce meaningless releases - empty commits, documentation-only changes, whitespace updates - that satisfy the metric without improving the system. Bug count targets produce reclassified bugs.

The metric is gamed not because engineers are dishonest, but because organisations have created a system where gaming the metric is rational. The responsibility for this lies with the people who designed the measurement system, not the people responding to it.


The Proxy Problem

Every engineering metric is a proxy. It measures something that correlates with what you actually care about - quality, speed, reliability, customer value - but is not the thing itself.

Proxies are useful when the correlation holds. They become dangerous when the proxy diverges from the underlying reality - and measurement pressure almost always creates this divergence.

Test coverage correlates with quality when coverage reflects genuine scenario testing. When coverage is a target, it reflects the minimum viable assertion to increment the number. The correlation breaks. The metric reads healthy. The quality doesn't.

Sprint velocity correlates with throughput when estimates are honest. When velocity is tracked and compared, estimates drift to manage the expectation. The correlation breaks. The number looks good. The actual delivery rate is unknown.

A metric that has diverged from its underlying reality is worse than no metric. It provides false confidence, redirects attention from the real signal, and creates perverse incentives. This is the measurement trap: not that you're measuring the wrong things, but that the act of measuring changes the behaviour in ways that make the measurement meaningless.


The Right Questions Before Any Metric

Before instrumenting any metric, I ask three questions:

1. What behaviour will this metric incentivise when it becomes a target?

Because it will become a target. Even if you state clearly that it's informational, not evaluative - once it appears on a dashboard that leadership sees, it becomes a target. Design the metric for the world where it's being managed, not the world where it's merely observed.

2. How would I know if this metric has been gamed?

If you can't answer this, you won't see it happening. Every metric needs a corresponding verification signal - something that would degrade or diverge if the primary metric were being managed without genuine improvement.
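One way to make this concrete is to watch a primary metric and its verification signal together. As a minimal, hypothetical sketch (the helper name, window size, and example series are all illustrative, not a real tool): if the primary metric trends up while its verifier trends down over the same window, that divergence is a hint the metric is being managed rather than genuinely improved.

```python
def diverging(primary_series, verifier_series, window=6):
    """Flag when the primary metric improves while its verification
    signal degrades over the same trailing window.
    Hypothetical helper: both series are oldest-to-newest samples."""
    primary_trend = primary_series[-1] - primary_series[-window]
    verifier_trend = verifier_series[-1] - verifier_series[-window]
    return primary_trend > 0 and verifier_trend < 0

# Illustrative numbers: coverage climbs while the share of defects
# caught before release falls - the correlation has broken.
coverage = [70, 72, 75, 78, 82, 85, 88]
defects_caught_pre_release = [0.80, 0.78, 0.74, 0.70, 0.66, 0.60, 0.55]

print(diverging(coverage, defects_caught_pre_release))  # True
```

The point is not this particular helper but the pairing: every metric on the dashboard gets a named verifier, and the pair is reviewed together.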

3. What decision will this metric inform?

"It's useful to track" is not a sufficient answer. If a metric doesn't change what you'd do - if the number could be 10% higher or lower without affecting any decision - it should not be tracked. Measurement has a cost: cognitive load, political overhead, and the law of unintended consequences. Only measure what you will act on.


The Metrics That Hold Up

Not all metrics are equally susceptible to gaming. The most durable engineering metrics share a characteristic: they are hard to improve without genuinely improving the underlying capability.

Cycle time - the time from work starting to work being in production - is relatively hard to game. You can cherry-pick easy items to improve the average, but the distribution tells the truth. The P90 cycle time is almost always more revealing than the median.
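To see why the distribution matters, here is a small sketch with made-up cycle times (the numbers are illustrative, not from any real team). Cherry-picking easy items keeps the median flattering; the P90 still exposes the slow tail.

```python
import statistics

def percentile(values, p):
    """Nearest-rank percentile of a sample (0 < p <= 100)."""
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical cycle times in days for a team's last 20 completed items.
cycle_times = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4,
               5, 5, 6, 7, 8, 10, 14, 18, 25, 40]

print(statistics.median(cycle_times))   # 4.5 - reads as healthy
print(percentile(cycle_times, 90))      # 18  - the tail tells the truth
```

A team whose median is a few days but whose P90 is weeks has a flow problem the median will never show.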

Change Failure Rate - the proportion of changes that cause a production incident - is hard to game if your incident definition is agreed and consistently applied. You can reclassify incidents at the margin, but significant manipulation requires visible operational decisions.
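The calculation itself is trivial; the honesty lives entirely in the incident definition feeding it. A minimal sketch, with illustrative numbers:

```python
def change_failure_rate(deployments, failed_changes):
    """Proportion of changes that led to a production incident.
    'failed_changes' must come from an agreed, consistently applied
    incident definition - the metric is only as honest as that definition."""
    if deployments == 0:
        raise ValueError("no deployments in the window")
    return failed_changes / deployments

# Illustrative month: 120 deployments, 6 linked to incidents.
print(f"{change_failure_rate(120, 6):.1%}")  # 5.0%
```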

Mean Time to Recovery - how long incidents last - is similarly grounded in operational reality. It's harder to manage through classification games because customers and on-call engineers experience the reality directly.
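One caveat worth making explicit: the mean is dominated by the occasional long incident. A sketch with hypothetical durations shows why reporting the median (or the full distribution) alongside the mean keeps the number grounded in what on-call engineers actually experienced:

```python
from statistics import mean, median

# Hypothetical incident durations in minutes for one quarter.
incident_durations = [12, 18, 25, 30, 45, 50, 240]

# A single four-hour incident drags the mean far from the typical case.
print(mean(incident_durations))    # 60
print(median(incident_durations))  # 30
```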

Qualitative team health signals - psychological safety surveys, engineer Net Promoter Score, voluntary retention rate - are imperfect proxies but measure things that are genuinely hard to fake at scale.


The Better Outcome

Better in BVSSH is about genuine quality improvement over time - not the appearance of quality improvement. This requires measurement systems that are honest about what they're actually measuring, designed with Goodhart's Law in mind, and used to inform decisions rather than to evaluate performance.

The best engineering organisations I work with measure sparingly and seriously. They have fewer metrics than most - but those metrics are connected to decisions, verified against reality, and not used as performance indicators for the people generating them.

Fewer metrics, more honestly held, create better outcomes than comprehensive dashboards of gamed numbers.

If your engineering metrics look healthy but your delivery doesn't feel healthy, suspect the measurement system before suspecting the engineers.

Ragan McGill

Engineering leader blending strategy, culture, and craft to build high-performing teams and future-ready platforms. I drive transformation through autonomy, continuous improvement, and data-driven excellence - creating environments where people thrive, innovation flourishes, and outcomes matter. Passionate about empowering others and reshaping engineering for impact at scale. Let’s build better, together.
