
Delivery Metrics

What to measure, what to ignore, and how to use metrics without destroying the thing you are trying to improve.

Delivery metrics tell you how your system is performing. Used well, they expose bottlenecks and drive improvement. Used badly, they become targets that teams game and managers abuse. This section covers the right metrics, the right questions, and the right conversations.

Why Metrics Go Wrong

Before covering which metrics to use, it is worth understanding why metrics so often create problems rather than solve them.

Goodhart's Law states that when a measure becomes a target, it ceases to be a good measure. This is not abstract theory - it describes precisely what happens in most engineering organisations when metrics are introduced. The moment a team is evaluated against a metric, they optimise for the metric rather than for the underlying thing the metric was trying to measure.

Velocity is the clearest example. Velocity - the sum of story points completed per sprint - was designed as a planning tool to help teams predict how much work they could take on. When it is used as a performance metric instead, teams inflate estimates, avoid taking on uncertain work, and split tickets into smaller pieces. The number goes up. Delivery does not.

The solution is not to avoid metrics - delivery systems genuinely benefit from measurement. The solution is to use metrics as diagnostic tools for understanding the system rather than as targets that evaluate people.

DORA: The Baseline

The DORA metrics - derived from the State of DevOps reports and the research behind Accelerate - represent the most rigorous evidence base for what predicts software delivery performance.

The four metrics are:

Deployment Frequency - how often code is deployed to production. High performers deploy multiple times per day. Low performers deploy monthly or less.

Lead Time for Changes - the time from code being committed to it running in production. High performers measure this in hours. Low performers in weeks or months.

Change Failure Rate - the percentage of deployments that cause a degradation requiring remediation. High performers see 0-15% failure rates. Low performers see 46-60%.

Mean Time to Restore - when a failure occurs, how long it takes to restore service. High performers restore in under an hour. Low performers take days.

These four metrics are useful because they cover both speed and stability. Optimising for speed alone typically degrades stability. The DORA metrics, used together, prevent that trade-off from being hidden.

Start here. If you have none of these measured, measure them first. They are your baseline for understanding system performance.
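
To make the definitions concrete, here is a minimal sketch that computes all four from a window of deployment records. The record shape - fields such as committed_at and caused_failure - is an assumption for illustration, not the export format of any particular tool; adapt it to what your pipeline actually emits. The median is used for lead time because a handful of slow changes would otherwise dominate an average.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median

@dataclass
class Deployment:
    committed_at: datetime            # first commit in the change
    deployed_at: datetime             # when it reached production
    caused_failure: bool = False      # did it require remediation?
    restored_at: datetime | None = None  # when service came back, if it failed

def dora_summary(deploys: list[Deployment], window_days: int = 30) -> dict:
    """The four DORA metrics over one reporting window."""
    failures = [d for d in deploys if d.caused_failure]
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(lead_times) if lead_times else None,
        "change_failure_rate": len(failures) / len(deploys) if deploys else 0.0,
        "mean_time_to_restore": (
            sum(restores, timedelta()) / len(restores) if restores else None
        ),
    }
```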

Flow Metrics

DORA metrics tell you about your overall delivery system. Flow metrics give you visibility into the mechanics of how work moves through your process.

Cycle Time

Cycle time is the time from when work starts to when it is done. The specific definition matters - "starts" should mean the moment an engineer actively begins working on something, not when it was added to a backlog.

Cycle time is your most direct signal of flow efficiency. A high cycle time with low WIP indicates work that is complex or blocked. A high cycle time with high WIP indicates that too much is being worked on simultaneously. Tracking cycle time over time and by work type reveals patterns that are invisible in aggregate.
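
As a sketch, assuming work items carry start and finish timestamps (the items below are made up): compute the elapsed days per item, then summarise with a median and a high percentile rather than a mean, because cycle time distributions are usually skewed by a few slow items.

```python
from datetime import datetime
from statistics import median, quantiles

# Made-up work items: (work type, started_at, finished_at). "Started" means
# an engineer actively began the work, not when it entered the backlog.
items = [
    ("feature", datetime(2024, 3, 1), datetime(2024, 3, 6)),
    ("feature", datetime(2024, 3, 2), datetime(2024, 3, 12)),
    ("bug",     datetime(2024, 3, 4), datetime(2024, 3, 5)),
]

cycle_days = [(done - start).days for _, start, done in items]
print("median cycle time:", median(cycle_days), "days")

# The 85th percentile is a common planning figure: "85% of items finish
# within N days". quantiles(n=20) yields 5% steps, so index 16 is the 85th.
if len(cycle_days) >= 2:
    print("85th percentile:", quantiles(cycle_days, n=20)[16], "days")
```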

Throughput

Throughput is the number of items completed per unit of time - usually per week or per sprint. Unlike velocity, throughput is item-count based and does not depend on estimation. It measures actual delivery rate rather than estimated complexity rate.

Throughput is most useful when combined with cycle time. Little's Law gives the relationship: average WIP equals throughput multiplied by average cycle time. If throughput is constant but cycle time is increasing, WIP is growing - a clear signal that more is being started than finished.
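
A quick worked example of that relationship, with illustrative numbers:

```python
# Little's Law, in consistent units: average WIP = throughput x cycle time.
throughput = 4        # items finished per week
cycle_time = 2.5      # average weeks from start to done
print(throughput * cycle_time)  # 10.0 items in progress, on average

# Rearranged, it predicts cycle time from the board: if 15 items are in
# progress at the same throughput, expect 15 / 4 = 3.75 weeks per item.
# Starting more work made every piece of work slower.
```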

WIP

WIP is the number of work items in progress at any point in time. High WIP is almost always a problem. It increases cycle time, increases context switching, and reduces the team's ability to respond to changing priorities.

Track WIP per person and per stage. When WIP is consistently above your team's comfortable limit, it indicates either that too many requests are being accepted or that systemic blockers are preventing completion.
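
A minimal sketch of that tracking, assuming a board snapshot of (assignee, stage) pairs and a per-person limit the team has agreed - both the data and the limit are illustrative:

```python
from collections import Counter

# Hypothetical board snapshot: one (assignee, stage) pair per item in flight.
in_progress = [
    ("ana", "in_dev"), ("ana", "in_dev"), ("ana", "review"),
    ("ben", "review"), ("ben", "testing"), ("cho", "in_dev"),
]

WIP_LIMIT_PER_PERSON = 2  # an assumed team agreement, not a universal rule

per_person = Counter(person for person, _ in in_progress)
per_stage = Counter(stage for _, stage in in_progress)

for person, count in per_person.items():
    if count > WIP_LIMIT_PER_PERSON:
        print(f"{person} has {count} items in flight - over the agreed limit")
print("WIP by stage:", dict(per_stage))
```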

Aging Work Items

Aging work items are items that have sat in a given state beyond a threshold age. What counts as aging depends on your typical cycle time - if your average is five days, items older than ten days are aging. Items past your threshold are candidates for active investigation: are they blocked? Have they been forgotten? Are they genuinely complex?
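
A sketch of that check, using the doubled-average convention above; the item keys and dates are made up:

```python
from datetime import datetime, timedelta

avg_cycle = timedelta(days=5)      # the team's average cycle time
threshold = 2 * avg_cycle          # the "twice the average" convention
today = datetime(2024, 3, 20)

# Hypothetical in-progress items and when work actively started on them.
started = {"PAY-12": datetime(2024, 3, 1), "PAY-19": datetime(2024, 3, 17)}

for key, start in started.items():
    age = today - start
    if age > threshold:
        print(f"{key}: in progress {age.days} days - blocked? forgotten? complex?")
```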

Aging work items are a leading indicator of future delivery problems. Addressing them early prevents them becoming chronic.

Predictability: The Often-Overlooked Metric

Most delivery measurement focuses on speed. Predictability - the degree to which your commitments match your outcomes - is at least as important and far less often measured.

A team that commits to six items per sprint and consistently delivers six is more valuable than a team that commits to ten and consistently delivers six. The second team is producing the same output but consuming significantly more planning capacity, generating disappointment, and eroding trust.

Measure predictability as items completed versus items committed, over rolling sprints. Track the trend. A team whose predictability is improving is developing a more accurate understanding of their capacity and constraints. A team with consistently poor predictability has a systemic issue - estimation, scope control, or unplanned work - that needs investigation.
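
As a sketch with illustrative sprint figures, echoing the earlier example of a team that commits to more than it delivers:

```python
# Rolling sprints as (committed, completed) pairs - made-up numbers.
sprints = [(10, 6), (9, 6), (8, 7), (7, 7)]

ratios = [completed / committed for committed, completed in sprints]
for i, r in enumerate(ratios, start=1):
    print(f"sprint {i}: {r:.0%} of commitment delivered")

# A rising ratio suggests the team is learning its real capacity; a flat
# low ratio points at estimation, scope control, or unplanned work.
print("trend:", "improving" if ratios[-1] > ratios[0] else "flat or declining")
```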

Why Story Points Fail as a Metric

Story points were designed to estimate relative complexity within a single team. They were explicitly not designed to measure productivity or to compare across teams.

When story points are used as a performance metric, several things happen:

Teams inflate estimates over time. If you are evaluated on points delivered, the incentive is to estimate high. This is not dishonesty - it is a rational response to the incentive.

Cross-team comparison becomes toxic. A team with consistently high velocity might be excellent, or might simply have a generous estimation culture. You cannot tell from the number.

The metric drives gaming. Teams split stories to increase point count. Teams avoid taking on uncertain stories because they cannot estimate them confidently. Teams mark stories done before they are properly tested.

The alternative is to measure what you actually care about: items delivered, cycle time, change failure rate. These are harder to game because they are grounded in observable events rather than subjective estimates.

Building a Metrics Culture

Metrics do not improve delivery. Actions informed by metrics improve delivery. The culture around metrics - how they are discussed, who owns them, how they are used in decisions - determines whether they help or harm.

Metrics as Conversation Starters

The right question to ask about any metric is: what does this tell us about our system, and what would we change in response? Metrics should always lead to questions, not conclusions. A high cycle time is not evidence of poor performance - it is a prompt to ask where in the process work spends the most time, and why.

Metrics in the Right Hands

Delivery metrics should be owned by the team and used for their own improvement. When metrics are owned by managers and used to evaluate teams, the gaming dynamic becomes almost inevitable. The team's relationship with the data shifts from curiosity to defensiveness.

This does not mean leaders cannot see metrics - they should. It means the primary purpose of metrics is improvement, not evaluation. Leaders who use metrics to ask "how can I help remove these bottlenecks?" generate better outcomes than leaders who use them to ask "why is this team's velocity lower than last sprint?"

Transparency and Context

When metrics are shared, context must accompany them. A team that took on a significant refactoring effort will have lower throughput that sprint. A team that responded to a major incident will show degraded flow metrics for that period. Metrics without context lead to wrong conclusions.

Build the habit of annotating metric data with context: what happened this period that affected these numbers? What experiments did we run? What did we learn?

Cadences for Reviewing Delivery Data

Different metrics warrant different review frequencies.

Daily or per-standup: WIP and aging items. These are operational signals that should inform daily work decisions.

Weekly or per-sprint: Cycle time, throughput, deployment frequency. These are system-level signals that inform process decisions.

Monthly or quarterly: DORA metrics in aggregate, predictability trends, comparison to previous quarters. These are strategic signals that inform investment decisions.

Build regular review of delivery data into your team's operating rhythm. A fifteen-minute monthly review of flow trends with the team, looking at what the data suggests about process changes, is more valuable than any dashboard nobody reads.

The goal is a team that understands its own system well enough to diagnose its own problems and drive its own improvement. Metrics are the instrument panel. The team is the pilot.