Why Team Health Measurement Matters
Delivery metrics tell you the state of the system today. They do not tell you what the system will look like in six months. A team that is hitting its delivery targets while quietly burning out, losing its best engineers, or operating under unsustainable cognitive load is a team that is consuming its future performance to fund its current performance.
Team health metrics are an attempt to measure the conditions that predict future performance - the factors that determine whether a team will still be performing well in two quarters, or whether it is currently on a trajectory toward problems that will be visible by then.
The research basis for this is substantial. Gallup's decades of employee engagement research, Google's Project Aristotle findings on psychological safety, the DORA research linking job satisfaction to delivery performance, and the work of Amy Edmondson on team learning - all point toward the same finding: the conditions a team works in predict the outcomes the team produces, and those conditions can be measured.
The challenge is that measuring team health honestly is uncomfortable. The findings often implicate management decisions, organisational structure, or leadership behaviour. If you measure team health and then ignore what you find, you have done something worse than not measuring - you have demonstrated that feedback is not safe.
The Spotify Squad Health Check Model
In 2014, Spotify published a team health check model that became widely adopted. The model evaluates teams across dimensions including easy to release, suitable process, tech quality, value, speed, mission, fun, learning, support, and pawns vs players (the degree to which the team has agency over its own decisions).
Teams rate themselves on each dimension as green, yellow, or red, and the trend (improving, stable, deteriorating) is as important as the current state.
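The rating-plus-trend idea can be sketched as a small data structure. The dimension names below follow Spotify's published model; the numeric mapping and the latest-versus-previous trend rule are illustrative assumptions, not part of the model itself:

```python
from dataclasses import dataclass

# Map the traffic-light ratings to numbers so successive checks can be compared.
RATING = {"red": 0, "yellow": 1, "green": 2}

@dataclass
class DimensionCheck:
    dimension: str        # e.g. "easy to release", "tech quality"
    history: list         # ratings from successive health checks, oldest first

    def current(self) -> str:
        return self.history[-1]

    def trend(self) -> str:
        # Compare the latest rating against the previous one; a single
        # data point has no trend yet.
        if len(self.history) < 2:
            return "stable"
        delta = RATING[self.history[-1]] - RATING[self.history[-2]]
        return "improving" if delta > 0 else "deteriorating" if delta < 0 else "stable"

checks = [
    DimensionCheck("easy to release", ["green", "yellow"]),
    DimensionCheck("tech quality", ["red", "yellow"]),
]
for c in checks:
    print(c.dimension, c.current(), c.trend())
```

Storing the history rather than a single snapshot is what makes the trend visible, which is the point the model emphasises.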
The model has significant value as a conversation structure. It creates a shared vocabulary, surfaces issues that might otherwise stay invisible, and creates a regular moment for teams to reflect on their own conditions rather than just their output.
Its limitations are also significant. Self-assessment is subject to social desirability bias - teams rate themselves more favourably than their actual condition warrants when they are worried about how the results will be used. Dimensions like "fun" are poorly defined. Comparison across teams is problematic because teams calibrate differently.
The model works best when it is run by the team for the team, the results are not used for external ranking or comparison, and there is a genuine commitment from leadership to respond to what is found. Used as a surveillance mechanism, it produces dishonest answers. Used as a team reflection tool with psychological safety, it produces useful ones.
Engagement Surveys: What They Measure and What They Do Not
Annual engagement surveys are the most common mechanism for measuring team health at scale. Used well, they provide useful signal. Used badly, they consume significant organisational energy and produce data nobody acts on.
What engagement surveys measure: overall satisfaction with the job, team, manager, and organisation; intent to stay; sense of purpose and connection to company mission; perceptions of fairness and recognition; and confidence in leadership.
What they do not measure: specific technical or process problems, cognitive load, psychological safety (the standard engagement constructs are not sensitive enough), or the conditions for high performance on specific engineering practices.
The limitations: annual surveys capture a moment in time. If the survey lands during a particularly stressful period (a failed launch, a reorg announcement), results will be depressed. If it lands during a high point, they will be elevated. A single annual data point is not a trend.
Response rates matter. A 60% response rate means 40% of your engineers did not respond - and there is evidence that non-responders skew negative. Treat a survey with a low response rate as partial data, not representative data.
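The response-rate check is simple arithmetic, but worth making routine so low-coverage surveys are flagged rather than quietly treated as representative. The 70% threshold below is an illustrative assumption (the text treats 60% as clearly partial):

```python
def survey_coverage(responses: int, population: int, threshold: float = 0.7):
    """Return the response rate and a flag for whether the sample should
    be treated as partial data rather than representative data.
    The default threshold is an assumption, not an established standard."""
    rate = responses / population
    return rate, rate < threshold

rate, partial = survey_coverage(responses=72, population=120)
print(f"response rate {rate:.0%}, treat as partial: {partial}")  # 60%, True
```

Because non-responders skew negative, a flagged survey should be read as a likely upper bound on sentiment, not a midpoint.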
The most damaging failure mode of engagement surveys is the "survey and forget" pattern: collect data, share headline numbers, announce action plans that are never implemented, run the same survey the following year. After one or two cycles of this, engagement scores drop further - not because conditions deteriorated, but because people lost faith that the survey changes anything.
Pulse Surveys: Higher Frequency, Lower Depth
Pulse surveys - short, frequent surveys on a small number of questions - complement or replace annual surveys in many organisations. Four to eight questions, sent monthly or quarterly, give a more responsive picture of trends than an annual cycle.
The standard pulse questions that have the most predictive validity: "I would recommend this organisation as a great place to work," "I feel my work is valued," "I have clarity about what is expected of me," and "I feel I can raise concerns without negative consequences."
The advantage of pulse surveys is trend visibility. If engagement falls between March and May, you have a signal that something changed in that window. You can investigate the specific driver rather than relying on annual anecdote.
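Trend detection over pulse data can be as simple as flagging month-over-month drops beyond a threshold. A minimal sketch, assuming scores on a 1-5 agreement scale and an illustrative 0.3-point threshold:

```python
def flag_drops(monthly_scores: dict, min_drop: float = 0.3):
    """Flag month-over-month drops larger than min_drop.
    Keys are sortable month labels; values are mean scores on a 1-5 scale.
    The threshold is an illustrative assumption, not a validated cutoff."""
    months = sorted(monthly_scores)
    flags = []
    for prev, cur in zip(months, months[1:]):
        drop = monthly_scores[prev] - monthly_scores[cur]
        if drop > min_drop:
            flags.append((prev, cur, round(drop, 2)))
    return flags

scores = {"2024-01": 4.1, "2024-02": 4.0, "2024-03": 4.1, "2024-04": 3.6, "2024-05": 3.5}
print(flag_drops(scores))  # [('2024-03', '2024-04', 0.5)]
```

A flagged window narrows the investigation: what changed for this team between those two surveys?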
The risk is survey fatigue. If engineers are receiving a survey every two weeks and never seeing evidence that responses affect anything, they will stop responding. Pulse surveys require even more discipline around acting on findings than annual surveys, because the feedback loop is tighter and the expectation of response is higher.
Psychological Safety Measurement
Psychological safety - the belief that the team environment is safe for interpersonal risk-taking, that you will not be punished for speaking up with ideas, questions, concerns, or mistakes - is the single strongest predictor of team effectiveness in Google's research and one of the most robust constructs in organisational psychology.
Edmondson's original psychological safety scale uses seven items, including: "If you make a mistake on this team, it is often held against you" (reverse-scored), "Members of this team are able to bring up problems and tough issues," and "It is safe to take a risk on this team."
The scale is reliable and valid. It is also sensitive to how results are used. If teams believe that psychological safety scores will be used to evaluate managers, scores will be inflated. The scale should be administered anonymously and with clear commitment that results will be shared back to the team and used to drive improvement, not to rank or evaluate.
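Reverse-scoring is the part of administering the scale that is easiest to get wrong: on negatively worded items, high agreement indicates low safety and must be flipped before averaging. A sketch using the three items quoted above (the full instrument has seven; the 1-7 agreement scale is an assumption, as Likert ranges vary by administration):

```python
SCALE_MAX = 7  # assuming a 1-7 agreement scale

# Items marked True are negatively worded (reverse-scored).
ITEMS = [
    ("If you make a mistake on this team, it is often held against you", True),
    ("Members of this team are able to bring up problems and tough issues", False),
    ("It is safe to take a risk on this team", False),
]

def safety_score(responses: list) -> float:
    """Average the item responses, flipping reverse-scored items so that
    a higher score always means higher psychological safety."""
    adjusted = [
        (SCALE_MAX + 1 - r) if reverse else r
        for r, (_, reverse) in zip(responses, ITEMS)
    ]
    return sum(adjusted) / len(adjusted)

# A respondent who strongly disagrees with the negative item (2) and
# agrees with the positive ones (6, 5): the 2 is flipped to 6.
print(round(safety_score([2, 6, 5]), 2))  # 5.67
```

Scores should only ever be aggregated at team level from anonymous responses, consistent with the administration guidance above.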
Engineering-specific low-safety signals: engineers who do not speak up in code reviews, silent retrospectives where real problems are not raised, incident postmortems that implicitly identify individuals rather than systems, and a pattern of only positive news reaching leadership.
Cognitive Load Signals
Cognitive load is the mental effort required to do the work. High cognitive load - too many concurrent responsibilities, unclear priorities, complex systems without adequate documentation, constant context switching - is one of the primary drivers of engineer burnout and a direct impediment to high-quality work.
Cognitive load is harder to measure than engagement because there is no standard validated scale in common use. Signals to watch for:
Overlong on-call rotations: if engineers are on-call more than one week in four (in a team of fewer than eight engineers), the cognitive and physical burden is unsustainable.
High work in progress: teams with many concurrent workstreams have higher cognitive load than teams focused on fewer things. Measuring WIP per team is a useful proxy.
Context switching rate: how often are engineers switching between unrelated workstreams or projects in a given week? This can be measured approximately from calendar data or self-report.
Interruption frequency: are engineers able to maintain focused work periods, or are they in a constant state of reactive response to messages, meetings, and urgent requests?
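In the absence of a validated scale, these signals can be combined into rough per-engineer proxies. A minimal sketch; the input shapes and all thresholds implied here are assumptions, and the event log would in practice come from ticketing or calendar data:

```python
def load_signals(assignments: list, events: list, oncall_weeks: int, window_weeks: int):
    """Rough cognitive load proxies for one engineer over a window:
    - wip: distinct concurrent workstreams assigned
    - context_switches: transitions between different workstreams in the event log
    - oncall_share: fraction of the window spent on-call
    These are proxies, not measurements: useful for trends, not absolute judgments."""
    wip = len(set(assignments))
    switches = sum(1 for a, b in zip(events, events[1:]) if a != b)
    oncall_share = oncall_weeks / window_weeks
    return {"wip": wip, "context_switches": switches, "oncall_share": oncall_share}

week = ["billing", "auth", "billing", "incident", "auth"]
print(load_signals(["billing", "auth", "incident"], week, oncall_weeks=2, window_weeks=8))
# {'wip': 3, 'context_switches': 4, 'oncall_share': 0.25}
```

As with the survey data, the value is in watching these numbers move per team over time, not in comparing absolute values across teams.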
The Team Topologies framework uses cognitive load explicitly as a principle for team design - teams should own only as much as they can reasonably understand and operate. Assessing cognitive load against this criterion is a useful design exercise.
Retention as a Lagging Health Metric
Voluntary attrition is a lagging indicator of team health - by the time engineers are leaving, the conditions that drove them out have been present for months or years. But tracking it carefully provides useful signal about the past state of the organisation and the effectiveness of improvement efforts.
Track attrition by team, not just organisation-wide. Organisation-wide attrition of 15% may mask one team at 5% and another at 40%. The team-level pattern is where the signal is.
Track regrettable attrition separately from total attrition. Not all departures are equal. Losing a high performer who goes to a competitor is different from losing someone who was underperforming and will not be replaced.
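Both recommendations - per-team breakdown and a separate regrettable rate - amount to a simple calculation. A sketch, assuming a year's worth of departure records and average headcounts per team (the field names are illustrative):

```python
def attrition_rates(departures: list, headcounts: dict) -> dict:
    """Annual total and regrettable attrition per team.
    `departures` is a list of (team, regrettable) tuples for the year;
    `headcounts` maps team name to average headcount over the same year."""
    out = {}
    for team, avg_hc in headcounts.items():
        total = sum(1 for t, _ in departures if t == team)
        regret = sum(1 for t, r in departures if t == team and r)
        out[team] = {"total": total / avg_hc, "regrettable": regret / avg_hc}
    return out

departures = [("payments", True), ("payments", True), ("platform", False)]
print(attrition_rates(departures, {"payments": 5, "platform": 20}))
# payments: 40% total, all regrettable; platform: 5% total, none regrettable
```

The example shows the masking effect described above: pooled across both teams, attrition looks moderate, while the payments team is losing its people at an alarming rate.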
Exit interview data is valuable but requires careful handling. People leaving often say positive things in exit interviews to preserve references and relationships. Anonymous exit surveys conducted 30-90 days after departure produce more honest data.
Running a Team Health Review That Produces Honest Answers
The conditions for an honest team health review: psychological safety exists in the team (or is created for the session), the purpose is clearly understood as improvement not evaluation, the facilitator is neutral (ideally not the team's direct manager), and there is a visible commitment to follow-up.
Practical format: a structured retrospective focused on health dimensions rather than delivery outcomes. Use anonymous input collection (digital tools like Mentimeter or physical sticky notes) for sensitive topics. Create space for honest assessment of what is genuinely working and what is not.
The most important part is what happens after. Identify the two or three highest-priority issues. Create specific actions with owners and timelines. Revisit at the next health check to confirm whether they were addressed. This closes the loop and demonstrates that the measurement has consequences.