Standard: Mean Time to Recovery (MTTR)

Description

Mean Time to Recovery (MTTR) measures the average time it takes to restore a service to normal operation after a failure, disruption, or incident. It spans the detection, diagnosis, and resolution phases, from first alert to verified recovery.

A low MTTR is a key indicator of operational resilience, reflecting how quickly and effectively a team can respond to unplanned issues and minimise impact to users or downstream systems.

How to Use

What to Measure

The time elapsed from when an incident is first detected or triggered until the affected service or system is verified as restored to normal operation.

Formula

MTTR = Sum of All Recovery Times / Number of Incidents
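
As a worked illustration of the formula, the sketch below computes MTTR from three incidents. The timestamps and the function name `mean_time_to_recovery` are hypothetical, chosen only for this example; real data would come from an incident management system.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected_at, recovered_at) pairs.
incidents = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 9, 45)),      # 45 min
    (datetime(2024, 3, 8, 14, 10), datetime(2024, 3, 8, 16, 10)),   # 120 min
    (datetime(2024, 3, 20, 23, 30), datetime(2024, 3, 21, 0, 45)),  # 75 min
]


def mean_time_to_recovery(records):
    """Return MTTR as a timedelta: sum of recovery times / number of incidents."""
    if not records:
        raise ValueError("MTTR is undefined when there are no incidents")
    total = sum(
        (recovered - detected for detected, recovered in records),
        timedelta(),  # start value so timedeltas can be summed
    )
    return total / len(records)


print(mean_time_to_recovery(incidents))  # 1:20:00, i.e. (45 + 120 + 75) / 3 = 80 min
```

Raising an error for an empty period, rather than returning zero, avoids reporting a "perfect" MTTR when no incidents were recorded.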

Segment by:

Service, team, incident type, or severity.
Recovery method (manual vs automated).

Instrumentation Tips

Use incident management systems (e.g. PagerDuty, Opsgenie, ServiceNow) to track timestamps.
Capture timestamps for the alert, diagnosis start, fix, and verified full recovery.
Tag post-incident reviews with accurate recovery durations.
Automate data collection where possible to reduce reporting friction.
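
One way to use the four captured timestamps is to break each incident into phases, so a long recovery can be traced to slow response, slow repair, or slow verification. This is a minimal sketch: the `IncidentTimeline` class, its field names, and the phase labels are assumptions made for the example and do not correspond to any particular tool's schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """Hypothetical record holding the four timestamps listed above."""
    alerted_at: datetime
    diagnosis_started_at: datetime
    fixed_at: datetime
    recovered_at: datetime  # the point at which full recovery was verified

    def phases(self) -> dict:
        """Return the duration of each phase plus the total recovery time."""
        return {
            "alert_to_diagnosis": self.diagnosis_started_at - self.alerted_at,
            "diagnosis_to_fix": self.fixed_at - self.diagnosis_started_at,
            "fix_to_recovery": self.recovered_at - self.fixed_at,
            "total": self.recovered_at - self.alerted_at,
        }


timeline = IncidentTimeline(
    alerted_at=datetime(2024, 5, 2, 10, 0),
    diagnosis_started_at=datetime(2024, 5, 2, 10, 5),
    fixed_at=datetime(2024, 5, 2, 10, 35),
    recovered_at=datetime(2024, 5, 2, 10, 50),
)

for name, duration in timeline.phases().items():
    print(f"{name}: {duration}")
```

The `total` value, measured through verified recovery, is the figure that feeds the MTTR formula; stopping the clock at `fixed_at` would understate it by 15 minutes in this example.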

Why It Matters

Operational resilience: Fast recovery limits impact on users and reputation.
Team readiness: Short MTTR reflects well-defined procedures and strong incident response culture.
Learning opportunities: Long MTTR can indicate weak observability or missing automation.
Customer trust: Rapid, visible recovery reinforces system dependability.

Best Practices

Define clear roles and playbooks for incident response.
Run incident simulations and chaos engineering experiments.
Invest in real-time monitoring and alerting tools.
Implement auto-remediation for known failure patterns.
Conduct blameless post-incident reviews focused on learning.

Common Pitfalls

Measuring only from detection to fix, which excludes the time taken to verify full recovery.
Inflated recovery times due to unclear ownership or manual steps.
Lack of historical data due to poor incident tracking.
Using MTTR in isolation without looking at incident volume or severity.

Signals of Success

MTTR is trending down across services and teams.
Recovery timelines are predictable and within agreed SLAs.
Incidents trigger rapid alerts, clear ownership, and minimal downtime.
Post-incident reviews lead to concrete, system-level improvements.
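
The first two signals can be checked with a small periodic report. The sketch below is illustrative only: the monthly data is invented, the 60-minute recovery target stands in for whatever SLA has actually been agreed, and `monthly_report` is a hypothetical helper.

```python
from collections import defaultdict

# Assumed recovery target in minutes; substitute the agreed SLA.
RECOVERY_TARGET_MINUTES = 60

# Hypothetical records: (month, recovery_minutes).
incidents = [
    ("2024-01", 95), ("2024-01", 65),
    ("2024-02", 70), ("2024-02", 50),
    ("2024-03", 45), ("2024-03", 35),
]


def monthly_report(records, target):
    """Return [(month, mttr_minutes, share_within_target)] in month order."""
    by_month = defaultdict(list)
    for month, minutes in records:
        by_month[month].append(minutes)
    report = []
    for month in sorted(by_month):
        times = by_month[month]
        mttr = sum(times) / len(times)
        within = sum(1 for t in times if t <= target) / len(times)
        report.append((month, mttr, within))
    return report


report = monthly_report(incidents, RECOVERY_TARGET_MINUTES)
for month, mttr, within in report:
    print(f"{month}: MTTR {mttr:.0f} min, {within:.0%} within target")

# True when every month's MTTR is lower than the month before it.
trending_down = all(later[1] < earlier[1] for earlier, later in zip(report, report[1:]))
print("MTTR trending down:", trending_down)
```

Reporting the share of incidents within target next to the mean helps show whether recovery is predictable, since a falling average can still hide individual incidents that breach the SLA.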

[[Change Failure Rate]]
[[Incident Volume per Deployment]]
[[Auto-Healing Coverage]]
[[System-Level SLA Breaches]]
[[Time to Detect Data Pipeline Failure]]