Standard: Mean Time to Recovery (MTTR)
Description
Mean Time to Recovery (MTTR) measures the average time it takes to restore a service to normal operations after a failure, disruption, or incident. It spans the detection, diagnosis, and resolution phases, from the first alert to verified recovery.
A low MTTR is a key indicator of operational resilience, reflecting how quickly and effectively a team can respond to unplanned issues and minimise impact to users or downstream systems.
How to Use
What to Measure
- The time elapsed between when an incident is first detected or triggered and when the affected service or system is restored to normal operations.
MTTR = Sum of All Recovery Times / Number of Incidents
Segment by:
- Service, team, incident type, or severity.
- Recovery method (manual vs automated).
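The formula and segmentation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the record fields (`service`, `severity`, `detected`, `recovered`) and the sample incidents are hypothetical and do not reflect any particular incident tool's schema.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical incident records; field names and values are illustrative only.
incidents = [
    {"service": "checkout", "severity": "sev1",
     "detected": datetime(2024, 5, 1, 10, 0), "recovered": datetime(2024, 5, 1, 10, 30)},
    {"service": "checkout", "severity": "sev2",
     "detected": datetime(2024, 5, 3, 14, 0), "recovered": datetime(2024, 5, 3, 15, 30)},
    {"service": "search", "severity": "sev2",
     "detected": datetime(2024, 5, 4, 9, 0), "recovered": datetime(2024, 5, 4, 10, 0)},
]

def mttr(records):
    """Sum of all recovery times divided by the number of incidents."""
    if not records:
        return None  # MTTR is undefined when there are no incidents
    total = sum((r["recovered"] - r["detected"] for r in records), timedelta())
    return total / len(records)

def mttr_by(records, key):
    """MTTR per segment, e.g. key="service" or key="severity"."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    return {segment: mttr(group) for segment, group in groups.items()}

overall = mttr(incidents)                  # 60 minutes for the sample data
per_service = mttr_by(incidents, "service")
per_severity = mttr_by(incidents, "severity")
```

With the sample data, the three recovery times (30, 90, and 60 minutes) average to 60 minutes overall, while the severity breakdown separates the single 30-minute sev1 incident from the 75-minute sev2 average.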
Instrumentation Tips
- Use incident management systems (e.g. PagerDuty, Opsgenie, ServiceNow) to track timestamps.
- Capture the timestamps of the alert, diagnosis start, fix, and verified full recovery.
- Tag post-incident reviews with accurate recovery durations.
- Automate data collection where possible to reduce reporting friction.
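One way to hold the four timestamps listed above is a small record type that derives the duration of each phase. The sketch below assumes the timestamps have already been exported from the incident management system; the class name, field names, and phase labels are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class IncidentTimeline:
    """Hypothetical record of the four timestamps for one incident."""
    alerted: datetime
    diagnosis_started: datetime
    fix_applied: datetime
    recovered: datetime  # time at which full recovery was verified

    def __post_init__(self):
        # Reject timelines whose timestamps are out of order.
        steps = [self.alerted, self.diagnosis_started, self.fix_applied, self.recovered]
        if steps != sorted(steps):
            raise ValueError("timestamps must be in chronological order")

    def phases(self) -> dict[str, timedelta]:
        """Duration of each phase plus the total recovery time."""
        return {
            "alert_to_diagnosis": self.diagnosis_started - self.alerted,
            "diagnosis_to_fix": self.fix_applied - self.diagnosis_started,
            "fix_to_verified_recovery": self.recovered - self.fix_applied,
            "total_recovery": self.recovered - self.alerted,
        }

timeline = IncidentTimeline(
    alerted=datetime(2024, 5, 1, 10, 0),
    diagnosis_started=datetime(2024, 5, 1, 10, 5),
    fix_applied=datetime(2024, 5, 1, 10, 35),
    recovered=datetime(2024, 5, 1, 10, 50),
)
phases = timeline.phases()
```

Keeping the phases separate shows where the time goes: in this example, 15 of the 50 minutes fall between the fix and verified recovery, which a detection-to-fix measurement would miss.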
Why It Matters
- Operational resilience: Fast recovery limits impact on users and reputation.
- Team readiness: Short MTTR reflects well-defined procedures and strong incident response culture.
- Learning opportunities: Long MTTR can indicate weak observability or missing automation.
- Customer trust: Rapid, visible recovery reinforces system dependability.
Best Practices
- Define clear roles and playbooks for incident response.
- Run incident simulations and chaos engineering experiments.
- Invest in real-time monitoring and alerting tools.
- Implement auto-remediation for known failure patterns.
- Conduct blameless post-incident reviews focused on learning.
Common Pitfalls
- Only measuring from detection to fix, excluding verification of full recovery.
- Inflated recovery times due to unclear ownership or manual steps.
- Lack of historical data due to poor incident tracking.
- Using MTTR in isolation without looking at incident volume or severity.
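The last pitfall can be made concrete with a small comparison. In the sketch below, the team names and recovery times are invented for illustration; the point is only that the team with the lower MTTR can still accumulate more total downtime once incident volume is taken into account.

```python
# Hypothetical recovery times in minutes per team; values are illustrative only.
recovery_minutes = {
    "team_a": [30] * 20,  # many short incidents
    "team_b": [60, 60],   # few longer incidents
}

def summarise(durations):
    """Report MTTR together with incident count and total downtime."""
    count = len(durations)
    total = sum(durations)
    return {
        "incidents": count,
        "mttr_min": total / count if count else None,
        "total_downtime_min": total,
    }

report = {team: summarise(d) for team, d in recovery_minutes.items()}
```

Here `team_a` has half the MTTR of `team_b` (30 versus 60 minutes) but five times the total downtime (600 versus 120 minutes), so MTTR alone would rank the teams the wrong way round for user impact.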
Signals of Success
- MTTR is trending down across services and teams.
- Recovery timelines are predictable and within agreed SLAs.
- Incidents trigger rapid alerts, clear ownership, and minimal downtime.
- Post-incident reviews lead to concrete, system-level improvements.
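The first two signals can be checked against historical data. The sketch below rests on several assumptions: the monthly recovery durations are invented, `SLA_MINUTES` is a hypothetical target, and "trending down" is defined simply as a strict month-over-month decrease, which a real report would likely replace with a smoothed or longer-window comparison.

```python
from statistics import mean

# Hypothetical recovery durations in minutes, grouped by month.
monthly_recoveries = {
    "2024-03": [90, 70, 80],
    "2024-04": [60, 75],
    "2024-05": [45, 50, 40],
}
SLA_MINUTES = 60  # assumed recovery target, not taken from any real agreement

def monthly_mttr(data):
    """MTTR per month, ordered chronologically by the month key."""
    return {month: mean(durations) for month, durations in sorted(data.items())}

def is_trending_down(values):
    """True if every value is lower than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))

def sla_compliance(data, limit):
    """Fraction of incidents recovered within the limit."""
    durations = [d for month in data.values() for d in month]
    return sum(d <= limit for d in durations) / len(durations)

trend = monthly_mttr(monthly_recoveries)
trending_down = is_trending_down(list(trend.values()))
compliance = sla_compliance(monthly_recoveries, SLA_MINUTES)
```

For the sample data, monthly MTTR falls from 80 to 67.5 to 45 minutes, and half of the eight incidents recover within the assumed 60-minute target.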
Related Measures
- [[Change Failure Rate]]
- [[Incident Volume per Deployment]]
- [[Auto-Healing Coverage]]
- [[System-Level SLA Breaches]]
- [[Time to Detect Data Pipeline Failure]]