Standard: Mean Time to Recovery (MTTR)

Description

Mean Time to Recovery (MTTR) measures the average time it takes to restore a service to normal operation after a failure, disruption, or incident. It spans the detection, diagnosis, and resolution phases, from first alert to verified recovery.

A low MTTR is a key indicator of operational resilience, reflecting how quickly and effectively a team can respond to unplanned issues and minimise impact to users or downstream systems.

How to Use

What to Measure

  • The time elapsed between when an incident is first detected or triggered and when the affected service or system is restored to normal operations.

Formula

MTTR = Sum of All Recovery Times / Number of Incidents
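
As a minimal sketch of the formula above, assuming each incident is recorded as a pair of detection and recovery timestamps (the function name and the sample data are illustrative, not taken from any particular tool):

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Return the mean recovery duration as a timedelta.

    `incidents` is a list of (detected_at, recovered_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("MTTR is undefined when there are no incidents")
    # Sum of all recovery times, starting from a zero-length duration.
    total = sum(
        (recovered - detected for detected, recovered in incidents),
        timedelta(),
    )
    # Divide by the number of incidents.
    return total / len(incidents)


# Hypothetical sample data: recoveries of 30, 90, and 60 minutes.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 15, 30)),
    (datetime(2024, 1, 20, 2, 0), datetime(2024, 1, 20, 3, 0)),
]

print(mean_time_to_recovery(incidents))  # 1:00:00
```

The empty-list case raises an error rather than returning zero, because a period with no incidents has no meaningful MTTR.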

Segment by:

  • Service, team, incident type, or severity.
  • Recovery method (manual vs automated).
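
The segmentation above could be sketched as a simple group-by. The record fields (`service`, `severity`, `method`, `recovery_minutes`) and the sample values are assumptions made for illustration; real incident exports will use their own field names.

```python
from collections import defaultdict
from statistics import mean


def mttr_by(records, key):
    """Group incident records by `key` and return the mean recovery minutes per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["recovery_minutes"])
    return {group: mean(durations) for group, durations in groups.items()}


# Hypothetical incident records.
records = [
    {"service": "checkout", "severity": "sev1", "method": "manual", "recovery_minutes": 90},
    {"service": "checkout", "severity": "sev2", "method": "automated", "recovery_minutes": 10},
    {"service": "search", "severity": "sev2", "method": "manual", "recovery_minutes": 40},
    {"service": "search", "severity": "sev2", "method": "automated", "recovery_minutes": 20},
]

print(mttr_by(records, "service"))
print(mttr_by(records, "method"))
```

In this sample, manual recoveries average 65 minutes and automated ones 15 minutes, a gap that a single overall MTTR figure would hide.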

Instrumentation Tips

  • Use incident management systems (e.g. PagerDuty, Opsgenie, ServiceNow) to track timestamps.
  • Capture timestamps for the alert, the start of diagnosis, the fix, and verified full recovery.
  • Tag post-incident reviews with accurate recovery durations.
  • Automate data collection where possible to reduce reporting friction.
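
One way to structure the four timestamps listed above is a small record type that derives per-phase durations. This is a sketch under assumed names (`IncidentTimeline` and its fields are hypothetical); incident management systems expose their own schemas, which would need to be mapped onto something like this.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """Hypothetical record of the four timestamps captured for one incident."""
    alerted_at: datetime
    diagnosis_started_at: datetime
    fix_applied_at: datetime
    recovered_at: datetime

    def phases(self) -> dict[str, timedelta]:
        """Break the total recovery time into its consecutive phases."""
        return {
            "alert_to_diagnosis": self.diagnosis_started_at - self.alerted_at,
            "diagnosis_to_fix": self.fix_applied_at - self.diagnosis_started_at,
            "fix_to_recovery": self.recovered_at - self.fix_applied_at,
            "total_recovery": self.recovered_at - self.alerted_at,
        }


# Hypothetical incident: alert at 09:00, verified recovery at 09:50.
timeline = IncidentTimeline(
    alerted_at=datetime(2024, 3, 1, 9, 0),
    diagnosis_started_at=datetime(2024, 3, 1, 9, 5),
    fix_applied_at=datetime(2024, 3, 1, 9, 35),
    recovered_at=datetime(2024, 3, 1, 9, 50),
)

for name, duration in timeline.phases().items():
    print(f"{name}: {duration}")
```

Keeping the phases separate shows where the time goes: a long gap between fix and recovery, for example, points at slow verification rather than slow diagnosis.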

Why It Matters

  • Operational resilience: Fast recovery limits impact on users and reputation.
  • Team readiness: Short MTTR reflects well-defined procedures and a strong incident-response culture.
  • Learning opportunities: Long MTTR can indicate weak observability or missing automation.
  • Customer trust: Rapid, visible recovery reinforces system dependability.

Best Practices

  • Define clear roles and playbooks for incident response.
  • Run incident simulations and chaos engineering experiments.
  • Invest in real-time monitoring and alerting tools.
  • Implement auto-remediation for known failure patterns.
  • Conduct blameless post-incident reviews focused on learning.

Common Pitfalls

  • Only measuring from detection to fix, excluding verification of full recovery.
  • Inflated recovery times due to unclear ownership or manual steps.
  • Lack of historical data due to poor incident tracking.
  • Using MTTR in isolation without looking at incident volume or severity.
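
To illustrate the last pitfall, the sketch below reports incident count and total downtime alongside MTTR. The function name and the sample periods are hypothetical; the point is only that two periods can share the same MTTR while differing greatly in impact.

```python
def recovery_summary(recovery_minutes):
    """Summarise a period's incidents: count, MTTR, and total downtime in minutes."""
    count = len(recovery_minutes)
    total = sum(recovery_minutes)
    return {
        "incidents": count,
        # MTTR is undefined for a period with no incidents.
        "mttr_minutes": total / count if count else None,
        "total_downtime_minutes": total,
    }


# Hypothetical periods: the same 30-minute MTTR, very different incident volume.
quiet_period = recovery_summary([30, 30])
busy_period = recovery_summary([30] * 10)

print(quiet_period)
print(busy_period)
```

Both periods report an MTTR of 30 minutes, yet the second accumulates five times the downtime, which is why MTTR is best read together with incident volume and severity.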

Signals of Success

  • MTTR is trending down across services and teams.
  • Recovery timelines are predictable and within agreed SLAs.
  • Incidents trigger rapid alerts, clear ownership, and minimal downtime.
  • Post-incident reviews lead to concrete, system-level improvements.

Related Measures

  • [[Change Failure Rate]]
  • [[Incident Volume per Deployment]]
  • [[Auto-Healing Coverage]]
  • [[System-Level SLA Breaches]]
  • [[Time to Detect Data Pipeline Failure]]
