Standard: Mean Time to Detect (MTTD)

Description

Mean Time to Detect (MTTD) measures the average time it takes for a team to become aware of a defect, incident, or failure after it occurs in production. It is a key indicator of how well teams monitor, observe, and respond to issues in real time.

A lower MTTD reflects mature observability, well-defined alerting, and responsive feedback loops—essential for building high-quality, resilient systems.

How to Use

What to Measure

  • Time from when a fault, failure, or defect occurs in production to when the team is alerted or becomes aware.
  • Include only confirmed, real incidents (filter out false positives).
  • Segment by system, severity, environment (e.g. staging vs production).

Formula

MTTD = Total Detection Time Across Incidents / Number of Incidents

You may also break this down by:

  • Alert type (manual, automated)
  • Source of detection (customer report, monitoring, logs)
  • Detection delay by severity level
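The formula and breakdowns above can be sketched in code. This is a minimal illustration, assuming a hypothetical incident-record shape (the `started`, `detected`, `severity`, and `confirmed` fields are not a prescribed schema); note how false positives are filtered out before averaging, per "What to Measure":

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: fault start, detection time, severity,
# and whether the alert turned out to be a real incident.
incidents = [
    {"started": datetime(2024, 5, 1, 9, 0), "detected": datetime(2024, 5, 1, 9, 4),
     "severity": "high", "confirmed": True},
    {"started": datetime(2024, 5, 2, 14, 0), "detected": datetime(2024, 5, 2, 14, 30),
     "severity": "low", "confirmed": True},
    {"started": datetime(2024, 5, 3, 11, 0), "detected": datetime(2024, 5, 3, 11, 2),
     "severity": "high", "confirmed": False},  # false positive: excluded
]

def mttd_minutes(records):
    """MTTD = total detection time across confirmed incidents / number of incidents."""
    real = [r for r in records if r["confirmed"]]
    delays = [(r["detected"] - r["started"]).total_seconds() / 60 for r in real]
    return mean(delays)

print(mttd_minutes(incidents))  # 17.0 minutes: (4 + 30) / 2
```

The same function applied to a filtered subset (e.g. only `severity == "high"` records) gives the per-severity breakdown.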

Instrumentation Tips

  • Integrate monitoring tools (e.g. Prometheus, Datadog, New Relic) with alerting platforms (e.g. PagerDuty, Opsgenie).
  • Tag incidents with timestamped "fault started" and "detected" fields.
  • Track whether detection was automatic, customer-reported, or discovered later.
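The tagging tips above can be captured in a small incident record. The `Incident` class and `DetectionSource` values below are hypothetical illustrations of the "fault started" / "detected" timestamp fields and the detection-channel tag, not a reference to any particular tool's data model:

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class DetectionSource(Enum):
    AUTOMATED_ALERT = "automated_alert"
    CUSTOMER_REPORT = "customer_report"
    POST_HOC = "post_hoc"  # discovered later, e.g. during log review

@dataclass
class Incident:
    id: str
    fault_started: datetime   # when the fault actually began in production
    detected: datetime        # when the team was alerted or became aware
    source: DetectionSource   # automatic, customer-reported, or found later

    @property
    def detection_delay_minutes(self) -> float:
        return (self.detected - self.fault_started).total_seconds() / 60

inc = Incident("INC-1042", datetime(2024, 6, 1, 8, 0),
               datetime(2024, 6, 1, 8, 7), DetectionSource.AUTOMATED_ALERT)
print(inc.detection_delay_minutes)  # 7.0
```

Recording the source on every incident is what makes the "source of detection" breakdown in the formula section possible later.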

Why It Matters

  • Reduces user impact: Early detection shortens time to containment and resolution.
  • Validates observability: Demonstrates how well your systems "tell you when they’re broken".
  • Improves team response: Enables faster triage and recovery through real-time visibility.
  • Supports root cause analysis: Faster detection means fresher logs, metrics, and context, which makes diagnosis more accurate and less disruptive.

Best Practices

  • Define and monitor SLIs that reflect real system behaviour.
  • Implement alert fatigue management to avoid missed or ignored issues.
  • Pair detection metrics with MTTR (Mean Time to Recovery) for a full view.
  • Use synthetic monitoring for critical paths and user journeys.
  • Continuously review and refine alert thresholds, dashboards, and detection logic.

Common Pitfalls

  • Relying on users or support to report incidents.
  • High MTTD caused by noisy alerts, slow triage, or lack of visibility.
  • False positives inflating detection times or lowering trust in alerts.
  • Measuring only major incidents, missing low-severity issues that indicate drift.

Signals of Success

  • Most incidents are detected automatically within minutes.
  • Teams act on alerts before customers are affected.
  • Alert signals have high precision and low noise.
  • MTTD is visible and improving as observability matures.

Related Measures

  • [[Change Failure Rate]]
  • [[CoE/Engineering/Measures/Delivery Performance/Mean Time to Recovery (MTTR)]]
  • [[Defect Escape Rate]]
  • [[Quality Gate Compliance]]
  • [[System Health SLO Compliance]]
