Standard: Mean Time to Detect (MTTD)

Description

Mean Time to Detect (MTTD) measures the average time it takes for a team to become aware of a defect, incident, or failure after it occurs in production. It is a key indicator of how well teams monitor, observe, and respond to issues in real time.

A lower MTTD reflects mature observability, well-defined alerting, and responsive feedback loops, all of which are essential for building high-quality, resilient systems.

How to Use

What to Measure

Time from when a fault, failure, or defect occurs in production to when the team is alerted or becomes aware.
Include only confirmed, real incidents (filter out false positives).
Segment by system, severity, and environment (e.g. staging vs production).

Formula

MTTD = Total Detection Time Across Incidents / Number of Incidents

You may also break this down by:

Alert type (manual, automated)
Source of detection (customer report, monitoring, logs)
Severity level
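
As a rough sketch of the formula and breakdowns above, the following Python computes MTTD overall and per detection source. The tuple layout (fault started, detected, source) and the sample incidents are illustrative assumptions, not data from a real system.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical confirmed incidents: (fault_started, detected, detection source).
INCIDENTS = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 4), "monitoring"),
    (datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 9, 6), "monitoring"),
    (datetime(2024, 1, 3, 14, 0), datetime(2024, 1, 3, 14, 50), "customer report"),
]


def mttd(incidents):
    """Total detection time across incidents divided by the number of incidents."""
    if not incidents:
        raise ValueError("MTTD is undefined for zero incidents")
    total = sum(
        (detected - started for started, detected, _ in incidents),
        timedelta(),
    )
    return total / len(incidents)


def mttd_by_source(incidents):
    """Group incidents by detection source and compute MTTD per group."""
    groups = defaultdict(list)
    for incident in incidents:
        groups[incident[2]].append(incident)
    return {source: mttd(group) for source, group in groups.items()}


overall = mttd(INCIDENTS)
per_source = mttd_by_source(INCIDENTS)
print("Overall MTTD:", overall)
for source, value in per_source.items():
    print(f"MTTD ({source}):", value)
```

In this sample the single customer-reported incident dominates the overall average, which is why the per-source view is worth keeping alongside the headline number.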

Instrumentation Tips

Integrate monitoring tools (e.g. Prometheus, Datadog, New Relic) with alerting platforms (e.g. PagerDuty, Opsgenie).
Record "fault started" and "detected" timestamps on every incident.
Track whether detection was automatic, customer-reported, or discovered later.
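
One possible shape for such an incident record is sketched below. The class and field names (`IncidentRecord`, `fault_started`, `detected`, `DetectionSource`) and the sample incident ID are hypothetical and not tied to any particular incident tool; the timezone check is an added assumption meant to keep timestamps from different systems comparable.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class DetectionSource(Enum):
    """How the team became aware of the incident (hypothetical categories)."""
    AUTOMATIC = "automatic"
    CUSTOMER_REPORTED = "customer-reported"
    DISCOVERED_LATER = "discovered-later"


@dataclass(frozen=True)
class IncidentRecord:
    incident_id: str
    fault_started: datetime
    detected: datetime
    source: DetectionSource

    def __post_init__(self):
        # Assumption: require timezone-aware timestamps so values can be compared.
        for ts in (self.fault_started, self.detected):
            if ts.tzinfo is None:
                raise ValueError("timestamps must be timezone-aware")
        if self.detected < self.fault_started:
            raise ValueError("'detected' cannot precede 'fault_started'")

    @property
    def time_to_detect(self) -> timedelta:
        """Detection delay for this single incident."""
        return self.detected - self.fault_started


record = IncidentRecord(
    incident_id="INC-0001",  # hypothetical identifier
    fault_started=datetime(2024, 3, 5, 12, 0, tzinfo=timezone.utc),
    detected=datetime(2024, 3, 5, 12, 7, tzinfo=timezone.utc),
    source=DetectionSource.AUTOMATIC,
)
print(record.time_to_detect)
```

Rejecting records where detection precedes the fault start catches data-entry mistakes early, before they distort the average.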

Why It Matters

Reduces user impact: Early detection shortens time to containment and resolution.
Validates observability: Demonstrates how well your systems "tell you when they’re broken".
Improves team response: Enables faster triage and recovery through real-time visibility.
Supports root cause analysis: Faster detection leads to more accurate diagnosis and less disruption.

Best Practices

Define and monitor SLIs that reflect real system behaviour.
Manage alert fatigue so that real issues are not missed or ignored.
Pair detection metrics with MTTR (Mean Time to Recovery) for a full view.
Use synthetic monitoring for critical paths and user journeys.
Continuously review and refine alert thresholds, dashboards, and detection logic.

Common Pitfalls

Relying on users or support to report incidents.
Letting noisy alerts, slow triage, or lack of visibility drive MTTD up.
Allowing false positives to inflate detection times or erode trust in alerts.
Measuring only major incidents and missing low-severity issues that indicate drift.

Signals of Success

Most incidents are detected automatically within minutes.
Teams act on alerts before customers are affected.
Alert signals have high precision and low noise.
MTTD is visible and improving as observability matures.
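
The first and third signals can be turned into numbers. The sketch below is one way to do so, assuming a 10-minute cut-off as the meaning of "within minutes" and using made-up sample counts; both the threshold and the function names are illustrative choices, not part of this standard.

```python
from datetime import timedelta

# Hypothetical (time_to_detect, detected_automatically) pairs for confirmed incidents.
DETECTIONS = [
    (timedelta(minutes=3), True),
    (timedelta(minutes=8), True),
    (timedelta(minutes=45), False),
    (timedelta(minutes=2), True),
]

# Hypothetical alert counts for the same period.
ALERTS_FIRED = 20
ALERTS_CONFIRMED = 16  # alerts that corresponded to a real incident


def auto_detected_within(detections, limit):
    """Fraction of incidents detected automatically within `limit`."""
    if not detections:
        raise ValueError("no incidents to evaluate")
    hits = sum(1 for ttd, automatic in detections if automatic and ttd <= limit)
    return hits / len(detections)


def alert_precision(confirmed, fired):
    """Share of fired alerts that pointed at a real incident."""
    if fired == 0:
        raise ValueError("no alerts fired")
    return confirmed / fired


# Assumed threshold: "within minutes" is read here as 10 minutes or less.
fast_auto_share = auto_detected_within(DETECTIONS, timedelta(minutes=10))
precision = alert_precision(ALERTS_CONFIRMED, ALERTS_FIRED)
print(f"Automatically detected within 10 min: {fast_auto_share:.0%}")
print(f"Alert precision: {precision:.0%}")
```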

[[Change Failure Rate]]
[[CoE/Engineering/Measures/Delivery Performance/Mean Time to Recovery (MTTR)]]
[[Defect Escape Rate]]
[[Quality Gate Compliance]]
[[System Health SLO Compliance]]