Standard: Mean Time to Detect (MTTD)
Description
Mean Time to Detect (MTTD) measures the average time it takes for a team to become aware of a defect, incident, or failure after it occurs in production. It is a key indicator of how well teams monitor, observe, and respond to issues in real time.
A lower MTTD reflects mature observability, well-defined alerting, and responsive feedback loops, all of which are essential for building high-quality, resilient systems.
How to Use
What to Measure
- Time from when a fault, failure, or defect occurs in production to when the team is alerted or becomes aware.
- Include only confirmed, real incidents (filter out false positives).
- Segment by system, severity, and environment (e.g. staging vs production).
MTTD = Total Detection Time Across Incidents / Number of Incidents
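As a minimal sketch of the formula above, the calculation can be expressed in a few lines of Python. The field names `fault_started`, `detected`, and `confirmed` are illustrative assumptions, not fields of any particular incident tool.

```python
from datetime import datetime, timedelta

def mean_time_to_detect(incidents):
    """Return MTTD as a timedelta, or None if there are no confirmed incidents.

    Each incident is a dict with hypothetical keys:
    'fault_started' and 'detected' (datetime), and 'confirmed' (bool).
    """
    # Only confirmed, real incidents count; false positives are dropped.
    confirmed = [i for i in incidents if i["confirmed"]]
    if not confirmed:
        return None
    total = sum(
        (i["detected"] - i["fault_started"] for i in confirmed),
        timedelta(),
    )
    return total / len(confirmed)

incidents = [
    {"fault_started": datetime(2024, 1, 1, 10, 0),
     "detected": datetime(2024, 1, 1, 10, 4), "confirmed": True},
    {"fault_started": datetime(2024, 1, 2, 9, 0),
     "detected": datetime(2024, 1, 2, 9, 20), "confirmed": True},
    # False positive: excluded from the calculation.
    {"fault_started": datetime(2024, 1, 3, 8, 0),
     "detected": datetime(2024, 1, 3, 8, 1), "confirmed": False},
]

mttd = mean_time_to_detect(incidents)
print(mttd)  # 0:12:00, i.e. (4 min + 20 min) / 2
```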
You may also break this down by:
- Alert type (manual, automated)
- Source of detection (customer report, monitoring, logs)
- Detection delay by severity level
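The breakdowns listed above amount to grouping detection delays by a chosen attribute and averaging each group. The sketch below assumes each incident already carries a precomputed `detection_minutes` value alongside hypothetical `severity` and `source` fields.

```python
from collections import defaultdict
from statistics import mean

def mttd_by(incidents, key):
    """Average detection delay (in minutes) for each distinct value of incident[key]."""
    delays = defaultdict(list)
    for incident in incidents:
        delays[incident[key]].append(incident["detection_minutes"])
    return {group: mean(values) for group, values in delays.items()}

# Illustrative data only.
incidents = [
    {"severity": "high", "source": "monitoring", "detection_minutes": 3},
    {"severity": "high", "source": "customer report", "detection_minutes": 45},
    {"severity": "low", "source": "logs", "detection_minutes": 120},
    {"severity": "low", "source": "monitoring", "detection_minutes": 10},
]

by_severity = mttd_by(incidents, "severity")
by_source = mttd_by(incidents, "source")
print(by_severity)
print(by_source)
```

A large gap between the `monitoring` group and the `customer report` group is often the clearest sign of where automated detection is missing.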
Instrumentation Tips
- Integrate monitoring tools (e.g. Prometheus, Datadog, New Relic) with alerting platforms (e.g. PagerDuty, Opsgenie).
- Tag incidents with timestamped "fault started" and "detected" fields.
- Track whether detection was automatic, customer-reported, or discovered later.
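One possible shape for an incident record that captures the fields described above is sketched here. The class and field names are hypothetical; map them to whatever your incident tool actually exposes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum

class DetectionSource(Enum):
    AUTOMATIC = "automatic"
    CUSTOMER_REPORTED = "customer-reported"
    DISCOVERED_LATER = "discovered-later"

@dataclass(frozen=True)
class IncidentRecord:
    incident_id: str
    fault_started: datetime
    detected: datetime
    source: DetectionSource

    def __post_init__(self):
        # Timezone-aware timestamps avoid skew when systems log in different zones.
        if self.fault_started.tzinfo is None or self.detected.tzinfo is None:
            raise ValueError("timestamps must be timezone-aware")
        if self.detected < self.fault_started:
            raise ValueError("detected cannot precede fault_started")

    @property
    def detection_seconds(self) -> float:
        return (self.detected - self.fault_started).total_seconds()

record = IncidentRecord(
    incident_id="INC-0001",  # illustrative identifier
    fault_started=datetime(2024, 3, 5, 14, 0, tzinfo=timezone.utc),
    detected=datetime(2024, 3, 5, 14, 7, tzinfo=timezone.utc),
    source=DetectionSource.AUTOMATIC,
)
print(record.detection_seconds)  # 420.0
```

Validating at creation time keeps bad timestamps out of the MTTD calculation rather than silently producing negative delays.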
Why It Matters
- Reduces user impact: Early detection shortens time to containment and resolution.
- Validates observability: Demonstrates how well your systems "tell you when they’re broken".
- Improves team response: Enables faster triage and recovery through real-time visibility.
- Supports root cause analysis: Faster detection leads to more accurate diagnosis and less disruption.
Best Practices
- Define and monitor SLIs that reflect real system behaviour.
- Manage alert fatigue so that real issues are not missed or ignored.
- Pair detection metrics with MTTR (Mean Time to Recovery) for a full view.
- Use synthetic monitoring for critical paths and user journeys.
- Continuously review and refine alert thresholds, dashboards, and detection logic.
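To illustrate the synthetic monitoring practice above, here is a rough sketch of a single probe run that records a timestamped result. The function name and the `check` callable are assumptions; in practice the check would exercise a critical path (for example, a login or checkout request) and a scheduler would run it at a fixed interval.

```python
import time
from typing import Callable

def run_synthetic_check(check: Callable[[], bool],
                        now: Callable[[], float] = time.time) -> dict:
    """Run one probe and return a timestamped result.

    `check` is a hypothetical callable that returns True when the
    critical path works. Any exception is treated as a failed check.
    """
    checked_at = now()
    try:
        ok = bool(check())
    except Exception:
        ok = False
    return {"checked_at": checked_at, "ok": ok}

# Stand-in check; replace with a real request against a critical user journey.
result = run_synthetic_check(lambda: True)
print(result["ok"])  # True
```

The first failed result gives a usable "detected" timestamp even when no real user has hit the fault yet.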
Common Pitfalls
- Relying on users or support to report incidents.
- High MTTD caused by noisy alerts, slow triage, or lack of visibility.
- False positives skewing MTTD figures or lowering trust in alerts.
- Measuring only major incidents, missing low-severity issues that indicate drift.
Signals of Success
- Most incidents are detected automatically within minutes.
- Teams act on alerts before customers are affected.
- Alert signals have high precision and low noise.
- MTTD is visible and improving as observability matures.
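Two of the signals above can be tracked as simple ratios. The sketch below is illustrative: the five-minute threshold, the field names, and the alert counts are assumptions to be replaced with your own targets and data.

```python
def alert_precision(confirmed_alerts: int, total_alerts: int):
    """Share of fired alerts that corresponded to real incidents."""
    if total_alerts == 0:
        return None
    return confirmed_alerts / total_alerts

def share_auto_detected_within(incidents, threshold_minutes):
    """Share of incidents detected automatically within the threshold."""
    if not incidents:
        return None
    hits = sum(
        1 for i in incidents
        if i["automatic"] and i["detection_minutes"] <= threshold_minutes
    )
    return hits / len(incidents)

# Illustrative data only.
incidents = [
    {"automatic": True, "detection_minutes": 2},
    {"automatic": True, "detection_minutes": 4},
    {"automatic": False, "detection_minutes": 3},   # customer-reported
    {"automatic": True, "detection_minutes": 30},   # automatic but slow
]

share = share_auto_detected_within(incidents, threshold_minutes=5)
precision = alert_precision(confirmed_alerts=45, total_alerts=50)
print(share, precision)  # 0.5 0.9
```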
Related Measures
- [[Change Failure Rate]]
- [[CoE/Engineering/Measures/Delivery Performance/Mean Time to Recovery (MTTR)]]
- [[Defect Escape Rate]]
- [[Quality Gate Compliance]]
- [[System Health SLO Compliance]]