
Standard: Failures are detected and surfaced proactively to enable rapid response and minimise impact

Purpose and Strategic Importance

The goal of a production system is not to never fail — it is to fail safely, recover quickly, and ensure that when failures occur, the engineering team knows before the customer does. Reactive incident management, where teams discover problems through user complaints or support tickets, represents a fundamental failure of operational maturity. This standard establishes that teams must invest in proactive failure detection: alerts that fire based on meaningful signals, on-call processes that are clearly defined and fairly distributed, and runbooks that enable any engineer on the team to respond confidently to the most common failure scenarios.

Proactive failure detection is directly linked to the organisation's ability to honour service level objectives and manage error budgets responsibly. When alerting is well-designed — combining threshold-based rules for known failure modes with anomaly detection for unexpected behaviour — teams spend less time in reactive fire-fighting and more time on deliberate improvement. Alert fatigue, one of the most damaging conditions in on-call engineering, is addressed by continuously refining signal quality: ensuring every alert is actionable, routed to the right person, and tied to a documented response procedure. Teams that operate to this standard achieve lower mean time to detect, faster recovery, and a culture where reliability is an engineered property rather than a hoped-for outcome.

Strategic Impact

  • Reduces customer impact duration by detecting and initiating response to failures before users report them
  • Enables teams to honour SLO commitments and make informed decisions about error budget consumption and feature investment
  • Builds engineering confidence in on-call responsibilities by ensuring alerts are meaningful, actionable, and supported by runbooks
  • Creates a feedback loop between production failure patterns and engineering investment, driving continuous reliability improvement

Risks of Not Having This Standard

  • Customers become the primary detection mechanism for production failures, causing reputational damage and eroding trust
  • Alert fatigue from noisy, poorly calibrated alerting causes engineers to ignore or silence alerts, including genuine critical ones
  • Inconsistent or undefined on-call processes lead to slow response, unclear ownership, and burnout among engineering teams
  • Without error budgets and SLO-aligned alerting, reliability trade-offs are made implicitly and without data-driven justification
  • Absence of runbooks means incident response depends on the specific knowledge of individual engineers, creating dangerous single points of failure

CMMI Maturity Model

Level 1 – Initial

  • People & Culture: Failures are discovered through user complaints or informal channels with no structured on-call process.
  • Process & Governance: There is no defined incident response process; individuals respond ad hoc based on availability and knowledge.
  • Technology & Tools: Basic uptime monitoring may exist but alerting is sparse, misconfigured, or relies on manual dashboard checks.
  • Measurement & Metrics: Incident frequency and recovery time are not tracked; there is no visibility into system reliability trends.

Level 2 – Managed

  • People & Culture: A designated on-call rotation exists but engineers lack clear guidance, runbooks, or escalation paths.
  • Process & Governance: Basic alerting thresholds are configured for critical services, though coverage is incomplete and inconsistent.
  • Technology & Tools: An alerting platform is in use and connected to an on-call notification tool such as PagerDuty or Opsgenie.
  • Measurement & Metrics: Mean time to detect and mean time to resolve are tracked for major incidents at the service level.

Level 3 – Defined

  • People & Culture: All services have defined owners, on-call responsibilities are clearly documented, and runbooks cover common scenarios.
  • Process & Governance: Alert routing, escalation paths, and severity definitions are standardised and applied consistently across services.
  • Technology & Tools: Alerting combines threshold-based rules with anomaly detection, and alerts are linked directly to runbooks.
  • Measurement & Metrics: Alert volume, false positive rate, and time-to-acknowledge are measured and reviewed in regular reliability meetings.
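The Level 3 practice of standardised severities, routing, and runbook links can be made concrete as an alert catalogue. The sketch below is hypothetical: the alert name, team, and runbook URL are invented examples, not references to real systems.

```python
# Illustrative alert catalogue: every alert carries a severity, an
# escalation target, and a runbook link, so the page arrives with
# the response procedure attached.
ALERT_CATALOGUE = {
    "checkout-api-high-error-rate": {
        "severity": "SEV-1",
        "route_to": "payments-oncall",
        "runbook": "https://runbooks.example.com/checkout-api/error-rate",
    },
}

def page(alert_name):
    """Render the notification an on-call engineer would receive."""
    spec = ALERT_CATALOGUE[alert_name]
    return f"[{spec['severity']}] -> {spec['route_to']} (runbook: {spec['runbook']})"
```

Holding this mapping in version-controlled configuration, rather than in individual heads, is what lets any engineer on the rotation respond confidently.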

Level 4 – Quantitatively Managed

  • People & Culture: Teams conduct structured post-incident reviews that result in improved alerting, runbooks, and system resilience.
  • Process & Governance: SLOs are defined for all critical services, and alerting is aligned to error budget burn rates rather than raw thresholds.
  • Technology & Tools: Correlation and deduplication tooling reduces alert noise, ensuring on-call engineers receive focused, meaningful signals.
  • Measurement & Metrics: Error budget consumption, SLO compliance, and alert signal-to-noise ratio are tracked and reported as reliability KPIs.

Level 5 – Optimising

  • People & Culture: Engineering teams proactively invest in reliability improvements driven by error budget data and incident learning cycles.
  • Process & Governance: Alerting configuration and runbooks are continuously refined based on incident retrospectives and changing system behaviour.
  • Technology & Tools: Predictive alerting and automated remediation handle a growing proportion of known failure modes without human intervention.
  • Measurement & Metrics: Reliability metrics are aggregated across services and used to inform platform strategy, staffing, and architectural investment.

Key Measures

  • Mean time to detect (MTTD) for production failures, measured from first symptom to alert firing
  • Mean time to acknowledge (MTTA) for high-severity alerts, measured against defined SLA thresholds
  • Alert false positive rate, tracking the percentage of alerts that do not correspond to a genuine actionable issue
  • Percentage of services with defined SLOs, configured error budget alerts, and linked runbooks in place
  • Error budget burn rate across critical services over rolling 30-day windows
  • Number of incidents where the first notification came from a customer or support channel rather than the alerting system
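Several of the measures above fall out of the same incident record. The sketch below computes MTTD, MTTA, and the alert false positive rate from hypothetical incident data; the timestamps (seconds from first symptom) and the `actionable` flag are invented for illustration.

```python
from statistics import mean

# Hypothetical incident records: offsets in seconds from the first
# symptom to the alert firing and to acknowledgement, plus whether
# the alert corresponded to a genuine actionable issue.
incidents = [
    {"symptom": 0, "alert": 120, "ack": 300, "actionable": True},
    {"symptom": 0, "alert": 60,  "ack": 240, "actionable": True},
    {"symptom": 0, "alert": 600, "ack": 660, "actionable": False},
]

# MTTD: first symptom to alert firing, per the definition above.
mttd = mean(i["alert"] - i["symptom"] for i in incidents)

# MTTA: alert firing to acknowledgement by the on-call engineer.
mtta = mean(i["ack"] - i["alert"] for i in incidents)

# False positive rate: share of alerts that were not actionable.
false_positive_rate = sum(not i["actionable"] for i in incidents) / len(incidents)
```

In practice these records would come from the alerting platform's API rather than a hand-built list, but the arithmetic is the same.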

Associated Policies
  • Engineering Excellence First
