
Standard: Failures are detected and surfaced proactively to enable rapid response and minimise impact

Purpose and Strategic Importance

The goal of a production system is not to never fail — it is to fail safely, recover quickly, and ensure that when failures occur, the engineering team knows before the customer does. Reactive incident management, where teams discover problems through user complaints or support tickets, represents a fundamental failure of operational maturity. This standard establishes that teams must invest in proactive failure detection: alerts that fire based on meaningful signals, on-call processes that are clearly defined and fairly distributed, and runbooks that enable any engineer on the team to respond confidently to the most common failure scenarios.

Proactive failure detection is directly linked to the organisation's ability to honour service level objectives and manage error budgets responsibly. When alerting is well-designed — combining threshold-based rules for known failure modes with anomaly detection for unexpected behaviour — teams spend less time in reactive fire-fighting and more time on deliberate improvement. Alert fatigue, one of the most damaging conditions in on-call engineering, is addressed by continuously refining signal quality: ensuring every alert is actionable, routed to the right person, and tied to a documented response procedure. Teams that operate to this standard achieve lower mean time to detect, faster recovery, and a culture where reliability is an engineered property rather than a hoped-for outcome.

Strategic Impact

  • Reduces customer impact duration by detecting and initiating response to failures before users report them
  • Enables teams to honour SLO commitments and make informed decisions about error budget consumption and feature investment
  • Builds engineering confidence in on-call responsibilities by ensuring alerts are meaningful, actionable, and supported by runbooks
  • Creates a feedback loop between production failure patterns and engineering investment, driving continuous reliability improvement

Risks of Not Having This Standard

  • Customers become the primary detection mechanism for production failures, causing reputational damage and eroding trust
  • Alert fatigue from noisy, poorly calibrated alerting causes engineers to ignore or silence alerts, including genuine critical ones
  • Inconsistent or undefined on-call processes lead to slow response, unclear ownership, and burnout among engineering teams
  • Without error budgets and SLO-aligned alerting, reliability trade-offs are made implicitly and without data-driven justification
  • Absence of runbooks means incident response depends on the specific knowledge of individual engineers, creating dangerous single points of failure

CMMI Maturity Model

Level 1 – Initial

  • People & Culture: Failures are discovered through user complaints or informal channels with no structured on-call process.
  • Process & Governance: There is no defined incident response process; individuals respond ad hoc based on availability and knowledge.
  • Technology & Tools: Basic uptime monitoring may exist but alerting is sparse, misconfigured, or relies on manual dashboard checks.
  • Measurement & Metrics: Incident frequency and recovery time are not tracked; there is no visibility into system reliability trends.

Level 2 – Managed

  • People & Culture: A designated on-call rotation exists but engineers lack clear guidance, runbooks, or escalation paths.
  • Process & Governance: Basic alerting thresholds are configured for critical services, though coverage is incomplete and inconsistent.
  • Technology & Tools: An alerting platform is in use and connected to an on-call notification tool such as PagerDuty or Opsgenie.
  • Measurement & Metrics: Mean time to detect and mean time to resolve are tracked for major incidents at the service level.

Level 3 – Defined

  • People & Culture: All services have defined owners, on-call responsibilities are clearly documented, and runbooks cover common scenarios.
  • Process & Governance: Alert routing, escalation paths, and severity definitions are standardised and applied consistently across services.
  • Technology & Tools: Alerting combines threshold-based rules with anomaly detection, and alerts are linked directly to runbooks.
  • Measurement & Metrics: Alert volume, false positive rate, and time-to-acknowledge are measured and reviewed in regular reliability meetings.
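The Level 3 practice of standardised severities, routing, and runbook links can be made concrete as an alert catalogue. The sketch below is hypothetical: the alert name, team, and runbook URL are invented examples, not references to real systems.

```python
# Illustrative alert catalogue: every alert carries a severity, an
# escalation target, and a runbook link, so the page arrives with
# the response procedure attached.
ALERT_CATALOGUE = {
    "checkout-api-high-error-rate": {
        "severity": "SEV-1",
        "route_to": "payments-oncall",
        "runbook": "https://runbooks.example.com/checkout-api/error-rate",
    },
}

def page(alert_name):
    """Render the notification an on-call engineer would receive."""
    spec = ALERT_CATALOGUE[alert_name]
    return f"[{spec['severity']}] -> {spec['route_to']} (runbook: {spec['runbook']})"
```

Holding this mapping in version-controlled configuration, rather than in individual heads, is what lets any engineer on the rotation respond confidently.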

Level 4 – Quantitatively Managed

  • People & Culture: Teams conduct structured post-incident reviews that result in improved alerting, runbooks, and system resilience.
  • Process & Governance: SLOs are defined for all critical services, and alerting is aligned to error budget burn rates rather than raw thresholds.
  • Technology & Tools: Correlation and deduplication tooling reduces alert noise, ensuring on-call engineers receive focused, meaningful signals.
  • Measurement & Metrics: Error budget consumption, SLO compliance, and alert signal-to-noise ratio are tracked and reported as reliability KPIs.

Level 5 – Optimising

  • People & Culture: Engineering teams proactively invest in reliability improvements driven by error budget data and incident learning cycles.
  • Process & Governance: Alerting configuration and runbooks are continuously refined based on incident retrospectives and changing system behaviour.
  • Technology & Tools: Predictive alerting and automated remediation handle a growing proportion of known failure modes without human intervention.
  • Measurement & Metrics: Reliability metrics are aggregated across services and used to inform platform strategy, staffing, and architectural investment.

Key Measures

  • Mean time to detect (MTTD) for production failures, measured from first symptom to alert firing
  • Mean time to acknowledge (MTTA) for high-severity alerts, measured against defined SLA thresholds
  • Alert false positive rate, tracking the percentage of alerts that do not correspond to a genuine actionable issue
  • Percentage of services with defined SLOs, configured error budget alerts, and linked runbooks in place
  • Error budget burn rate across critical services over rolling 30-day windows
  • Number of incidents where the first notification came from a customer or support channel rather than the alerting system
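Several of the measures above fall out of the same incident record. The sketch below computes MTTD, MTTA, and the alert false positive rate from hypothetical incident data; the timestamps (seconds from first symptom) and the `actionable` flag are invented for illustration.

```python
from statistics import mean

# Hypothetical incident records: offsets in seconds from the first
# symptom to the alert firing and to acknowledgement, plus whether
# the alert corresponded to a genuine actionable issue.
incidents = [
    {"symptom": 0, "alert": 120, "ack": 300, "actionable": True},
    {"symptom": 0, "alert": 60,  "ack": 240, "actionable": True},
    {"symptom": 0, "alert": 600, "ack": 660, "actionable": False},
]

# MTTD: first symptom to alert firing, per the definition above.
mttd = mean(i["alert"] - i["symptom"] for i in incidents)

# MTTA: alert firing to acknowledgement by the on-call engineer.
mtta = mean(i["ack"] - i["alert"] for i in incidents)

# False positive rate: share of alerts that were not actionable.
false_positive_rate = sum(not i["actionable"] for i in incidents) / len(incidents)
```

In practice these records would come from the alerting platform's API rather than a hand-built list, but the arithmetic is the same.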

Associated Policies
  • Engineering Excellence First
