Standard: Failure modes are proactively tested

Purpose and Strategic Importance

This standard ensures teams proactively test failure modes to build resilience and uncover weaknesses before they impact users. It drives a culture of engineering excellence where systems are designed to handle the unexpected gracefully.

Aligned to our "Engineering Excellence First" policy, this standard reduces downtime, improves confidence in releases, and strengthens operational readiness. Without it, failures are harder to diagnose, more costly to fix, and more likely to erode trust.

Strategic Impact

Meeting this standard improves delivery flow, reduces risk, raises system resilience, and aligns engineering work more closely with business needs. Over time, teams should see less rework, faster time to value, and stronger system integrity.

Risks of Not Having This Standard

  • Reduced ability to respond to change or failure
  • Accumulation of technical debt or friction
  • Poor developer experience and morale
  • Decreased confidence in releases and features
  • Misalignment between technical implementation and business priorities

CMMI Maturity Model

  • Level 1 – Initial: Failures are only addressed reactively.

  • Level 2 – Managed: Some testing of failure scenarios occurs during development.

  • Level 3 – Defined: Failure scenarios are documented and tested systematically.

  • Level 4 – Quantitatively Managed: Coverage and frequency of failure testing are tracked.

  • Level 5 – Optimising: Failure testing is continuous and adaptive, based on live risk and system complexity.
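As an illustration of what testing a failure scenario during development (Level 2 and above) might look like, the following is a minimal sketch rather than a prescribed implementation. It assumes a hypothetical downstream dependency, `FlakyDependency`, that fails a configurable number of times, and a hypothetical caller, `fetch_with_fallback`, that retries and then degrades to a fallback value. None of these names come from the standard itself.

```python
class FlakyDependency:
    """Hypothetical downstream client that fails a set number of times (fault injection)."""

    def __init__(self, failures_before_success: int):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def fetch(self) -> str:
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("injected failure")
        return "fresh"


def fetch_with_fallback(dep, retries: int = 2, fallback: str = "cached") -> str:
    """Try the dependency up to retries + 1 times, then degrade to a fallback value."""
    for _ in range(retries + 1):
        try:
            return dep.fetch()
        except ConnectionError:
            continue
    return fallback


def test_recovers_from_transient_failure():
    # Two injected failures, third call succeeds.
    dep = FlakyDependency(failures_before_success=2)
    assert fetch_with_fallback(dep) == "fresh"
    assert dep.calls == 3


def test_degrades_when_dependency_stays_down():
    # The dependency never recovers within the retry budget.
    dep = FlakyDependency(failures_before_success=10)
    assert fetch_with_fallback(dep) == "cached"
    assert dep.calls == 3
```

Tests of this shape can run under a test runner such as pytest. Tracking how many documented failure scenarios have a test like this is one possible way to approach the coverage measure described at Level 4.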


Key Measures

  • Adoption metrics relevant to the standard (to be defined)
  • Quality, throughput, and system health metrics aligned to capability
  • Maturity scores based on structured assessment

Associated Policies

  • Post-Incident Learning Culture

Associated Practices

  • Health Checks & Readiness Probes
  • Runbooks and Playbooks
  • Self-Healing Systems
  • Incident Response Playbooks
  • Behaviour-Driven Development (BDD)
  • Contract Testing
  • End-to-End (E2E) Testing
  • Exploratory Testing
  • Integration Testing
  • Load & Performance Testing
  • Mutation Testing
  • Non-functional Requirement Testing
  • Shadow Testing in Production
  • Test-Driven Development (TDD)
  • Visual Regression Testing

Associated Measures

  • Change Failure Rate (CFR)
  • Mean Time to Recovery (MTTR)
  • Mean Time to Detect (MTTD)
  • Automated Remediation Rate
  • Error Budget Consumption
  • Incident Frequency
  • Service Availability (Uptime)
  • Security Incident Response Time
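
Three of the measures above can be computed from simple records. The sketch below assumes commonly used definitions, which vary between organisations: Change Failure Rate as failed deployments divided by total deployments, MTTD as the mean time from failure start to detection, and MTTR as the mean time from failure start to recovery. The `Incident` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started: datetime
    detected: datetime
    resolved: datetime


def mean_delta(deltas) -> timedelta:
    """Arithmetic mean of a non-empty collection of timedeltas."""
    deltas = list(deltas)
    if not deltas:
        raise ValueError("at least one incident is required")
    return sum(deltas, timedelta()) / len(deltas)


def change_failure_rate(total_deployments: int, failed_deployments: int) -> float:
    """Fraction of deployments that caused a failure in production."""
    return failed_deployments / total_deployments


def mttd(incidents) -> timedelta:
    """Mean time from failure start to detection."""
    return mean_delta(i.detected - i.started for i in incidents)


def mttr(incidents) -> timedelta:
    """Mean time from failure start to recovery (assumed definition)."""
    return mean_delta(i.resolved - i.started for i in incidents)
```

With such functions, a team could compare these values before and after introducing regular failure testing, keeping in mind that the exact start and end points of each measure should be agreed up front.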
