Standard: Systems recover quickly and fail safely

Purpose and Strategic Importance

This standard ensures systems are designed to recover quickly and fail safely, reducing the blast radius of incidents and supporting sustainable, high-velocity delivery. It embeds resilience into the architecture, not just the process.

Aligned to our "Resilience Over Uptime" and "Balance Sustainability with Speed" policies, this standard protects user experience and team wellbeing during failure scenarios. Without it, systems become brittle, outages last longer, and recovery depends on manual intervention.
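
One common way to make a dependency "fail safely" is a circuit breaker: after repeated failures, callers stop hitting the dependency and serve a degraded fallback until a cool-down has passed. The sketch below is illustrative only. The `CircuitBreaker` class, its parameters, and the thresholds are hypothetical names and values chosen for the example, not part of this standard.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical helper: stop calling a failing dependency and serve a fallback."""

    def __init__(
        self,
        max_failures: int = 3,          # assumed threshold, tune per service
        reset_after_s: float = 30.0,    # assumed cool-down, tune per service
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        """True while calls should be short-circuited to the fallback."""
        if self.opened_at is None:
            return False
        # Once the cool-down has elapsed, allow a trial call through.
        return self.clock() - self.opened_at < self.reset_after_s

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                # Open (or re-open) the circuit and restart the cool-down.
                self.opened_at = self.clock()
            return fallback()
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The fallback keeps the blast radius small: a failing dependency degrades one feature instead of cascading, and recovery happens without manual intervention once the dependency is healthy again.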

Strategic Impact

Meeting this standard improves delivery flow, reduces risk, increases system resilience, and aligns systems more closely with business needs. Over time, teams see less rework, faster time to value, and stronger system integrity.

Risks of Not Having This Standard

  • Reduced ability to respond to change or failure
  • Accumulation of technical debt or friction
  • Poor developer experience and morale
  • Decreased confidence in releases and features
  • Misalignment between technical implementation and business priorities

CMMI Maturity Model

  • Level 1 – Initial: Recovery processes are manual and rely on individual heroics.

  • Level 2 – Managed: Basic monitoring and rollback mechanisms exist.

  • Level 3 – Defined: Recovery patterns (e.g. failover, auto-rollback) are documented and applied.

  • Level 4 – Quantitatively Managed: Recovery performance (e.g. MTTR) is measured and improved.

  • Level 5 – Optimising: Resilience is engineered into systems and validated continuously through operational data.
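
Level 4 depends on measuring recovery performance. As a minimal sketch, MTTR can be computed as the mean of (resolved time minus detected time) across incidents. The `Incident` record and its field names are assumptions made for this example; real incident data will come from whatever tooling a team already uses, and teams may choose a different start point (for example, when the failure began rather than when it was detected).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    detected_at: datetime
    resolved_at: datetime


def mean_time_to_recovery(incidents: Iterable[Incident]) -> timedelta:
    """Return the average time from detection to resolution."""
    durations = [i.resolved_at - i.detected_at for i in incidents]
    if not durations:
        raise ValueError("no incidents to measure")
    # Sum timedeltas starting from a zero timedelta, then average.
    return sum(durations, timedelta()) / len(durations)
```

Tracking this value per service over time gives a trend that can be reviewed and improved, rather than a one-off number.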


Key Measures

  • Adoption metrics relevant to the standard (to be defined)
  • Quality, throughput, and system health metrics aligned to capability
  • Maturity scores based on structured assessment

Associated Policies

  • Resilience Over Uptime
  • Psychological Safety First

Associated Practices

  • Auto-scaling Infrastructure
  • Event Sourcing
  • Evolutionary Architecture
  • Immutable Infrastructure
  • Mocking and Stubbing
  • Security as Code
  • Serverless Architecture
  • Operational KPIs for Dev Teams
  • Service Mesh Implementation
  • Twelve-Factor App
  • Chaos Engineering
  • Health Checks & Readiness Probes
  • Log Correlation for RCA
  • On-Call Rotation Health Checks
  • Runbooks and Playbooks
  • Feedback Loops from Ops to Dev
  • Real-time Event Streaming
  • Blue-Green Deployments
  • Canary Releases
  • Deployment Freeze Windows
  • Container Security Scanning
  • Data Encryption-in-Transit & at-Rest
  • Secure API Gateways
  • Threat Intelligence Feeds
  • Threat Modelling Workshops
  • Vulnerability Management Dashboards
  • Load & Performance Testing
  • Shadow Testing in Production
  • Design for Failure
  • Observability-Driven Design
