This standard ensures systems are designed to recover quickly and fail safely, reducing the blast radius of incidents and supporting sustainable, high-velocity delivery. It embeds resilience into the architecture, not just the process.
Aligned to our "Resilience Over Uptime" and "Balance Sustainability with Speed" policies, this standard protects user experience and team wellbeing during failure scenarios. Without it, systems become brittle, outages last longer, and recovery depends on manual intervention.
Meeting this standard improves delivery flow, reduces risk, increases system resilience, and aligns systems more closely with business needs. Over time, teams will see reduced rework, faster time to value, and stronger system integrity.
Level 1 – Initial: Recovery processes are manual and rely on individual heroics.
Level 2 – Managed: Basic monitoring and rollback mechanisms exist.
Level 3 – Defined: Recovery patterns (e.g. failover, auto-rollback) are documented and applied.
Level 4 – Quantitatively Managed: Recovery performance (e.g. MTTR) is measured and improved.
Level 5 – Optimising: Resilience is engineered into systems and validated continuously through operational data.
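
Level 4 depends on recovery performance being measurable. As a minimal sketch, MTTR can be computed as the mean of recovery time minus detection time across incidents. The `Incident` record, its field names, and the sample timestamps below are illustrative assumptions, not part of this standard or of any particular incident tool; teams may also define the start of the clock differently (for example, from the start of impact rather than from detection).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    # Hypothetical record shape; real incident tooling will differ.
    detected_at: datetime
    recovered_at: datetime


def mean_time_to_recover(incidents: list[Incident]) -> timedelta:
    """Return the average of (recovered_at - detected_at) across incidents."""
    if not incidents:
        raise ValueError("MTTR is undefined when there are no incidents")
    total = sum((i.recovered_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


# Illustrative sample data only: one 30-minute and one 90-minute recovery.
incidents = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    Incident(datetime(2024, 2, 2, 14, 0), datetime(2024, 2, 2, 15, 30)),
]
print(mean_time_to_recover(incidents))  # prints 1:00:00
```

Tracking this figure per service over time gives a team a baseline against which improvements to rollback and failover can be judged.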