
Policy: Resilience Over Uptime

Commitment to Building Systems That Recover, Adapt, and Thrive
100% uptime is an illusion, but resilience is a necessity. Rather than chasing unrealistic availability targets, we prioritise building resilient systems that gracefully degrade, recover quickly, and adapt to failures without disrupting business operations.
We recognise that failures are inevitable, but their impact should not be catastrophic. By embedding resilience into our system architectures, we ensure that our customers experience reliability, even when things go wrong.

What This Means
Teams must design for failure, not just availability. Instead of focusing solely on uptime metrics, we ensure that systems can withstand, recover from, and adapt to disruptions.

Our commitment to Resilience Over Uptime is built on:

  • Graceful Degradation & Fault Isolation – Systems are designed to fail safely, ensuring that issues in one area do not bring down entire services.
  • Automated Recovery & Self-Healing – We implement auto-scaling, failover mechanisms, and automated incident response to reduce downtime and manual intervention.
  • Chaos Engineering & Failure Testing – We proactively test how systems behave under failure scenarios, ensuring preparedness for real-world incidents.
  • Observability & Incident Response Readiness – We ensure that real-time monitoring, alerting, and response mechanisms enable teams to detect and address issues before they impact customers.
  • Resilience as a Shared Responsibility – Resilience is embedded into engineering, operations, and business continuity planning, ensuring that reliability is not just an operational afterthought.
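As one illustration of graceful degradation and fault isolation, the sketch below shows a minimal circuit breaker that stops calling a failing dependency and serves a fallback instead. It is a sketch under stated assumptions, not a prescribed implementation: the `CircuitBreaker` class, its `max_failures` and `reset_after` parameters, and the fallback values are all hypothetical names chosen for this example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical minimal circuit breaker; names and thresholds are illustrative."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds before a trial call
        self.clock = clock                # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        """Return True while calls to the dependency should be skipped."""
        if self.opened_at is None:
            return False
        # Once the cool-down has elapsed, allow a single trial call (half-open).
        return self.clock() - self.opened_at < self.reset_after

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        """Run primary unless the breaker is open; degrade to fallback on failure."""
        if self.is_open():
            return fallback()
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller might wrap a recommendations lookup as `breaker.call(fetch_live, lambda: cached_defaults)`, where both callables are hypothetical, so that an outage in that one dependency degrades a single feature rather than the whole service.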

Why This Matters
Chasing perfect uptime leads to brittle systems that break under pressure. By prioritising resilience over uptime, we:

  • Ensure customers experience reliable service, even in the face of failures.
  • Reduce the risk and impact of outages through proactive resilience engineering.
  • Empower teams to operate with confidence, knowing systems are designed to recover gracefully.
  • Enable faster incident resolution and continuous improvement through real-world testing.

Our Expectation
All teams must design for resilience, not just uptime, ensuring that failure scenarios are considered from the start. Leaders must foster a culture of continuous resilience improvement, encouraging teams to test, refine, and enhance system reliability.

To support this policy, resilience engineering frameworks, automated recovery mechanisms, and continuous failure testing practices will be embedded into our engineering workflows, so that teams have the tools, insights, and strategies to maintain reliability. By making Resilience Over Uptime a core principle, we ensure that our systems are robust, adaptive, and always prepared, delivering Better Value Sooner Safer Happier.

This policy shifts the focus from fragile uptime guarantees to robust, resilient systems that recover and adapt.

Associated Standards
  • Changes are introduced with minimal failures and maximum resilience (change failure rate, CFR).
  • Operational readiness is tested before every major release.
  • Services are restored quickly and safely following failure (mean time to restore, MTTR).
  • Systems recover quickly and fail safely.
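
The CFR and MTTR standards above can be tracked with simple arithmetic. The sketch below assumes the commonly used definitions (CFR as the share of deployments that caused a failure, MTTR as the average time from incident start to restoration); this policy does not itself specify the formulas, and the `Deployment` and `Incident` record types are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record: did this change cause a failure in production?"""
    caused_failure: bool


@dataclass(frozen=True)
class Incident:
    """Hypothetical record: when service degraded and when it was restored."""
    started: datetime
    restored: datetime


def change_failure_rate(deployments: Sequence[Deployment]) -> float:
    """Fraction of deployments that caused a failure (0.0 when there are none)."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d.caused_failure)
    return failed / len(deployments)


def mean_time_to_restore(incidents: Sequence[Incident]) -> timedelta:
    """Average duration from incident start to restoration (zero when there are none)."""
    if not incidents:
        return timedelta(0)
    total = sum((i.restored - i.started for i in incidents), timedelta(0))
    return total / len(incidents)
```

For example, four deployments of which one caused a failure give a CFR of 0.25, and two incidents lasting 30 and 90 minutes give an MTTR of one hour.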

