Practice: Chaos Engineering

Purpose and Strategic Importance

Chaos Engineering is the practice of deliberately injecting faults into a system to test its resilience and ability to recover gracefully. Rather than waiting for failures in production, teams proactively explore weaknesses, validate assumptions, and build confidence in their systems under stress.

This practice shifts reliability from reactive firefighting to proactive design. It helps teams improve incident response, reduce downtime, and uncover hidden fragilities before they impact customers.


Description of the Practice

  • Controlled experiments introduce disruptions (e.g. latency, outages, resource exhaustion) to assess system behaviour.
  • Experiments are designed with hypotheses and expected outcomes (a sketch of the full loop follows this list).
  • Chaos is injected in a controlled, observable, and reversible manner - starting in non-production environments.
  • Insights from experiments lead to real improvements in architecture, code, or process.
  • Chaos Engineering is practised regularly, not just after incidents.
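
To make this loop concrete, the sketch below shows one way it can look in Python. The endpoint, the latency budget, and the inject_fault/remove_fault callables are illustrative placeholders for whatever tooling you use, not a prescribed implementation; the shape is the point - confirm steady state, inject, observe, always roll back.

    import time
    import urllib.request

    SERVICE_URL = "http://localhost:8080/health"  # hypothetical endpoint under test
    LATENCY_BUDGET_S = 0.5  # hypothesis: health check answers within 500 ms

    def steady_state_ok():
        # Steady-state check: the service answers 200 within the latency budget.
        start = time.monotonic()
        try:
            with urllib.request.urlopen(SERVICE_URL, timeout=5) as resp:
                healthy = resp.status == 200
        except OSError:
            return False
        return healthy and (time.monotonic() - start) < LATENCY_BUDGET_S

    def run_experiment(inject_fault, remove_fault, observe_for_s=30):
        assert steady_state_ok(), "System not healthy - do not inject chaos"
        inject_fault()                 # introduce the disruption
        try:
            time.sleep(observe_for_s)  # observe the system under stress
            held = steady_state_ok()   # test the hypothesis
        finally:
            remove_fault()             # reversible chaos: always roll back
        print("hypothesis held" if held else "hypothesis falsified - investigate")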

How to Practise It (Playbook)

1. Getting Started

  • Identify a small, stable service where you can safely begin fault injection.
  • Form a hypothesis about what should happen when part of the system fails.
  • Use open source tools (e.g. Chaos Mesh, LitmusChaos), commercial offerings (e.g. Gremlin), or platform-native features to inject faults (a minimal in-process sketch follows this list).
  • Monitor the system during and after the experiment to validate your assumptions.
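
If you want to rehearse the pattern before adopting a tool, you can fake a fault in-process. A minimal sketch with invented names (fetch_profile, the 500 ms budget); real tooling injects faults at the network or platform layer instead:

    import functools
    import random
    import time

    def with_injected_latency(max_delay_s):
        # Adds random delay before the wrapped call - a crude, in-process
        # stand-in for network latency.
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                time.sleep(random.uniform(0, max_delay_s))
                return fn(*args, **kwargs)
            return wrapper
        return decorator

    @with_injected_latency(max_delay_s=2.0)
    def fetch_profile(user_id):
        # Hypothetical dependency call; imagine an HTTP request here.
        return {"id": user_id, "name": "example"}

    # Hypothesis: callers degrade gracefully when the dependency is slow.
    start = time.monotonic()
    profile = fetch_profile(42)
    if time.monotonic() - start > 0.5:
        profile = {"id": 42, "name": "unknown"}  # degraded-mode fallback
    print(profile)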

2. Scaling and Maturing

  • Expand chaos scenarios to simulate network partitions, dependency failures, and slowdowns.
  • Run experiments in CI pipelines or dedicated chaos environments before production (sketched after this list).
  • Include business metrics in observability to understand customer impact.
  • Conduct game days to rehearse incidents with real teams in realistic scenarios.
  • Document learnings and feed them into reliability and architectural improvements.
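
Expressed as an ordinary test, a pipeline experiment can look like the sketch below. The chaos_env handle, the service names, and the 30-second budget are assumptions for illustration, not a real framework API:

    import time

    RECOVERY_BUDGET_S = 30  # hypothesis: healthy again within 30 seconds

    def test_recovers_after_dependency_restart(chaos_env):
        # chaos_env is a hypothetical handle onto the ephemeral environment
        # the pipeline provisions; kill() and healthy() are invented names.
        chaos_env.kill("payments-db")  # inject: hard-stop a dependency
        deadline = time.monotonic() + RECOVERY_BUDGET_S
        while time.monotonic() < deadline:
            if chaos_env.healthy("order-service"):
                return  # hypothesis held: recovered within budget
            time.sleep(1)
        raise AssertionError("order-service did not recover within budget")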

3. Team Behaviours to Encourage

  • Frame chaos as learning - not blame.
  • Prioritise recovery testing over perfect uptime.
  • Celebrate findings that expose fragility - they’re opportunities to improve.
  • Build a shared culture of reliability across engineering, product, and operations.

4. Watch Out For…

  • Injecting chaos into production before you’re ready - start in safe environments.
  • Running experiments without a clear hypothesis or observability (see the guard sketch after this list).
  • Blaming teams for failures surfaced by chaos experiments.
  • Treating chaos as a one-off event - it must be a sustained practice.
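
Several of these pitfalls can be guarded against mechanically: refuse to run without a hypothesis, and abort and roll back when the blast radius grows. A minimal sketch, with invented names:

    class ChaosGuard:
        # No hypothesis, no experiment; abort condition hit, roll back.
        def __init__(self, hypothesis, abort_if, rollback):
            if not hypothesis:
                raise ValueError("No hypothesis, no experiment")
            self.hypothesis = hypothesis
            self.abort_if = abort_if  # callable: True when blast radius is too big
            self.rollback = rollback  # callable: undoes the injected fault

        def observe(self):
            # Call periodically while the fault is active.
            if self.abort_if():
                self.rollback()
                raise RuntimeError("Abort condition hit - fault rolled back")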

5. Signals of Success

  • Teams routinely test their systems for resilience under failure conditions.
  • Incidents become less frequent, less severe, and more easily resolved.
  • Improvements from chaos experiments are tracked and acted upon.
  • System behaviours under stress are well understood and documented.
  • Confidence in recovery improves - teams move fast without fear.

Associated Standards

  • Changes are introduced with minimal failures and maximum resilience (CFR)
  • Data confidence levels are visible and understood at decision time
  • Infrastructure is version controlled and peer reviewed
  • Services are restored quickly and safely following failure (MTTR)
  • Systems recover quickly and fail safely
  • Teams frame and plan work around outcomes, not outputs

Associated Measures

  • Mean Time to Recovery (MTTR)
  • Mean Time to Detect (MTTD)
  • Automated Remediation Rate
  • Incident Frequency
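
These measures fall out of incident timestamps directly. A minimal sketch, assuming incidents are logged with start, detection, and resolution times (definitions vary; here MTTR runs from incident start to restoration):

    from datetime import datetime

    # Hypothetical incident records: (started, detected, resolved).
    incidents = [
        (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 4),
         datetime(2024, 1, 3, 10, 31)),
        (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 2),
         datetime(2024, 1, 9, 14, 12)),
    ]

    def mean_minutes(pairs):
        deltas = [(end - start).total_seconds() / 60 for start, end in pairs]
        return sum(deltas) / len(deltas)

    mttd = mean_minutes([(s, d) for s, d, _ in incidents])
    mttr = mean_minutes([(s, r) for s, _, r in incidents])
    print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")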
