Practice: Chaos Engineering

Purpose and Strategic Importance

Chaos Engineering is the practice of deliberately injecting faults into a system to test its resilience and ability to recover gracefully. Rather than waiting for failures in production, teams proactively explore weaknesses, validate assumptions, and build confidence in their systems under stress.

This practice shifts reliability from reactive firefighting to proactive design. It helps teams improve incident response, reduce downtime, and uncover hidden fragilities before they impact customers.


Description of the Practice

  • Controlled experiments introduce disruptions (e.g. latency, outages, resource exhaustion) to assess system behaviour.
  • Experiments are designed with hypotheses and expected outcomes (a sketch of the full loop follows this list).
  • Chaos is injected in a controlled, observable, and reversible manner - starting in non-production environments.
  • Insights from experiments lead to real improvements in architecture, code, or process.
  • Chaos Engineering is practised regularly, not just after incidents.
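
To make this loop concrete, the sketch below shows one way it can look in Python. The endpoint, the latency budget, and the inject_fault/remove_fault callables are illustrative placeholders for whatever tooling you use, not a prescribed implementation; the shape is the point - confirm steady state, inject, observe, always roll back.

    import time
    import urllib.request

    SERVICE_URL = "http://localhost:8080/health"  # hypothetical endpoint under test
    LATENCY_BUDGET_S = 0.5  # hypothesis: health check answers within 500 ms

    def steady_state_ok():
        # Steady-state check: the service answers 200 within the latency budget.
        start = time.monotonic()
        try:
            with urllib.request.urlopen(SERVICE_URL, timeout=5) as resp:
                healthy = resp.status == 200
        except OSError:
            return False
        return healthy and (time.monotonic() - start) < LATENCY_BUDGET_S

    def run_experiment(inject_fault, remove_fault, observe_for_s=30):
        assert steady_state_ok(), "System not healthy - do not inject chaos"
        inject_fault()                 # introduce the disruption
        try:
            time.sleep(observe_for_s)  # observe the system under stress
            held = steady_state_ok()   # test the hypothesis
        finally:
            remove_fault()             # reversible chaos: always roll back
        print("hypothesis held" if held else "hypothesis falsified - investigate")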

How to Practise It (Playbook)

1. Getting Started

  • Identify a small, stable service where you can safely begin fault injection.
  • Form a hypothesis about what should happen when part of the system fails.
  • Use open source tools (e.g. Chaos Mesh, LitmusChaos), commercial offerings (e.g. Gremlin), or platform-native features to inject faults (a minimal in-process sketch follows this list).
  • Monitor the system during and after the experiment to validate your assumptions.
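
If you want to rehearse the pattern before adopting a tool, you can fake a fault in-process. A minimal sketch with invented names (fetch_profile, the 500 ms budget); real tooling injects faults at the network or platform layer instead:

    import functools
    import random
    import time

    def with_injected_latency(max_delay_s):
        # Adds random delay before the wrapped call - a crude, in-process
        # stand-in for network latency.
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                time.sleep(random.uniform(0, max_delay_s))
                return fn(*args, **kwargs)
            return wrapper
        return decorator

    @with_injected_latency(max_delay_s=2.0)
    def fetch_profile(user_id):
        # Hypothetical dependency call; imagine an HTTP request here.
        return {"id": user_id, "name": "example"}

    # Hypothesis: callers degrade gracefully when the dependency is slow.
    start = time.monotonic()
    profile = fetch_profile(42)
    if time.monotonic() - start > 0.5:
        profile = {"id": 42, "name": "unknown"}  # degraded-mode fallback
    print(profile)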

2. Scaling and Maturing

  • Expand chaos scenarios to simulate network partitions, dependency failures, and slowdowns.
  • Run experiments in CI pipelines or dedicated chaos environments before production (sketched after this list).
  • Include business metrics in observability to understand customer impact.
  • Conduct game days to rehearse incidents with real teams in realistic scenarios.
  • Document learnings and feed them into reliability and architectural improvements.
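
Expressed as an ordinary test, a pipeline experiment can look like the sketch below. The chaos_env handle, the service names, and the 30-second budget are assumptions for illustration, not a real framework API:

    import time

    RECOVERY_BUDGET_S = 30  # hypothesis: healthy again within 30 seconds

    def test_recovers_after_dependency_restart(chaos_env):
        # chaos_env is a hypothetical handle onto the ephemeral environment
        # the pipeline provisions; kill() and healthy() are invented names.
        chaos_env.kill("payments-db")  # inject: hard-stop a dependency
        deadline = time.monotonic() + RECOVERY_BUDGET_S
        while time.monotonic() < deadline:
            if chaos_env.healthy("order-service"):
                return  # hypothesis held: recovered within budget
            time.sleep(1)
        raise AssertionError("order-service did not recover within budget")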

3. Team Behaviours to Encourage

  • Frame chaos as learning - not blame.
  • Prioritise recovery testing over perfect uptime.
  • Celebrate findings that expose fragility - they’re opportunities to improve.
  • Build a shared culture of reliability across engineering, product, and operations.

4. Watch Out For…

  • Injecting chaos into production before you’re ready - start in safe environments.
  • Running experiments without a clear hypothesis or observability (see the guard sketch after this list).
  • Blaming teams for failures surfaced by chaos experiments.
  • Treating chaos as a one-off event - it must be a sustained practice.
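
Several of these pitfalls can be guarded against mechanically: refuse to run without a hypothesis, and abort and roll back when the blast radius grows. A minimal sketch, with invented names:

    class ChaosGuard:
        # No hypothesis, no experiment; abort condition hit, roll back.
        def __init__(self, hypothesis, abort_if, rollback):
            if not hypothesis:
                raise ValueError("No hypothesis, no experiment")
            self.hypothesis = hypothesis
            self.abort_if = abort_if  # callable: True when blast radius is too big
            self.rollback = rollback  # callable: undoes the injected fault

        def observe(self):
            # Call periodically while the fault is active.
            if self.abort_if():
                self.rollback()
                raise RuntimeError("Abort condition hit - fault rolled back")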

5. Signals of Success

  • Teams routinely test their systems for resilience under failure conditions.
  • Incidents become less frequent, less severe, and more easily resolved.
  • Improvements from chaos experiments are tracked and acted upon.
  • System behaviours under stress are well understood and documented.
  • Confidence in recovery improves - teams move fast without fear.

Associated Standards

  • Changes are introduced with minimal failures and maximum resilience (CFR)
  • Data confidence levels are visible and understood at decision time
  • Infrastructure is version controlled and peer reviewed
  • Services are restored quickly and safely following failure (MTTR)
  • Systems recover quickly and fail safely
  • Teams frame and plan work around outcomes, not outputs

Associated Measures

  • Mean Time to Recovery (MTTR)
  • Mean Time to Detect (MTTD)
  • Automated Remediation Rate
  • Incident Frequency
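
These measures fall out of incident timestamps directly. A minimal sketch, assuming incidents are logged with start, detection, and resolution times (definitions vary; here MTTR runs from incident start to restoration):

    from datetime import datetime

    # Hypothetical incident records: (started, detected, resolved).
    incidents = [
        (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 4),
         datetime(2024, 1, 3, 10, 31)),
        (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 2),
         datetime(2024, 1, 9, 14, 12)),
    ]

    def mean_minutes(pairs):
        deltas = [(end - start).total_seconds() / 60 for start, end in pairs]
        return sum(deltas) / len(deltas)

    mttd = mean_minutes([(s, d) for s, d, _ in incidents])
    mttr = mean_minutes([(s, r) for s, _, r in incidents])
    print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")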
