Practice: Root Cause Analysis (RCA)
Purpose and Strategic Importance
Root Cause Analysis (RCA) is the process of systematically identifying the underlying reasons why an incident or failure occurred. It helps teams move beyond surface-level symptoms to understand systemic causes and implement lasting improvements.
RCA supports a culture of learning and continuous improvement. When practiced effectively and blamelessly, it transforms incidents into valuable opportunities for growth and resilience.
Description of the Practice
- RCA seeks to answer not just what failed, but why it failed, and why that failure wasn’t caught or mitigated sooner.
- It is often performed after major incidents, outages, or repeated issues.
- Techniques include the “5 Whys”, Fishbone Diagrams, Causal Trees, and Fault Tree Analysis.
- RCAs are documented and shared to inform future design, monitoring, and operational improvements.
- The goal is not to assign blame, but to understand contributing factors across people, process, and technology.
How to Practise It (Playbook)
1. Getting Started
- Trigger an RCA when defined thresholds are met (e.g. Sev 1 incidents, recurring alerts).
- Use a structured template:
  - Summary of what happened
  - Timeline of events
  - Immediate and contributing causes
  - Missed detection or mitigation steps
  - Actionable improvements
- Facilitate collaboratively and cross-functionally—include those closest to the incident.
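The trigger rule and template above can be sketched in code, which helps when RCA drafts are created or checked by tooling. This is a minimal illustration under stated assumptions, not a prescribed implementation: the names RcaReport, missing_sections and should_trigger_rca are hypothetical, severity 1 is assumed to be the most severe level, and the default of three repeats for a "recurring" alert is an arbitrary placeholder that each team would set for itself.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class RcaReport:
    """One RCA write-up; field names mirror the template sections (hypothetical)."""
    summary: str = ""
    timeline: List[str] = field(default_factory=list)
    immediate_causes: List[str] = field(default_factory=list)
    contributing_causes: List[str] = field(default_factory=list)
    missed_detection: List[str] = field(default_factory=list)
    improvements: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of template sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


def should_trigger_rca(severity: int, alert_repeats: int,
                       repeat_threshold: int = 3) -> bool:
    """Assumed rule: Sev 1 always triggers; otherwise a recurring alert does."""
    return severity == 1 or alert_repeats >= repeat_threshold


# Illustrative data only, not taken from a real incident.
draft = RcaReport(summary="Checkout API outage",
                  timeline=["10:02 first alert fired"])
print(draft.missing_sections())
print(should_trigger_rca(severity=3, alert_repeats=4))
```

A check like missing_sections could run before an RCA is marked complete, so that reports which skip contributing causes or improvements are caught early.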
2. Scaling and Maturing
- Use the “5 Whys” or “Causal Tree” method to dig beneath the obvious.
- Store RCA reports in a searchable repository and reference them in architectural discussions.
- Link RCA outcomes to improvements in monitoring, testing, automation, or training.
- Review RCA quality in retros and operational health reviews.
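To make the “5 Whys” and “Causal Tree” methods concrete, the sketch below models each answer to "why?" as a node whose children are its own causes. The leaves of the tree are the candidate root causes, and the depth of the tree gives a rough sense of whether the analysis went beyond the obvious. The Cause class, the helper functions, and the example incident are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cause:
    """A node in a causal tree: a statement plus the causes behind it."""
    description: str
    because: List["Cause"] = field(default_factory=list)


def root_causes(node: Cause) -> List[str]:
    """Collect leaf descriptions: causes with no deeper 'why' recorded."""
    if not node.because:
        return [node.description]
    roots: List[str] = []
    for child in node.because:
        roots.extend(root_causes(child))
    return roots


def depth(node: Cause) -> int:
    """Number of levels in the tree, counting the symptom as level 1."""
    return 1 + max((depth(child) for child in node.because), default=0)


# Hypothetical incident, invented for this example.
tree = Cause("Checkout API returned errors", [
    Cause("Database connection pool exhausted", [
        Cause("Retry storm from a client", [
            Cause("Client had no backoff policy"),
        ]),
        Cause("Pool saturation was not alerted on", [
            Cause("Alerting standards do not cover connection pools"),
        ]),
    ]),
])

print(root_causes(tree))
print(depth(tree))
```

A tree with more than one branch also makes it visible that most incidents have several contributing causes, covering both why the failure happened and why it was not detected sooner.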
3. Team Behaviours to Encourage
- Focus on learning, not fault-finding—make it safe to surface mistakes.
- Discuss contributing factors in terms of systems, not individuals.
- Reflect on cultural or organisational enablers (e.g. underinvestment, poor handovers).
- Encourage teams to share RCA outcomes across squads to prevent recurrence.
4. Watch Out For…
- Shallow analysis that stops at symptoms (e.g. “someone forgot”).
- Postmortems that focus on the timeline but skip the root causes.
- Failure to follow through on improvement actions.
- Blaming tools, individuals, or teams instead of surfacing system gaps.
5. Signals of Success
- RCA findings lead to meaningful, lasting changes in systems or processes.
- Action items from RCAs are tracked through to completion.
- Repeat incidents are rare, and incident impact is reduced over time.
- Teams trust the process and participate openly and constructively.
- RCA outputs influence platform investments, architecture, and automation strategies.
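Two of these signals can be measured directly if RCA reports are stored in a structured form. The sketch below is one possible approach, assuming each action item carries a status string where "done" means complete, and each RCA is tagged with a short root-cause label. Both conventions, and the function names, are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def action_completion_rate(statuses: Iterable[str]) -> float:
    """Fraction of action items marked 'done'; 0.0 when there are none."""
    items = list(statuses)
    if not items:
        return 0.0
    return sum(1 for status in items if status == "done") / len(items)


def repeat_root_causes(root_cause_tags: Iterable[str]) -> List[str]:
    """Root-cause tags that appear in more than one RCA, sorted by name."""
    counts = Counter(root_cause_tags)
    return sorted(tag for tag, seen in counts.items() if seen > 1)


# Illustrative data only.
print(action_completion_rate(["done", "open", "done", "done"]))
print(repeat_root_causes(["missing-alert", "no-backoff", "missing-alert"]))
```

Numbers like these are best treated as prompts for discussion in operational health reviews rather than as targets, since the qualitative signals (trust in the process, open participation) cannot be computed this way.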