Standard: Incidents drive platform and service improvements, not just RCA reports

Purpose and Strategic Importance

This standard ensures that incidents lead to meaningful improvements in services, platforms, or practices—not just retrospective documentation. The purpose of an incident review is not only to understand what happened, but to make sure it does not happen again.

It supports the policy “Run to Stop When Problems Arise” by shifting the focus from post-mortem reporting to actual learning and engineering action. When teams apply what they’ve learned, they reduce recurrence risk, strengthen systemic resilience, and build a culture of continuous improvement.

Strategic Impact

  • Moves incident response from reactive to proactive, closing learning loops
  • Reduces recurrence of avoidable issues through platform and process changes
  • Drives structural improvement across the tech estate (e.g. tooling, patterns, architecture)
  • Improves operational excellence and reduces operational toil over time
  • Demonstrates that time spent on incident management leads to tangible progress

Risks of Not Having This Standard

  • RCA documents are produced, but no changes are made to prevent recurrence
  • Engineers become disillusioned with incident processes that fail to drive change
  • Systemic issues remain unaddressed, degrading reliability over time
  • Platforms accumulate operational debt due to inaction
  • Incident reviews become compliance rituals, not learning moments

CMMI Maturity Model

Level 1 – Initial

Category - Description
People & Culture - Incidents are handled in isolation, with little follow-up or shared learning.
Process & Governance - Post-incident reviews are ad hoc or skipped entirely.
Technology & Tools - No tooling to track follow-up actions from incidents.
Measurement & Metrics - No visibility into whether root causes are addressed.

Level 2 – Managed

Category - Description
People & Culture - Some teams hold basic incident reviews, but follow-through varies.
Process & Governance - Basic RCAs are written, and action items may be logged manually.
Technology & Tools - Shared documents are used to capture issues, but are not connected to engineering backlogs.
Measurement & Metrics - Incident and RCA volumes are counted, but the impact and closure of improvement actions are not tracked.

Level 3 – Defined

Category - Description
People & Culture - Engineers treat incident analysis as a valuable opportunity to improve the system.
Process & Governance - Incident reviews include clear root causes, contributing factors, and improvement actions.
Technology & Tools - Incident tools integrate with engineering backlogs to track follow-ups to closure.
Measurement & Metrics - Closure rates of improvement actions and recurrence rates are measured and reviewed.
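
As an illustration of the Level 3 measures, the sketch below shows one way closure and recurrence rates could be computed once follow-up actions are linked to backlog tickets. The `Action` and `Incident` classes, the `backlog_key` field, and the `failure_mode` label used to group similar incidents are all hypothetical names, not part of any specific incident tool.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Action:
    """One improvement action raised in an incident review (hypothetical model)."""
    summary: str
    backlog_key: Optional[str] = None  # assumed: ticket ID in the engineering backlog
    closed: bool = False


@dataclass
class Incident:
    incident_id: str
    failure_mode: str  # assumed label used to decide which incidents are "similar"
    actions: list[Action] = field(default_factory=list)


def closure_rate(incidents: list[Incident]) -> float:
    """Share of all improvement actions that are closed (0.0 if there are none)."""
    actions = [a for i in incidents for a in i.actions]
    if not actions:
        return 0.0
    return sum(a.closed for a in actions) / len(actions)


def recurrence_rate(incidents: list[Incident]) -> float:
    """Share of incidents whose failure mode already appeared earlier in the list.

    Assumes the list is in chronological order.
    """
    seen: set[str] = set()
    repeats = 0
    for incident in incidents:
        if incident.failure_mode in seen:
            repeats += 1
        seen.add(incident.failure_mode)
    return repeats / len(incidents) if incidents else 0.0


def untracked_actions(incidents: list[Incident]) -> list[Action]:
    """Open actions that have no backlog ticket and so cannot be tracked to closure."""
    return [
        a for i in incidents for a in i.actions
        if not a.closed and a.backlog_key is None
    ]
```

Listing the untracked actions separately is a deliberate choice in this sketch: an open action without a backlog ticket is the gap that separates Level 2 from Level 3.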

Level 4 – Quantitatively Managed

Category - Description
People & Culture - Teams prioritise systemic improvements over superficial fixes.
Process & Governance - Incident improvement actions are time-bound, prioritised, and visible to stakeholders.
Technology & Tools - Dashboards show action item completion, RCA quality, and incident-to-resolution time trends.
Measurement & Metrics - Action effectiveness is reviewed to determine if similar incidents are decreasing in frequency and severity.
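
One possible way to review action effectiveness is to compare similar incidents in equal time windows before and after an improvement action was closed. The sketch below assumes a hypothetical `IncidentRecord` type, a numeric severity scale where a higher number means more severe, and a 90-day window. All three are illustrative choices rather than prescribed values.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from statistics import mean


@dataclass
class IncidentRecord:
    occurred_on: date
    severity: int  # assumed scale: higher number = more severe


@dataclass
class WindowSummary:
    count: int
    mean_severity: float  # 0.0 when the window contains no incidents


def summarise(records: list[IncidentRecord]) -> WindowSummary:
    """Count the incidents in a window and average their severity."""
    if not records:
        return WindowSummary(0, 0.0)
    return WindowSummary(len(records), float(mean(r.severity for r in records)))


def compare_around(
    records: list[IncidentRecord],
    action_closed_on: date,
    window_days: int = 90,
) -> tuple[WindowSummary, WindowSummary]:
    """Summarise incidents in equal-length windows before and after the closure date."""
    window = timedelta(days=window_days)
    before = [
        r for r in records
        if action_closed_on - window <= r.occurred_on < action_closed_on
    ]
    after = [
        r for r in records
        if action_closed_on <= r.occurred_on < action_closed_on + window
    ]
    return summarise(before), summarise(after)
```

A lower count or mean severity in the second window suggests, but does not prove, that the action helped. Small samples and unrelated changes can produce the same pattern, so the comparison is best read as a prompt for review rather than a verdict.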

Level 5 – Optimising

Category - Description
People & Culture - Post-incident learning is integrated into platform evolution, architectural decision-making, and risk planning.
Process & Governance - Organisational rituals reinforce learning from incidents (e.g. learning reviews, incident community of practice).
Technology & Tools - RCA data is analysed over time to surface systemic friction and trigger broader engineering initiatives.
Measurement & Metrics - Impact reduction metrics, incident recurrence trendlines, and service reliability scores are reviewed at team and organisation levels.

Key Measures

  • Percentage of incidents with completed and verified improvement actions
  • Time to resolve follow-up actions from post-incident reviews
  • Trend in recurrence rate of similar incidents or failures
  • Engineer sentiment around the value of post-incident reviews
  • Operational improvements implemented as a result of learning reviews
  • RCA completion rate and action item quality (e.g. SMART criteria)
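
The first two measures above can be made concrete with a small calculation. The following is a minimal sketch that assumes a hypothetical `FollowUp` record with raised, resolved, and verified fields. It also assumes that an incident with no follow-up actions counts as not actioned. Teams may reasonably define both differently.

```python
from dataclasses import dataclass, field
from datetime import date
from statistics import median
from typing import Optional


@dataclass
class FollowUp:
    raised_on: date
    resolved_on: Optional[date] = None
    verified: bool = False  # assumed: someone confirmed the change works


@dataclass
class ReviewedIncident:
    incident_id: str
    follow_ups: list[FollowUp] = field(default_factory=list)


def pct_incidents_fully_actioned(incidents: list[ReviewedIncident]) -> float:
    """Percentage of incidents whose follow-ups are all resolved and verified.

    Incidents with no follow-ups count as not actioned (an assumption).
    """
    if not incidents:
        return 0.0
    done = sum(
        1 for i in incidents
        if i.follow_ups
        and all(f.resolved_on is not None and f.verified for f in i.follow_ups)
    )
    return 100.0 * done / len(incidents)


def median_days_to_resolve(incidents: list[ReviewedIncident]) -> Optional[float]:
    """Median days from raising to resolving a follow-up, or None if none are resolved."""
    durations = [
        (f.resolved_on - f.raised_on).days
        for i in incidents
        for f in i.follow_ups
        if f.resolved_on is not None
    ]
    return median(durations) if durations else None
```

The median is used here instead of the mean so that a single long-running action does not dominate the figure. Open actions are excluded from the duration, so this number is best read alongside the count of actions still open.
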