Standard: Incidents drive platform and service improvements, not just RCA reports

Purpose and Strategic Importance

This standard ensures that incidents lead to meaningful improvements in services, platforms, or practices, not just retrospective documentation. The purpose of an incident review is not only to understand what happened, but also to reduce the likelihood that it happens again.

It supports the policy “Run to Stop When Problems Arise” by shifting the focus from post-mortem reporting to actual learning and engineering action. When teams apply what they’ve learned, they reduce recurrence risk, strengthen systemic resilience, and build a culture of continuous improvement.

Strategic Impact

Moves incident response from reactive to proactive, closing learning loops
Reduces recurrence of avoidable issues through platform and process changes
Drives structural improvement across the tech estate (e.g. tooling, patterns, architecture)
Improves operational excellence and reduces operational toil over time
Demonstrates that time spent on incident management leads to tangible progress

Risks of Not Having This Standard

Root cause analysis (RCA) documents are produced, but no changes are made to prevent recurrence
Engineers become disillusioned with incident processes that fail to drive change
Systemic issues remain unaddressed, degrading reliability over time
Platforms accumulate operational debt due to inaction
Incident reviews become compliance rituals, not learning moments

CMMI Maturity Model

Level 1 – Initial

Category	Description
People & Culture	- Incidents are handled in isolation, with little follow-up or shared learning.
Process & Governance	- Post-incident reviews are ad hoc or skipped entirely.
Technology & Tools	- No tooling to track follow-up actions from incidents.
Measurement & Metrics	- No visibility into whether root causes are addressed.

Level 2 – Managed

Category	Description
People & Culture	- Some teams hold basic incident reviews, but follow-through varies.
Process & Governance	- Basic RCAs are written, and action items may be logged manually.
Technology & Tools	- Shared documents are used to capture issues, but are not connected to engineering backlogs.
Measurement & Metrics	- Volumes of incidents and RCAs are counted, but impact and closure are not tracked.

Level 3 – Defined

Category	Description
People & Culture	- Engineers treat incident analysis as a valuable opportunity to improve the system.
Process & Governance	- Incident reviews include clear root causes, contributing factors, and improvement actions.
Technology & Tools	- Incident tools integrate with engineering backlogs to track follow-ups to closure.
Measurement & Metrics	- Closure rates of improvement actions and recurrence rates are measured and reviewed.

Level 4 – Quantitatively Managed

Category	Description
People & Culture	- Teams prioritise systemic improvements over superficial fixes.
Process & Governance	- Incident improvement actions are time-bound, prioritised, and visible to stakeholders.
Technology & Tools	- Dashboards show action item completion, RCA quality, and incident-to-resolution time trends.
Measurement & Metrics	- Action effectiveness is reviewed to determine whether similar incidents are decreasing in frequency and severity.

Level 5 – Optimising

Category	Description
People & Culture	- Post-incident learning is integrated into platform evolution, architectural decision-making, and risk planning.
Process & Governance	- Organisational rituals reinforce learning from incidents (e.g. learning reviews, incident community of practice).
Technology & Tools	- RCA data is analysed over time to surface systemic friction and trigger broader engineering initiatives.
Measurement & Metrics	- Impact reduction metrics, incident recurrence trendlines, and service reliability scores are reviewed at team and organisation levels.

Key Measures

Percentage of incidents with completed and verified improvement actions
Time to resolve follow-up actions from post-incident reviews
Trend in recurrence rate of similar incidents or failures
Engineer sentiment around the value of post-incident reviews
Operational improvements implemented as a result of learning reviews
RCA completion rate and action item quality (e.g. SMART criteria)
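
As an illustration of how the first three measures above might be computed, the following is a minimal Python sketch, not a prescribed implementation. The record shapes (Incident, Action), the failure_class label used to group "similar" incidents, and all function names are hypothetical assumptions introduced for this example. In practice the data would come from whatever incident tooling and engineering backlog a team uses.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Action:
    """A follow-up action from a post-incident review (hypothetical shape)."""
    opened: date
    closed: Optional[date] = None   # None while the action is still open
    verified: bool = False          # True once the fix is confirmed effective


@dataclass
class Incident:
    """An incident record (hypothetical shape)."""
    incident_id: str
    failure_class: str              # assumed label for grouping similar incidents
    actions: List[Action] = field(default_factory=list)


def pct_incidents_fully_actioned(incidents: List[Incident]) -> float:
    """Percentage of incidents whose actions are all closed and verified.

    Assumption: an incident with no actions counts as not actioned.
    """
    if not incidents:
        return 0.0
    done = sum(
        1
        for inc in incidents
        if inc.actions
        and all(a.closed is not None and a.verified for a in inc.actions)
    )
    return 100.0 * done / len(incidents)


def mean_days_to_close(incidents: List[Incident]) -> Optional[float]:
    """Mean days from opening to closing an action; open actions are ignored."""
    durations = [
        (a.closed - a.opened).days
        for inc in incidents
        for a in inc.actions
        if a.closed is not None
    ]
    return sum(durations) / len(durations) if durations else None


def recurrence_rate(incidents: List[Incident]) -> float:
    """Share of incidents whose failure_class already appeared earlier.

    Assumption: the list is in chronological order.
    """
    seen = set()
    repeats = 0
    for inc in incidents:
        if inc.failure_class in seen:
            repeats += 1
        seen.add(inc.failure_class)
    return repeats / len(incidents) if incidents else 0.0
```

Tracking these values per month or per quarter, rather than as a single figure, would give the trendlines that the higher maturity levels call for. How "similar" incidents are grouped is a design choice: a coarse failure_class inflates the recurrence rate, while a very fine one hides repeats.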