Standard: Change Failure Rate

Description

Change Failure Rate (CFR) is a DORA metric that measures the percentage of code, data, or infrastructure changes that result in degraded service, incidents, hotfixes, or rollbacks after deployment. It indicates the quality and reliability of releases and the effectiveness of upstream quality controls.

A lower change failure rate demonstrates well-tested, resilient changes and robust delivery practices.

How to Use

What to Measure

Count of changes deployed to production.
Count of those changes that caused a failure and required immediate remediation (e.g. rollback, fix-forward, hotfix, config reversal).
Measure across software, infrastructure, and data environments.

Formula

Change Failure Rate = (Failed Changes / Total Changes) × 100
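As a minimal sketch, the formula can be computed as below. The function name and the choice to report 0% when nothing was deployed are assumptions made for illustration, not part of this standard.

```python
def change_failure_rate(failed_changes: int, total_changes: int) -> float:
    """Return Change Failure Rate as a percentage of total changes."""
    if failed_changes < 0 or total_changes < 0:
        raise ValueError("counts must be non-negative")
    if failed_changes > total_changes:
        raise ValueError("failed_changes cannot exceed total_changes")
    if total_changes == 0:
        # Assumption: a period with no deployments is reported as 0%.
        return 0.0
    return 100 * failed_changes / total_changes


# Example: 4 failed changes out of 50 production deployments.
print(change_failure_rate(4, 50))  # 8.0
```

Teams that prefer to show "no data" for periods without deployments could return `None` instead of `0.0` in the zero-deployment case.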

You can also break this down by:

Environment (e.g. staging vs production).
Service or team.
Severity of failure (e.g. customer-visible vs internal).
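The breakdowns above can be sketched as a simple grouping over change records. The record fields (`service`, `failed`, `severity`) and the sample data are hypothetical; a real data set would come from your deployment and incident tooling.

```python
from collections import defaultdict

# Hypothetical change records; the field names are illustrative assumptions.
changes = [
    {"service": "checkout", "failed": True, "severity": "customer-visible"},
    {"service": "checkout", "failed": False, "severity": None},
    {"service": "checkout", "failed": False, "severity": None},
    {"service": "checkout", "failed": False, "severity": None},
    {"service": "search", "failed": True, "severity": "internal"},
    {"service": "search", "failed": False, "severity": None},
]


def cfr_by(records, key):
    """CFR (%) per group: failed changes in the group / all changes in the group."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        if record["failed"]:
            failures[record[key]] += 1
    return {group: 100 * failures[group] / totals[group] for group in totals}


def cfr_by_severity(records):
    """Share (%) of all changes that failed at each severity level."""
    failures = defaultdict(int)
    for record in records:
        if record["failed"]:
            failures[record["severity"]] += 1
    return {sev: 100 * count / len(records) for sev, count in failures.items()}


print(cfr_by(changes, "service"))  # {'checkout': 25.0, 'search': 50.0}
print(cfr_by_severity(changes))
```

Severity only exists for failed changes, so the severity view divides by all changes rather than by a per-severity total. The per-severity values then add up to the overall CFR.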

Instrumentation Tips

Link CI/CD pipelines with incident management systems or monitoring alerts.
Define clearly what constitutes a “failure” and apply that definition consistently.
Use deployment tags and changelogs to correlate changes with downstream impact.
Surface this metric in team health dashboards and ops reviews.
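One way to correlate changes with downstream impact is to attribute each incident to the most recent deployment of the same service within a time window. The sketch below assumes hypothetical record shapes and a 24-hour window. It is a heuristic: where the incident record carries an explicit deployment ID, that link is more reliable than a time-based match.

```python
from datetime import datetime, timedelta


def failed_deployment_ids(deployments, incidents, window=timedelta(hours=24)):
    """Return IDs of deployments blamed for at least one incident.

    An incident is attributed to the latest deployment of the same service
    that happened no more than `window` before the incident started.
    """
    failed = set()
    for incident in incidents:
        candidates = [
            d for d in deployments
            if d["service"] == incident["service"]
            and timedelta(0) <= incident["started_at"] - d["deployed_at"] <= window
        ]
        if candidates:
            latest = max(candidates, key=lambda d: d["deployed_at"])
            failed.add(latest["id"])
    return failed


# Hypothetical data for illustration only.
deployments = [
    {"id": "d1", "service": "checkout", "deployed_at": datetime(2024, 5, 1, 10, 0)},
    {"id": "d2", "service": "checkout", "deployed_at": datetime(2024, 5, 2, 9, 0)},
    {"id": "d3", "service": "search", "deployed_at": datetime(2024, 5, 2, 12, 0)},
]
incidents = [
    {"service": "checkout", "started_at": datetime(2024, 5, 2, 9, 30)},
    {"service": "search", "started_at": datetime(2024, 5, 4, 8, 0)},  # outside window
]

failed = failed_deployment_ids(deployments, incidents)
print(failed)  # {'d2'}
print(100 * len(failed) / len(deployments))  # CFR for this sample, in percent
```

The window length is a trade-off: a short window misses slow-burning failures such as data drift, while a long one blames deployments for unrelated incidents.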

Why It Matters

Builds confidence in releases: A low CFR shows that changes can be shipped safely; a high CFR signals unstable pipelines, rushed changes, or weak testing.
Protects customer trust: Failed changes increase risk, downtime, and reputational damage.
Encourages quality-first thinking: Prioritises safe deployment practices, small changes, and good rollback strategies.

Best Practices

Make changes smaller and more frequent to reduce blast radius.
Use deployment strategies like blue/green, canary, and shadow releases.
Automate pre-deployment checks and post-deployment monitoring.
Include CFR in platform OKRs or engineering quality objectives.
Enable fast rollback or fix-forward with automated mitigation plans.

Common Pitfalls

Not defining “failure” clearly, which leads to inconsistent reporting.
Underreporting failures to protect team reputation.
Treating a rollback as a success instead of a last resort; a rolled-back change still counts as a failed change.
Ignoring non-functional failures (e.g. performance drops, data drift).

Signals of Success

CFR is consistently low and understood by both teams and stakeholders.
Failed changes are investigated, and their causes are not repeated.
Engineering teams are confident in safe delivery practices.
CFR is used to shape platform improvements, not blame.

[[Defect Escape Rate]]
[[CoE/Engineering/Measures/Observability & Detection/Mean Time to Detect (MTTD)]]
[[Deployment Frequency]]
[[Lead Time for Change]]
[[Test Coverage of Critical Paths]]