Standard: Incident Frequency
Description
Incident Frequency measures how often production incidents occur within a given timeframe, typically a week, a month, or a release cycle. It is a key indicator of operational reliability, system stability, and the effectiveness of your engineering and testing practices.
High incident frequency may signal technical debt, inadequate testing, brittle systems, or fast delivery without sufficient guardrails.
How to Use
What to Measure
- Count of incidents meeting your severity threshold (e.g. Sev 0–2) per team, service, or platform area.
- Segment by incident source (e.g. change, third-party, infrastructure) or root cause for deeper insights.
Incident Frequency = Number of Incidents / Time Period
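As a rough illustration of the formula above, the following sketch counts incidents at or above a severity threshold and divides by the number of weeks in the window. The record layout, team names, and the `incident_frequency` helper are hypothetical and chosen only for the example; a real export from your incident tool will look different.

```python
from collections import Counter
from datetime import date

# Hypothetical incident records: (opened_on, severity, team).
# Lower severity number = more severe, matching the "Sev 0-2" convention.
INCIDENTS = [
    (date(2024, 1, 3), 1, "payments"),
    (date(2024, 1, 9), 2, "payments"),
    (date(2024, 1, 17), 3, "search"),   # below the threshold, not counted
    (date(2024, 1, 24), 0, "search"),
    (date(2024, 2, 6), 2, "payments"),  # outside the window, not counted
]


def incident_frequency(incidents, start, end, max_severity=2):
    """Return incidents per week per team for the window [start, end)."""
    weeks = (end - start).days / 7
    if weeks <= 0:
        raise ValueError("end must be after start")
    counts = Counter(
        team
        for opened_on, severity, team in incidents
        if start <= opened_on < end and severity <= max_severity
    )
    # Number of incidents divided by the time period, expressed in weeks.
    return {team: n / weeks for team, n in counts.items()}


# Four-week window: 1 January up to (not including) 29 January.
freq = incident_frequency(INCIDENTS, date(2024, 1, 1), date(2024, 1, 29))
print(freq)  # {'payments': 0.5, 'search': 0.25}
```

Teams with no qualifying incidents in the window are simply absent from the result; whether to report them as zero is a choice for your reporting layer.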
Instrumentation Tips
- Use incident management systems (e.g. PagerDuty, Opsgenie, Jira) to track timestamps, severity, impact, and resolution.
- Define consistent criteria for incident logging and severity classification.
- Automate aggregation and reporting of incident metrics across teams.
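One way to approach consistent classification and automated aggregation is to map each tool's severity labels onto a single shared scale before counting. The sketch below assumes incidents have already been exported as dictionaries; the field names (`opened_at`, `severity`, `source`), the labels in `SEVERITY_MAP`, and the helper name are illustrative assumptions, not the schema of PagerDuty, Opsgenie, or Jira.

```python
from collections import defaultdict

# Hypothetical mapping from tool-specific labels to one shared Sev scale.
SEVERITY_MAP = {
    "P1": 0, "P2": 1, "P3": 2,
    "critical": 0, "high": 1, "medium": 2, "low": 3,
}


def monthly_counts_by_source(records, max_severity=2):
    """Count qualifying incidents per (YYYY-MM, source).

    Records with an unrecognised severity label are returned separately
    so they can be reviewed instead of being silently dropped or guessed.
    """
    counts = defaultdict(int)
    skipped = []
    for rec in records:
        severity = SEVERITY_MAP.get(rec["severity"])
        if severity is None:
            skipped.append(rec)
            continue
        if severity <= max_severity:
            month = rec["opened_at"][:7]  # ISO 8601 timestamp -> "YYYY-MM"
            counts[(month, rec["source"])] += 1
    return dict(counts), skipped


records = [
    {"opened_at": "2024-03-02T10:15:00Z", "severity": "P1", "source": "change"},
    {"opened_at": "2024-03-11T08:00:00Z", "severity": "high", "source": "change"},
    {"opened_at": "2024-03-20T22:40:00Z", "severity": "low", "source": "infrastructure"},
    {"opened_at": "2024-04-05T13:05:00Z", "severity": "medium", "source": "third-party"},
    {"opened_at": "2024-04-07T09:30:00Z", "severity": "urgent", "source": "change"},
]

counts, skipped = monthly_counts_by_source(records)
print(counts)        # {('2024-03', 'change'): 2, ('2024-04', 'third-party'): 1}
print(len(skipped))  # 1 (the record labelled "urgent")
```

Returning the skipped records makes gaps in the severity mapping visible, which is usually where inconsistent logging criteria first show up.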
Why It Matters
- Operational health: Reveals system reliability and resilience trends.
- Signal for improvement: Informs where engineering effort is needed most.
- Team focus: Helps balance time spent on firefighting vs. building.
- Customer trust: Frequent incidents impact SLAs and user satisfaction.
Best Practices
- Standardise severity levels and track consistently across all teams.
- Review trends in delivery, architecture, or process that may drive incidents.
- Pair this metric with Mean Time to Recovery (MTTR) and Change Failure Rate (CFR) for a holistic view of incident impact.
- Analyse root causes and track improvements via action items.
- Celebrate declining incident trends driven by structural fixes, not luck.
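To check whether incident counts are actually trending down rather than fluctuating, one simple option is a least-squares slope over per-period counts. This is a minimal sketch with made-up weekly numbers and a hypothetical `trend_slope` helper. A negative slope only says the count is falling; it does not show that structural fixes caused the decline, so read it alongside review outcomes and delivery data.

```python
def trend_slope(counts):
    """Least-squares slope of counts against period index.

    The result is the average change in incident count from one
    period to the next; negative values indicate a declining trend.
    """
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two periods")
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    numerator = sum((i - mean_x) * (c - mean_y) for i, c in enumerate(counts))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


# Illustrative weekly incident counts, oldest first.
weekly = [6, 5, 5, 3, 4, 2]
slope = trend_slope(weekly)
print(round(slope, 3))  # -0.714
```

With only a handful of periods the slope is sensitive to a single bad week, so a longer window or a rolling average may give a steadier signal.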
Common Pitfalls
- Inconsistent incident definitions leading to unreliable data.
- Only tracking major incidents and ignoring patterns in smaller ones.
- Treating incidents as isolated events rather than systemic signals.
- Failing to follow up with blameless reviews or improvement actions.
Signals of Success
- Decreasing incident rate over time without slowing down delivery.
- Improvements from postmortems measurably reduce recurrence.
- Incident trends are tracked and discussed in delivery reviews.
- Engineers feel supported in addressing root causes, not just symptoms.
Related Measures
- [[Change Failure Rate]]
- [[Mean Time to Recovery (MTTR)]]
- [[Error Budget Consumption]]
- [[Time to Remediate Vulnerabilities]]
- [[Automated Remediation Rate]]