Standard: Error Budget Consumption

Description

Error Budget Consumption tracks how much of the unreliability your SLOs allow (downtime, failed requests, slow responses) has been used within a given period. It helps teams balance velocity and reliability by setting a quantifiable boundary for risk.

When too much of the error budget is consumed, teams pause or slow deployments and shift focus to reliability work. When budgets are healthy, teams can safely release at pace.

How to Use

What to Measure

Compare actual availability (or latency, error rate, etc.) against your defined SLO targets for a service.
Express consumption as the percentage of the error budget used so far in the SLO window.

Formula

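Error Budget Consumption (%) = (SLO Budget Used / Total SLO Budget) x 100

Here, Total SLO Budget = (1 - SLO target) x total events (or total time) in the SLO window, and SLO Budget Used is the number of bad events (or bad minutes) observed so far in that window.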
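
As a worked illustration of the formula, the sketch below computes consumption for a request-based SLO. It assumes the budget is counted in failed requests; the function and parameter names (`error_budget_consumption`, `slo_target`, `total_events`, `bad_events`) are hypothetical and not part of this standard.

```python
def error_budget_consumption(slo_target: float, total_events: int, bad_events: int) -> float:
    """Return the percentage of the error budget consumed in the window.

    slo_target   -- SLO as a fraction, e.g. 0.999 for 99.9% (assumed input format)
    total_events -- all requests (or minutes) observed in the SLO window
    bad_events   -- requests (or minutes) that violated the SLI
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    if total_events <= 0:
        return 0.0  # no traffic observed, so nothing has been consumed

    # Total SLO budget: the number of bad events the SLO tolerates.
    total_budget = (1.0 - slo_target) * total_events

    # Consumption (%) = budget used / total budget x 100.
    return bad_events / total_budget * 100.0


if __name__ == "__main__":
    pct = error_budget_consumption(slo_target=0.999, total_events=1_000_000, bad_events=250)
    print(f"Error budget consumed: {pct:.1f}%")
```

With a 99.9% target and 1,000,000 requests, the budget is roughly 1,000 failed requests, so 250 failures consume about a quarter of it. The same arithmetic applies to time-based SLOs: a 99.9% availability target over a 30-day window allows about 43.2 minutes of downtime.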

Instrumentation Tips

Define clear SLOs backed by measurable SLIs, and track them in an observability platform (e.g. Prometheus, Datadog, New Relic).
Integrate error budget policies into delivery workflows and incident processes.
Track by service, team, or product area.

Why It Matters

Reliability guardrail: Prevents delivery velocity from compromising user experience.
Prioritisation signal: Informs trade-offs between feature work and resilience efforts.
Governance: Enables objective, data-informed decisions on risk tolerance.
Customer trust: Links internal metrics directly to external impact.

Best Practices

Define SLOs with input from both engineering and product teams.
Visualise error budgets on service dashboards and status pages.
Automate actions (e.g. deploy hold, review trigger) when thresholds are breached.
Include budget consumption in incident postmortems.
Use burn rates to track how quickly budgets are being consumed.
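
The burn-rate and automation practices above can be sketched together. Burn rate is commonly defined as the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 uses exactly the whole budget over the SLO window. The thresholds (`hold_at_pct`, `review_at_rate`) and the action names below are illustrative assumptions, not values prescribed by this standard; each team would set its own in its error budget policy.

```python
def burn_rate(slo_target: float, total_events: int, bad_events: int) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    if total_events <= 0:
        return 0.0  # no traffic in the lookback window
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


def deploy_action(consumption_pct: float, rate: float,
                  hold_at_pct: float = 100.0, review_at_rate: float = 2.0) -> str:
    """Map budget status to a delivery action (hypothetical policy).

    hold_at_pct    -- assumed threshold: hold deploys once the budget is spent
    review_at_rate -- assumed threshold: trigger a review on a fast burn
    """
    if consumption_pct >= hold_at_pct:
        return "hold"      # budget exhausted: pause deployments
    if rate >= review_at_rate:
        return "review"    # budget burning fast: trigger a review
    return "proceed"       # budget healthy: release as normal


if __name__ == "__main__":
    # 50 failures in 10,000 requests over the last hour, against a 99.9% SLO.
    rate = burn_rate(slo_target=0.999, total_events=10_000, bad_events=50)
    print(f"burn rate: {rate:.1f}")
    print(deploy_action(consumption_pct=40.0, rate=rate))
```

In this example the service fails 0.5% of requests against an allowed 0.1%, a burn rate of about 5, so the sketch returns "review" even though only 40% of the budget has been used. Checking burn rate alongside total consumption catches fast-moving incidents before the budget is gone.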

Common Pitfalls

Undefined or unrealistic SLOs that don’t reflect actual customer expectations.
Ignoring budget breaches or not linking them to delivery decisions.
Tracking error budgets manually or inconsistently.
Setting budgets too tight, which creates unnecessary friction for delivery.

Signals of Success

Teams are aware of and act on their error budget status.
Budget breaches result in meaningful improvement work.
Reliability trade-offs are discussed transparently with stakeholders.
Services meet or exceed SLOs consistently without slowing delivery pace.

[[Mean Time to Recovery (MTTR)]]
[[Change Failure Rate]]
[[Incident Frequency]]
[[Service Level Objective (SLO) Compliance]]
[[Automated Remediation Rate]]