Ragan McGill

Practice: Error Budget Policies

Purpose and Strategic Importance

Error Budget Policies define how much unreliability a system can tolerate over time - and what should happen when that budget is exhausted. They create a balanced approach between innovation and reliability, helping teams manage risk, prioritise work, and make trade-offs with data rather than opinion.

By grounding delivery decisions in service-level performance, error budgets align product and engineering goals with customer impact and system health - reinforcing shared ownership of availability and velocity.

Description of the Practice

An error budget is the allowable threshold of unreliability, calculated as 100% minus the agreed Service-Level Objective (SLO).
Teams track actual service performance, measured by Service-Level Indicators (SLIs), against the SLO to see how much of the budget remains.
Policies define the actions to take when the error budget is exhausted - such as halting deployments, prioritising reliability work, or triggering a review.
The practice is common in Site Reliability Engineering (SRE), but it applies in any environment where teams balance speed and stability.
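
As a minimal sketch of the arithmetic described above, the snippet below derives the error budget from an SLO target and reports how much of it remains for a given window. The function names and the sample request counts are illustrative assumptions, not part of any particular monitoring tool.

```python
def error_budget(slo_target: float) -> float:
    """Allowed fraction of bad events: 1 minus the SLO target (e.g. 0.999)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    return 1.0 - slo_target


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if total_events <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = error_budget(slo_target) * total_events
    return 1.0 - bad_events / allowed_bad


# Hypothetical month: 2,000,000 requests, 500 of which missed the SLO.
remaining = budget_remaining(0.999, total_events=2_000_000, bad_events=500)
print(f"Error budget remaining: {remaining:.0%}")
```

With these sample numbers, a 99.9% SLO allows 2,000 bad requests in the month, so 500 bad requests leave roughly three quarters of the budget unspent.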

How to Practise It (Playbook)

1. Getting Started

Define SLOs for critical services (e.g. “99.9% of requests in each month succeed within 300 ms”).
Calculate the error budget (e.g. with a 99.9% SLO, 0.1% of requests in a month may fail or exceed the latency threshold).
Visualise error budget burn in a dashboard and share it with the team.
Decide as a team what actions to take if the budget is trending toward depletion or fully consumed.
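
One way to make "trending toward depletion" concrete is a burn rate: the observed error rate divided by the error rate the SLO allows. A burn rate of 1 spends the budget exactly by the end of the window, and higher values exhaust it sooner. The sketch below assumes a 30-day window; the function names and sample numbers are hypothetical.

```python
def burn_rate(slo_target: float, total_events: int, bad_events: int) -> float:
    """Observed error rate relative to the error rate the SLO allows."""
    if total_events <= 0:
        return 0.0
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


def days_until_exhausted(rate: float, budget_left: float, window_days: float = 30.0) -> float:
    """Days until the budget runs out if the current burn rate holds.

    budget_left is the unspent fraction of the budget (0.0 to 1.0).
    """
    if rate <= 0:
        return float("inf")
    return budget_left * window_days / rate


# Hypothetical last hour: 100,000 requests, 400 bad, against a 99.9% SLO.
rate = burn_rate(0.999, total_events=100_000, bad_events=400)
print(f"Burn rate: {rate:.1f}x")
print(f"Days left at this rate: {days_until_exhausted(rate, budget_left=0.6):.1f}")
```

A dashboard could plot this value over time, and the team could agree in advance which burn rates warrant a conversation and which warrant action.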

2. Scaling and Maturing

Define tiered response policies: inform, investigate, pause, roll back.
Align product and platform teams on the value of availability and trade-offs.
Use error budgets to plan engineering investment - e.g. invest in reliability when the remaining budget is low.
Pair budgets with incident reviews to identify systemic causes of depletion.
Incorporate budget burn trends into delivery planning and roadmap prioritisation.
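
A tiered policy can be written down as a simple mapping from budget consumed to an agreed response. The sketch below uses the four tiers named above; the thresholds (50%, 75%, 100%) and the rule for when to roll back are placeholder assumptions that each team would set for itself.

```python
from enum import Enum


class Action(Enum):
    NONE = "no action"
    INFORM = "inform the team"
    INVESTIGATE = "investigate the cause"
    PAUSE = "pause feature deployments"
    ROLLBACK = "roll back the latest release"


# (minimum fraction of budget consumed, action), checked from most to least severe.
# Threshold values are illustrative assumptions, not recommendations.
POLICY_TIERS = [
    (1.00, Action.PAUSE),
    (0.75, Action.INVESTIGATE),
    (0.50, Action.INFORM),
]


def policy_action(budget_consumed: float, recent_release_implicated: bool = False) -> Action:
    """Return the agreed response for the given fraction of budget consumed."""
    # Assumed rule: roll back only when the budget is gone and a recent release is implicated.
    if budget_consumed >= 1.0 and recent_release_implicated:
        return Action.ROLLBACK
    for threshold, action in POLICY_TIERS:
        if budget_consumed >= threshold:
            return action
    return Action.NONE


for consumed in (0.2, 0.6, 0.8, 1.1):
    print(f"{consumed:.0%} consumed -> {policy_action(consumed).value}")
```

Keeping the tiers in one small table makes the policy easy to review with product and platform teams and easy to change as the service matures.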

3. Team Behaviours to Encourage

View budgets as shared resources - not just ops metrics.
Discuss reliability as part of product planning - not just after incidents.
Celebrate periods of healthy budget usage and improved stability.
Use budgets to ask better questions - “Should we ship this now?” or “Is this risk worth it?”

4. Watch Out For…

Unrealistic or unmeasured SLOs undermining budget accuracy.
Ignoring budget signals in the face of pressure to ship.
Blame culture when budgets are exhausted - budgets are learning signals, not grounds for blame.
Treating budgets as fixed rules rather than dynamic indicators of system health.

5. Signals of Success

SLOs and budgets are visible and understood by product and engineering teams.
Teams adjust priorities in response to reliability trends.
Fewer firefights and more proactive work on resilience and performance.
Product velocity and system stability improve together - not in conflict.
Error budgets become part of the organisation’s decision-making fabric.