Practice: Error Budget Policies
Purpose and Strategic Importance
Error Budget Policies define how much unreliability a system can tolerate over time - and what should happen when that budget is exhausted. They create a balanced approach between innovation and reliability, helping teams manage risk, prioritise work, and make trade-offs with data rather than opinion.
By grounding delivery decisions in service-level performance, error budgets align product and engineering goals with customer impact and system health - reinforcing shared ownership of availability and velocity.
Description of the Practice
- An error budget is the allowable threshold of unreliability, calculated as 100% minus the agreed Service-Level Objective (SLO).
- Teams track actual service performance, measured by a Service-Level Indicator (SLI), against the SLO to see how much of the budget remains.
- Policies define the actions to take when the error budget is running low or exhausted - such as halting deployments, prioritising reliability work, or triggering a review.
- Common in Site Reliability Engineering (SRE), but applicable in any environment where teams balance speed and stability.
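The arithmetic described above can be sketched as follows. This is a minimal illustration, assuming a request-based SLI (good versus bad requests); the function names and the request counts in the usage note are hypothetical, not part of any particular tool.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Allowed fraction of bad events: 100% minus the SLO (0.999 -> about 0.001)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    return 1.0 - slo_target


def budget_remaining(slo_target: float, total_requests: int, bad_requests: int) -> float:
    """Fraction of the error budget still unspent; negative once it is overspent."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # Number of bad requests the SLO tolerates for this volume of traffic.
    allowed_bad = error_budget_fraction(slo_target) * total_requests
    return 1.0 - bad_requests / allowed_bad
```

For example, with a 99.9% SLO, 1,000,000 requests and 250 bad requests, roughly 1,000 bad requests are allowed, so budget_remaining(0.999, 1_000_000, 250) is about 0.75, meaning three quarters of the budget is left.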
How to Practise It (Playbook)
1. Getting Started
- Define SLOs for critical services (e.g. “in each calendar month, 99.9% of requests succeed within 300 ms”).
- Calculate the error budget (e.g. 0.1% of requests in a month may fail or take longer than 300 ms).
- Visualise error budget burn in a dashboard and share it with the team.
- Decide as a team what actions to take if the budget is trending toward depletion or fully consumed.
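One way to judge whether the budget is "trending toward depletion" is a burn rate: the observed bad-request ratio divided by the ratio the SLO allows. The sketch below assumes that definition, a request-based SLI, and a 30-day window; the function names and the default window are illustrative assumptions.

```python
def burn_rate(slo_target: float, total_requests: int, bad_requests: int) -> float:
    """Observed bad-request ratio divided by the ratio the SLO allows.

    A value of 1.0 spends the budget exactly over the SLO window;
    higher values exhaust it early.
    """
    if total_requests == 0:
        return 0.0  # no traffic observed, so nothing was burned
    return (bad_requests / total_requests) / (1.0 - slo_target)


def days_until_exhausted(remaining_fraction: float, rate: float,
                         window_days: float = 30.0) -> float:
    """Days until the budget runs out if the current burn rate persists."""
    if rate <= 0:
        return float("inf")  # budget is not being consumed
    return max(remaining_fraction, 0.0) * window_days / rate
```

For instance, if half the budget remains and the service is burning at twice the sustainable rate, days_until_exhausted(0.5, 2.0) gives 7.5 days, which is a useful figure to show next to the dashboard's burn chart.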
2. Scaling and Maturing
- Define tiered response policies that escalate as the budget shrinks: inform, investigate, pause, roll back.
- Align product and platform teams on the value of availability and trade-offs.
- Use error budgets to plan engineering investment - e.g. invest in reliability when budget is low.
- Pair budgets with incident reviews to identify systemic causes of depletion.
- Incorporate budget burn trends into delivery planning and roadmap prioritisation.
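A tiered response policy can be written down as data so that it is reviewable and testable. The following is a sketch only: the tier names come from the list above, but the thresholds (50%, 25%, 10%, 0% of budget remaining) are hypothetical and should be agreed by each team.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    INFORM = "inform"
    INVESTIGATE = "investigate"
    PAUSE = "pause"
    ROLL_BACK = "roll back"


# (floor, action): the action applies while remaining budget is above the floor.
# Thresholds are illustrative assumptions, ordered from healthiest to worst.
POLICY_TIERS = [
    (0.50, Action.NONE),
    (0.25, Action.INFORM),
    (0.10, Action.INVESTIGATE),
    (0.00, Action.PAUSE),
]


def select_action(remaining_fraction: float) -> Action:
    """Return the policy action for the given fraction of budget remaining."""
    for floor, action in POLICY_TIERS:
        if remaining_fraction > floor:
            return action
    # Budget fully consumed (zero or negative remaining).
    return Action.ROLL_BACK
```

With these example thresholds, select_action(0.3) returns Action.INFORM and select_action(0.0) returns Action.ROLL_BACK. Keeping the tiers in a single table makes it easy to adjust them as the team learns, rather than treating them as fixed rules.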
3. Team Behaviours to Encourage
- View budgets as shared resources - not just ops metrics.
- Discuss reliability as part of product planning - not just after incidents.
- Celebrate periods of healthy budget usage and improved stability.
- Use budgets to ask better questions - “Should we ship this now?” or “Is this risk worth it?”
4. Watch Out For…
- Unrealistic or unmeasured SLOs undermining budget accuracy.
- Ignoring budget signals in the face of pressure to ship.
- Blame culture when budgets are exhausted - budgets are learning signals.
- Treating budgets as fixed rules rather than dynamic indicators of system health.
5. Signals of Success
- SLOs and budgets are visible and understood by product and engineering teams.
- Teams adjust priorities in response to reliability trends.
- Fewer firefights and more proactive work on resilience and performance.
- Product velocity and system stability improve together - not in conflict.
- Error budgets become part of the organisation’s decision-making fabric.