Practice: Error Budget Policies

Purpose and Strategic Importance

Error Budget Policies define how much unreliability a system can tolerate over time - and what should happen when that budget is exhausted. They create a balanced approach between innovation and reliability, helping teams manage risk, prioritise work, and make trade-offs with data rather than opinion.

By grounding delivery decisions in service-level performance, error budgets align product and engineering goals with customer impact and system health - reinforcing shared ownership of availability and velocity.


Description of the Practice

  • An error budget is the allowable threshold of unreliability, calculated as 100% minus the agreed Service-Level Objective (SLO).
  • Teams track actual service performance, measured by Service-Level Indicators (SLIs), against the SLO to see how much of the budget remains.
  • Policies define the actions to take when the error budget is exhausted, such as halting deployments, prioritising reliability work, or triggering a review.
  • The practice is common in Site Reliability Engineering (SRE) but applies in any environment where teams balance speed and stability.
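
The relationship between SLO, SLI, and remaining budget can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the `BudgetStatus` type, the `budget_status` function, and the request counts are all hypothetical, and it assumes a request-based SLI where each event is simply good or bad.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetStatus:
    allowed_bad: float  # bad events the SLO permits in the window
    consumed: float     # fraction of the budget used (can exceed 1.0)
    remaining: float    # fraction of the budget left (negative if overspent)


def budget_status(slo: float, total_events: int, bad_events: int) -> BudgetStatus:
    """Compare observed bad events with what the SLO allows.

    slo is a fraction, e.g. 0.999 for a 99.9% objective.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    budget_fraction = 1.0 - slo  # error budget = 100% minus the SLO
    allowed_bad = budget_fraction * total_events
    consumed = bad_events / allowed_bad
    return BudgetStatus(allowed_bad, consumed, 1.0 - consumed)


# Illustrative numbers only: one million requests, 400 of them bad.
status = budget_status(slo=0.999, total_events=1_000_000, bad_events=400)
print(f"allowed bad events: {status.allowed_bad:.0f}")
print(f"budget consumed:    {status.consumed:.0%}")
print(f"budget remaining:   {status.remaining:.0%}")
```

Expressing consumption as a fraction of the budget, rather than as a raw error count, keeps the figure comparable across services with very different traffic volumes.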

How to Practise It (Playbook)

1. Getting Started

  • Define SLOs for critical services (e.g. “99.9% of requests each month succeed within 300 ms”).
  • Calculate the error budget (e.g. with a 99.9% SLO, up to 0.1% of requests in a month may fail or miss the latency target).
  • Visualise error budget burn in a dashboard and share it with the team.
  • Decide as a team what actions to take if the budget is trending toward depletion or fully consumed.
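
One way to tell whether a budget is "trending toward depletion" is to track its burn rate: the observed error rate divided by the rate the SLO allows. The sketch below assumes a 30-day window and that the current error rate holds steady. The function names and sample figures are hypothetical, not part of any particular monitoring tool.

```python
def burn_rate(slo: float, observed_error_rate: float) -> float:
    """Budget spend relative to plan.

    A value of 1.0 uses up the budget exactly at the end of the window;
    higher values exhaust it early.
    """
    return observed_error_rate / (1.0 - slo)


def days_until_exhausted(
    remaining_fraction: float, rate: float, window_days: float = 30.0
) -> float:
    """Days until the remaining budget is gone if the burn rate holds."""
    if rate <= 0:
        return float("inf")
    return remaining_fraction * window_days / rate


# Illustrative numbers only: a 99.9% SLO, 0.4% of recent requests failing,
# and 60% of the monthly budget still unspent.
rate = burn_rate(slo=0.999, observed_error_rate=0.004)
days = days_until_exhausted(remaining_fraction=0.6, rate=rate)
print(f"burn rate: {rate:.1f}x")
print(f"budget exhausted in about {days:.1f} days at this rate")
```

A dashboard that shows both the remaining budget and the burn rate gives the team time to act before the budget is fully consumed.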

2. Scaling and Maturing

  • Define tiered response policies that escalate as more of the budget is consumed: inform, investigate, pause, roll back.
  • Align product and platform teams on the value of availability and trade-offs.
  • Use error budgets to plan engineering investment - e.g. invest in reliability when budget is low.
  • Pair budgets with incident reviews to identify systemic causes of depletion.
  • Incorporate budget burn trends into delivery planning and roadmap prioritisation.
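
A tiered response policy can be written down as data so that it is easy to review and change. The sketch below maps budget consumption to the four tiers named above. The thresholds (50%, 75%, 90%, 100%) are assumptions chosen purely for illustration, and each team would agree its own values and actions.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    INFORM = "inform"
    INVESTIGATE = "investigate"
    PAUSE = "pause"
    ROLLBACK = "rollback"


# (minimum fraction of budget consumed, action), strictest tier first.
# Hypothetical thresholds; agree the real ones as a team.
POLICY_TIERS = [
    (1.00, Action.ROLLBACK),
    (0.90, Action.PAUSE),
    (0.75, Action.INVESTIGATE),
    (0.50, Action.INFORM),
]


def action_for(consumed: float) -> Action:
    """Return the strictest tier whose threshold has been reached."""
    for threshold, action in POLICY_TIERS:
        if consumed >= threshold:
            return action
    return Action.NONE


for consumed in (0.20, 0.50, 0.80, 0.95, 1.20):
    print(f"{consumed:.0%} consumed -> {action_for(consumed).value}")
```

Keeping the tiers in a single table makes the policy something product and platform teams can discuss and amend together, rather than logic buried in alerting rules.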

3. Team Behaviours to Encourage

  • View budgets as shared resources - not just ops metrics.
  • Discuss reliability as part of product planning - not just after incidents.
  • Celebrate periods of healthy budget usage and improved stability.
  • Use budgets to ask better questions - “Should we ship this now?” or “Is this risk worth it?”

4. Watch Out For…

  • Unrealistic or unmeasured SLOs undermining budget accuracy.
  • Ignoring budget signals in the face of pressure to ship.
  • Blame culture when budgets are exhausted - budgets are learning signals.
  • Treating budgets as fixed rules rather than dynamic indicators of system health.

5. Signals of Success

  • SLOs and budgets are visible and understood by product and engineering teams.
  • Teams adjust priorities in response to reliability trends.
  • Fewer firefights and more proactive work on resilience and performance.
  • Product velocity and system stability improve together - not in conflict.
  • Error budgets become part of the organisation’s decision-making fabric.

Associated Standards

  • Changes are introduced into production frequently and sustainably (DF)
  • Delivery pace is sustainable and protects team wellbeing
  • Engineering lead time is minimised from start of work to safe deployment (LTFC)
  • Guardrails are co-designed by those closest to delivery
  • Teams are trusted to sunset their own systems and services
  • Teams embrace risk and learn from failure
  • Teams track time-in-status across their delivery flow
  • Work in progress reflects current business priorities

Associated Measures

  • Mean Time to Detect (MTTD)
  • Error Budget Consumption
  • Incident Frequency
