Practice: On-Call Rotation Health Checks

Purpose and Strategic Importance

On-Call Rotation Health Checks ensure the sustainability, fairness, and effectiveness of on-call duties. They provide visibility into workload, fatigue, and organisational support - helping to protect team wellbeing while maintaining high service reliability.

When done consistently, these checks foster a healthy on-call culture where engineers feel supported, incidents are handled efficiently, and reliability practices improve through continuous feedback.


Description of the Practice

  • Health checks assess the quality and impact of on-call experiences - including alert volume, incident load, response effort, sleep disruption, and post-incident recovery.
  • Checks are conducted at regular intervals (e.g. post-shift, monthly, quarterly) through surveys, reviews, or debriefs.
  • Insights guide improvements in tooling, automation, documentation, staffing, and compensation.
  • Findings are shared with leadership to prioritise support and investment in operational excellence.

How to Practise It (Playbook)

1. Getting Started

  • Create a simple retrospective format for post-on-call reviews (e.g. “What went well?”, “What was painful?”, “What should we fix?”).
  • Track basic metrics such as the number of pages, mean time to acknowledge and respond, and the time of day at which alerts fire.
  • Include wellbeing check-ins as part of the review - burnout and sleep loss matter.
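The basic metrics above can be computed from very little data. The following is a minimal sketch, assuming each page is available as a pair of timestamps; the `Page` record, the `shift_metrics` helper, and the 22:00 to 07:00 "night" window are illustrative assumptions, not part of any particular alerting tool.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Page:
    """Hypothetical record of one page: when it fired and when it was acknowledged."""
    triggered_at: datetime
    acknowledged_at: datetime


def shift_metrics(pages, night_start=22, night_end=7):
    """Summarise one shift: page count, mean time to acknowledge, and night-time pages.

    The night window (22:00 to 07:00 by default) is an assumption; adjust it to
    the team's local working hours.
    """
    if not pages:
        return {"pages": 0, "mean_ack_seconds": None, "night_pages": 0}
    ack_seconds = [
        (p.acknowledged_at - p.triggered_at).total_seconds() for p in pages
    ]
    night_pages = [
        p for p in pages
        if p.triggered_at.hour >= night_start or p.triggered_at.hour < night_end
    ]
    return {
        "pages": len(pages),
        "mean_ack_seconds": mean(ack_seconds),
        "night_pages": len(night_pages),
    }


# Made-up sample data for one shift.
pages = [
    Page(datetime(2024, 3, 4, 14, 0), datetime(2024, 3, 4, 14, 2)),
    Page(datetime(2024, 3, 5, 2, 30), datetime(2024, 3, 5, 2, 36)),
    Page(datetime(2024, 3, 5, 23, 10), datetime(2024, 3, 5, 23, 14)),
]
summary = shift_metrics(pages)
print(summary)
```

Night-time pages are counted separately because they are a rough proxy for sleep disruption, which the wellbeing check-in can then confirm or contradict.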

2. Scaling and Maturing

  • Run quarterly on-call health surveys to gather anonymised, quantitative feedback across teams.
  • Review and tune alert thresholds, playbook coverage, and escalation policies based on feedback.
  • Automate health metrics collection from alerting systems, rota tools (e.g. PagerDuty, Opsgenie), and incident platforms.
  • Share themes and action plans with leadership to support systemic improvements.
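As one possible way to summarise a quarterly survey while keeping it anonymised, the sketch below averages scores per team and suppresses any team with too few responses. The response format, the question names, the `summarise_survey` helper, and the minimum of three responses are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Assumption: teams with fewer responses than this are suppressed so that
# individual answers cannot be inferred from the team average.
MIN_RESPONSES = 3


def summarise_survey(responses, min_responses=MIN_RESPONSES):
    """Return the mean score per question for each team, or None if suppressed."""
    by_team = defaultdict(list)
    for response in responses:
        by_team[response["team"]].append(response["scores"])

    summary = {}
    for team, rows in by_team.items():
        if len(rows) < min_responses:
            summary[team] = None  # too few responses to report anonymously
            continue
        questions = rows[0].keys()
        summary[team] = {
            q: round(mean(row[q] for row in rows), 2) for q in questions
        }
    return summary


# Made-up responses on a 1 (fine) to 5 (severe) scale.
responses = [
    {"team": "payments", "scores": {"sleep_disruption": 4, "alert_noise": 2}},
    {"team": "payments", "scores": {"sleep_disruption": 3, "alert_noise": 3}},
    {"team": "payments", "scores": {"sleep_disruption": 5, "alert_noise": 4}},
    {"team": "search", "scores": {"sleep_disruption": 2, "alert_noise": 1}},
]
report = summarise_survey(responses)
print(report)
```

Suppressing small groups trades some coverage for trust: engineers are more likely to answer honestly if a single response can never be singled out.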

3. Team Behaviours to Encourage

  • Be honest about pain points - emotional safety is as important as system safety.
  • Discuss on-call experiences as a team, not just privately with managers.
  • Normalise taking time off or deferring work after high-intensity shifts.
  • Recognise and celebrate operational excellence and responders’ efforts.

4. Watch Out For…

  • Feedback not being actioned - leads to apathy and resignation.
  • Rotations that overburden a few individuals or a few specialist skill sets.
  • Normalising toil or heroics - long-term sustainability matters more.
  • Failing to invest in better tooling, documentation, or automation.
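Uneven load is easier to spot with a simple distribution check. The sketch below flags anyone who handled more than a chosen share of all pages in a period; the `overloaded_responders` helper, the names, and the 40% threshold are hypothetical and should be tuned to the size of the rotation.

```python
from collections import Counter


def overloaded_responders(page_owners, max_share=0.4):
    """Return {responder: share} for anyone above max_share of all pages.

    page_owners is a list with one entry per page, naming who handled it.
    The 0.4 default is an illustrative threshold, not a recommended standard.
    """
    counts = Counter(page_owners)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        person: count / total
        for person, count in counts.items()
        if count / total > max_share
    }


# Made-up data: ten pages in a quarter, handled by three people.
owners = ["asha"] * 6 + ["ben"] * 2 + ["chen"] * 2
flagged = overloaded_responders(owners)
print(flagged)
```

A flagged name is a prompt for a conversation about staffing or knowledge sharing, not a verdict; one person may be absorbing pages simply because they are the only one who knows a system.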

5. Signals of Success

  • Engineers feel supported and confident while on call.
  • Alert fatigue and burnout decrease over time.
  • On-call shifts become learning opportunities, not just burdens.
  • Improvements in reliability are matched by improvements in wellbeing.
  • Leadership takes on-call health as seriously as system uptime.

Associated Standards
  • Systems recover quickly and fail safely
  • Operational readiness is tested before every major release
  • Operational tasks are automated before they become recurring toil
  • Developer workflows are fast and frictionless
  • Product and engineering decisions are backed by live data
