Practice: On-Call Rotation Health Checks
Purpose and Strategic Importance
On-Call Rotation Health Checks ensure the sustainability, fairness, and effectiveness of on-call duties. They provide visibility into workload, fatigue, and organisational support - helping to protect team wellbeing while maintaining high service reliability.
When done consistently, these checks foster a healthy on-call culture where engineers feel supported, incidents are handled efficiently, and reliability practices improve through continuous feedback.
Description of the Practice
- Health checks assess the quality and impact of on-call experiences - including alert volume, incident load, response effort, sleep disruption, and post-incident recovery.
- They are conducted at regular intervals (e.g. post-shift, monthly, quarterly) through surveys, reviews, or debriefs.
- Insights guide improvements in tooling, automation, documentation, staffing, and compensation.
- Findings are shared with leadership to prioritise support and investment in operational excellence.
How to Practise It (Playbook)
1. Getting Started
- Create a simple retrospective format for post-on-call reviews (e.g. “What went well?”, “What was painful?”, “What should we fix?”).
- Track basic metrics such as the number of pages, mean time to acknowledge and respond, and the time of day at which alerts fire.
- Include wellbeing check-ins as part of the review - burnout and sleep loss matter.
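The basic metrics above can be computed from a plain list of page records. The following is a minimal sketch, not a reference implementation: the `Page` record, the `summarise_shift` function, and the 22:00 to 07:00 definition of "night" are all assumptions made for illustration, and real data would come from whatever alerting tool the team uses.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Page:
    """One page received during a shift (hypothetical record shape)."""
    triggered_at: datetime
    acknowledged_at: datetime


def summarise_shift(pages, night_start=22, night_end=7):
    """Return page count, mean minutes to acknowledge, and night-time page count.

    night_start and night_end are hours of the day; the defaults are an
    assumed definition of sleep hours and should be adjusted per team.
    """
    if not pages:
        return {"pages": 0, "mean_minutes_to_ack": None, "night_pages": 0}

    ack_minutes = [
        (p.acknowledged_at - p.triggered_at).total_seconds() / 60
        for p in pages
    ]
    # A page counts as night-time if it fired at or after night_start,
    # or before night_end the following morning.
    night_pages = sum(
        1
        for p in pages
        if p.triggered_at.hour >= night_start or p.triggered_at.hour < night_end
    )
    return {
        "pages": len(pages),
        "mean_minutes_to_ack": mean(ack_minutes),
        "night_pages": night_pages,
    }
```

A summary like this can be pasted into the post-on-call review so that the wellbeing discussion starts from the same numbers for every shift.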
2. Scaling and Maturing
- Run quarterly on-call health surveys to gather anonymised, quantitative feedback across teams.
- Review and tune alert thresholds, playbook coverage, and escalation policies based on feedback.
- Automate health metrics collection from alerting systems, rota tools (e.g. PagerDuty, Opsgenie), and incident platforms.
- Share themes and action plans with leadership to support systemic improvements.
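When survey results are shared with leadership, small teams can lose anonymity if their scores are reported individually. One possible way to handle this is sketched below, assuming responses arrive as `(team, score)` pairs on a 1 to 5 scale. The `aggregate_survey` function and the minimum group size of three are illustrative assumptions, not an established standard.

```python
from collections import defaultdict
from statistics import mean


def aggregate_survey(responses, min_responses=3):
    """Average survey scores per team, suppressing groups that are too small.

    responses: iterable of (team, score) pairs (hypothetical input shape).
    min_responses: assumed anonymity threshold; teams with fewer responses
    are reported as None rather than as an average.
    """
    scores_by_team = defaultdict(list)
    for team, score in responses:
        scores_by_team[team].append(score)

    report = {}
    for team, scores in scores_by_team.items():
        if len(scores) < min_responses:
            report[team] = None  # suppressed to protect anonymity
        else:
            report[team] = round(mean(scores), 2)
    return report
```

Suppressed teams can still be folded into an organisation-wide average, so their feedback is counted without being attributable.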
3. Team Behaviours to Encourage
- Be honest about pain points - emotional safety is as important as system safety.
- Discuss on-call experiences as a team, not just privately with managers.
- Normalise taking time off or deferring work after high-intensity shifts.
- Recognise and celebrate operational excellence and responders’ efforts.
4. Watch Out For…
- Feedback that is never actioned - this breeds apathy and resignation.
- Rotations that overburden a few individuals or the holders of scarce skills.
- Normalising toil or heroics - long-term sustainability matters more.
- Failing to invest in better tooling, documentation, or automation.
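Uneven load is easier to spot with a simple share calculation. The sketch below assumes a list containing one responder name per page handled. The function names and the 50% threshold are hypothetical and would need tuning to the size of the rotation.

```python
from collections import Counter


def page_share(responders):
    """Return each responder's fraction of all pages handled."""
    counts = Counter(responders)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def overloaded(responders, max_share=0.5):
    """List responders whose share of pages exceeds max_share (assumed threshold)."""
    shares = page_share(responders)
    return sorted(name for name, share in shares.items() if share > max_share)
```

For example, `overloaded(["ana", "ana", "ana", "ben", "chi"])` flags `ana`, who handled three of the five pages.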
5. Signals of Success
- Engineers feel supported and confident while on call.
- Alert fatigue and burnout decrease over time.
- On-call shifts become learning opportunities, not just burdens.
- Improvements in reliability are matched by improvements in wellbeing.
- Leadership takes on-call health as seriously as system uptime.