Practice: AI On-Call and Incident Ownership
Purpose and Strategic Importance
AI systems in production require the same operational discipline as any production software system — including clear ownership, defined on-call responsibilities, and structured incident response. Without this discipline, AI incidents are handled inconsistently, responsibilities are unclear, and response quality degrades under pressure. The team that built the model may not be the team that operates it; the team that operates it may not have the knowledge needed to diagnose complex model failures. Clear on-call and incident ownership structures bridge these gaps.
On-call practices also signal the team's commitment to the users and processes that depend on their AI systems. A team that cannot respond reliably to AI incidents is implicitly saying that production reliability is not a priority — a message that erodes trust with the business stakeholders and users who depend on AI systems to do their work. Structured on-call practices demonstrate operational maturity and build the confidence of everyone who relies on the team's AI systems.
Description of the Practice
- Establishes clear on-call ownership for every production AI system, with named individuals accountable for initial response to alerts and incidents within defined SLAs.
- Maintains on-call rotations that distribute the operational burden fairly across the team, preventing on-call burnout while ensuring that coverage is always available.
- Develops and maintains runbooks for common AI incident types — model performance degradation, data pipeline failures, serving infrastructure issues — that enable any on-call engineer to respond effectively.
- Defines escalation paths for incidents beyond the on-call engineer's authority or expertise, ensuring that complex or high-impact incidents are escalated promptly to the right people.
- Measures on-call load and incident response effectiveness, using this data to drive improvements in system reliability, runbook quality, and on-call experience.
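The ownership elements above — a named primary, a briefed backup, an ordered escalation path, and defined SLAs — can be sketched as a small data model. This is an illustrative shape only; the system name, engineer names, and SLA values are assumptions, not prescriptions.

```python
from dataclasses import dataclass, field

@dataclass
class OnCallOwnership:
    """Ownership record for one production AI system (illustrative model)."""
    system: str
    primary: str                       # named on-call engineer
    backup: str                        # briefed backup engineer
    escalation: list[str] = field(default_factory=list)  # ordered escalation path
    ack_sla_minutes: int = 15          # time to acknowledge an alert
    response_sla_minutes: int = 60     # time to begin initial response

    def next_escalation(self, level: int) -> str:
        """Return who to page at a given escalation level (0 = primary)."""
        chain = [self.primary, self.backup, *self.escalation]
        return chain[min(level, len(chain) - 1)]

# Hypothetical system and names, for illustration only.
fraud_model = OnCallOwnership(
    system="fraud-scoring-v3",
    primary="alice",
    backup="bob",
    escalation=["ml-lead", "platform-lead"],
)
```

Keeping this record in version control alongside the system's runbooks means the escalation path is reviewed whenever the system changes hands.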
How to Practise It (Playbook)
1. Getting Started
- Identify all production AI systems and ensure each has a named primary owner and at least one backup on-call engineer who is briefed on the system's architecture and common failure modes.
- Create a basic on-call runbook for your most important AI system, covering the steps to take for the most common alert conditions, including how to roll back to a previous model version.
- Establish an on-call schedule with fair distribution, defined coverage hours appropriate to the system's use patterns, and a clear process for managing schedule conflicts and escalations.
- Define on-call SLAs — how quickly must the on-call engineer acknowledge an alert, and how quickly must initial response begin — and communicate these to the stakeholders who depend on them.
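Once acknowledgement and initial-response SLAs are defined, compliance can be computed directly from alert timestamps. A minimal sketch, assuming each alert record carries raised, acknowledged, and responded times; the SLA values are placeholders:

```python
from datetime import datetime, timedelta

# Hypothetical SLA targets; use whatever your rota actually defines.
ACK_SLA = timedelta(minutes=15)
RESPONSE_SLA = timedelta(minutes=60)

def sla_compliance(alerts):
    """alerts: list of (raised_at, acked_at, responded_at) datetimes.
    Returns the fraction of alerts meeting both the ack and response SLAs."""
    if not alerts:
        return 1.0
    met = sum(
        1 for raised, acked, responded in alerts
        if acked - raised <= ACK_SLA and responded - raised <= RESPONSE_SLA
    )
    return met / len(alerts)

alerts = [
    # Acked in 5 min, responded in 40 min: within both SLAs.
    (datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 9, 5), datetime(2024, 5, 1, 9, 40)),
    # Acked in 30 min: breaches the acknowledgement SLA.
    (datetime(2024, 5, 2, 3, 0), datetime(2024, 5, 2, 3, 30), datetime(2024, 5, 2, 4, 30)),
]
```

Reporting this number per system, per rotation, gives stakeholders the SLA visibility described above.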
2. Scaling and Maturing
- Build on-call tooling integration — alert management, incident tracking, and communication channels — that reduces the friction of responding to incidents and supports coordination during complex events.
- Develop a runbook library covering the full range of incident types encountered in production, maintained and updated as new incident patterns emerge and existing runbooks prove inadequate.
- Implement regular on-call experience reviews — separate from post-incident reviews — that collect feedback from on-call engineers on what is making incidents harder or easier to handle.
- Create an on-call enablement programme that brings new team members up to speed on AI operational practices before they take their first on-call shift.
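A runbook library integrated with alert tooling can be as simple as a registry mapping alert types to runbook entries, so every page arrives with first steps and an escalation target attached. The alert names, steps, and escalation targets below are illustrative assumptions:

```python
# Minimal runbook registry sketch. All alert types, steps, and
# escalation targets here are hypothetical examples.
RUNBOOKS = {
    "model_performance_degradation": {
        "title": "Model performance degradation",
        "first_steps": [
            "Check the model-quality dashboard for the affected metric",
            "Compare input feature distributions against the training baseline",
            "If degradation is confirmed, roll back to the previous model version",
        ],
        "escalate_to": "ml-lead",
    },
    "data_pipeline_failure": {
        "title": "Data pipeline failure",
        "first_steps": [
            "Identify the failed pipeline stage from the orchestrator",
            "Check upstream data source availability",
        ],
        "escalate_to": "data-platform",
    },
}

def runbook_for(alert_type: str) -> dict:
    """Return the runbook for an alert type, falling back to a generic entry
    that also flags the gap so the library gets extended."""
    return RUNBOOKS.get(alert_type, {
        "title": "Unmapped alert",
        "first_steps": ["Triage manually and file a runbook gap"],
        "escalate_to": "team-lead",
    })
```

The fallback entry matters: an unmapped alert is itself a signal that the library needs updating, which is how the registry stays current as new incident patterns emerge.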
3. Team Behaviours to Encourage
- Treat on-call as a shared team responsibility, not a burden assigned to the most junior engineers or to the people who built the system most recently.
- Use on-call experience to drive reliability improvements — every incident that required significant diagnostic effort is an opportunity to improve monitoring, runbooks, or system design to make the next similar incident faster to resolve.
- Never ignore alerts, even when they appear low-priority — alert patterns often reveal emerging problems before their aggregate impact becomes visible in headline metrics.
- Protect on-call engineers' time after significant incidents — follow-on fatigue is a real operational risk, and teams that do not manage it create the conditions for the next incident.
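The point about low-priority alerts can be made concrete with a simple week-over-week comparison: an alert type whose frequency jumps is worth investigating even if each individual firing looks benign. A sketch with assumed thresholds and hypothetical alert names:

```python
from collections import Counter

def rising_alerts(this_week, last_week, factor=2.0, min_count=5):
    """Flag alert types whose weekly count grew by at least `factor`.
    this_week / last_week: lists of alert-type strings, one per firing.
    Thresholds are illustrative assumptions, not recommended values."""
    now, before = Counter(this_week), Counter(last_week)
    return sorted(
        alert for alert, count in now.items()
        # max(..., 1) avoids division-style blow-ups for brand-new alert types
        if count >= min_count and count >= factor * max(before[alert], 1)
    )

last_week = ["feature_null_rate"] * 2 + ["latency_p99"] * 6
this_week = ["feature_null_rate"] * 5 + ["latency_p99"] * 6
```

Here `feature_null_rate` is flagged because it more than doubled, while the steady `latency_p99` noise is not — the kind of emerging pattern that headline metrics would miss.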
4. Watch Out For…
- On-call practices that exist for software engineers but do not extend to AI-specific on-call scenarios — model performance degradation requires different diagnostic skills and runbooks than infrastructure failures.
- Runbooks that are incomplete, outdated, or so generic that they provide no practical guidance for the specific systems they are supposed to cover.
- On-call rotations that are theoretically fair but practically burdensome because the system is not sufficiently reliable — the sustainable solution is improving system reliability, not normalising high on-call load.
- On-call ownership that is assigned but not staffed with knowledge — a named on-call engineer who does not understand the AI system they are responsible for cannot provide effective incident response.
5. Signals of Success
- Every production AI system has named, briefed on-call coverage with up-to-date runbooks and a tested escalation path.
- On-call SLAs for alert acknowledgement and initial response are consistently met, with SLA compliance tracked and reported.
- On-call engineers report that incidents are manageable using available runbooks and tooling, and that post-incident improvements are reducing the frequency and complexity of recurrent incidents.
- On-call load is distributed fairly across the team, with no individual experiencing significantly more on-call burden than their peers over time.
- The number of incidents requiring escalation beyond the runbook is decreasing as automation and better monitoring reduce the manual response burden on on-call engineers.
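Fair load distribution is easy to check once shifts (or pages) per engineer are counted: compare the heaviest individual load to the team mean. A minimal sketch, with hypothetical names and counts:

```python
def load_imbalance(shifts_per_engineer):
    """Ratio of the heaviest individual on-call load to the team mean.
    1.0 means perfectly even; thresholds for 'unfair' are a team judgment call.
    shifts_per_engineer: dict mapping engineer -> number of shifts (or pages)."""
    counts = list(shifts_per_engineer.values())
    mean = sum(counts) / len(counts)
    return max(counts) / mean

# Illustrative quarter of rotation data.
shifts = {"alice": 6, "bob": 5, "carol": 7}
```

Tracking this ratio over rolling windows, rather than a single rotation, matches the "over time" framing above: short-term swaps are fine, persistent skew is not.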