Ragan McGill

Practice: Runbooks and Playbooks

Purpose and Strategic Importance

Runbooks and Playbooks provide step-by-step guidance for handling operational tasks, common procedures, and unexpected incidents. They reduce cognitive load, accelerate response, and increase confidence by making institutional knowledge explicit and accessible.

These documents help ensure that operational work, whether routine or emergent, is performed consistently, safely, and with a clear understanding of risks and expected outcomes. They are a foundational practice for resilient, scalable systems.

Description of the Practice

Runbooks document how to perform recurring tasks (e.g. deploy a service, restart a job, rotate a key).
Playbooks describe how to diagnose and recover from incidents (e.g. high latency, failed backups, user-facing outages).
Both include context, assumptions, checks, remediation steps, and escalation paths.
Both are stored in searchable, version-controlled locations (e.g. Git, Confluence, Ops platforms) and kept close to the systems they support.

How to Practise It (Playbook)

1. Getting Started

Identify critical services and common tasks or failure modes.
Start simple: title, context, preconditions, steps, verification, and escalation.
Standardise format and location so that teams know where to find documents and how to contribute to them.
Encourage engineers to use and validate documentation in real time (e.g. during incidents).
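
One way to make the "start simple" structure concrete is to treat each runbook as a small record and check that no section has been left empty. The following is a minimal sketch in Python, not a prescribed format: the `Runbook` class, its field names, and the example job are all hypothetical and simply mirror the sections listed above.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class Runbook:
    """Hypothetical minimal runbook record; fields mirror the suggested sections."""
    title: str = ""
    context: str = ""
    preconditions: List[str] = field(default_factory=list)
    steps: List[str] = field(default_factory=list)
    verification: List[str] = field(default_factory=list)
    escalation: str = ""


def missing_sections(runbook: Runbook) -> List[str]:
    """Return the names of sections that are still empty, in declaration order."""
    return [f.name for f in fields(runbook) if not getattr(runbook, f.name)]


# Illustrative draft only; the job and its steps are made up.
restart_job = Runbook(
    title="Restart the nightly export job",
    context="Used when the export has not completed by the start of the working day.",
    preconditions=["You have access to the job scheduler"],
    steps=["Confirm the job is not already running", "Trigger the job manually"],
)

print(missing_sections(restart_job))  # ['verification', 'escalation']
```

A check like this could run when a runbook is submitted for review, so that drafts without verification or escalation details are caught before anyone relies on them.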

2. Scaling and Maturing

Tag documents with service names, severity levels, and ownership metadata.
Link runbooks to monitoring alerts or incident dashboards for fast access.
Turn frequently executed runbooks into automated workflows (e.g. scripts, bots, self-healing tasks).
Periodically review and test runbooks through game days or chaos engineering scenarios.
Track usage to identify gaps, outdated steps, or opportunities for automation.
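
Tagging, alert linking, and usage tracking can be sketched together as a small in-memory catalogue. This is an illustration under stated assumptions rather than a reference to any particular tool: `RunbookEntry`, `RunbookIndex`, the paths, the team names, and the alert names are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class RunbookEntry:
    """Hypothetical catalogue record; all field names are illustrative."""
    path: str
    service: str
    severity: str
    owner: str
    alerts: Tuple[str, ...]  # alert names that should link to this runbook


class RunbookIndex:
    """Maps alert names to runbooks and counts lookups to expose gaps."""

    def __init__(self, entries: Iterable[RunbookEntry]) -> None:
        self._by_alert: Dict[str, RunbookEntry] = {}
        for entry in entries:
            for alert in entry.alerts:
                self._by_alert[alert] = entry
        self.hits: Counter = Counter()    # runbook path -> number of lookups
        self.misses: Counter = Counter()  # alert name -> lookups with no runbook

    def for_alert(self, alert_name: str) -> Optional[RunbookEntry]:
        """Return the runbook linked to an alert, recording a hit or a miss."""
        entry = self._by_alert.get(alert_name)
        if entry is None:
            self.misses[alert_name] += 1
        else:
            self.hits[entry.path] += 1
        return entry


# Made-up catalogue for illustration.
CATALOGUE = [
    RunbookEntry("runbooks/checkout/high-latency.md", "checkout", "sev2",
                 "team-payments", ("CheckoutLatencyHigh",)),
    RunbookEntry("runbooks/backup/failed-backup.md", "backup", "sev3",
                 "team-platform", ("BackupJobFailed", "BackupJobMissing")),
]

index = RunbookIndex(CATALOGUE)
index.for_alert("BackupJobFailed")   # linked runbook found
index.for_alert("DiskAlmostFull")    # no runbook yet: recorded as a gap
print(index.misses.most_common())
```

Alerts that show up repeatedly in `misses` are candidates for new runbooks, and paths with high `hits` counts are candidates for automation.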

3. Team Behaviours to Encourage

Write for your future self - be clear, calm, and explicit.
Update documentation after every use or incident.
Share new runbooks during ops reviews or in Slack for awareness.
Treat runbooks as shared assets - everyone contributes, everyone benefits.

4. Watch Out For…

Documents becoming stale - if unused or unchecked, they decay.
Hidden knowledge - tribal workflows that aren’t captured.
Over-complexity - too many steps, assumptions, or jargon.
Playbooks without ownership - unclear who maintains or improves them.
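
Two of these risks, stale documents and missing ownership, are easy to check mechanically if each document carries an owner and a last-reviewed date. The sketch below assumes such metadata exists; `RunbookMeta`, the `audit` function, the example paths, and the 180-day threshold are all illustrative choices, not recommendations from this practice.

```python
from datetime import date, timedelta
from typing import Iterable, List, NamedTuple, Optional, Tuple


class RunbookMeta(NamedTuple):
    """Hypothetical metadata kept alongside each runbook or playbook."""
    path: str
    owner: Optional[str]
    last_reviewed: date


def audit(
    documents: Iterable[RunbookMeta],
    today: date,
    max_age: timedelta = timedelta(days=180),  # assumed review interval
) -> List[Tuple[str, str]]:
    """Return (path, problem) pairs for documents with no owner or an overdue review."""
    findings: List[Tuple[str, str]] = []
    for doc in documents:
        if not doc.owner:
            findings.append((doc.path, "no owner"))
        if today - doc.last_reviewed > max_age:
            findings.append((doc.path, "review overdue"))
    return findings


# Made-up metadata for illustration.
docs = [
    RunbookMeta("runbooks/rotate-key.md", "team-security", date(2024, 5, 1)),
    RunbookMeta("playbooks/failed-backup.md", None, date(2023, 1, 15)),
]

for path, problem in audit(docs, today=date(2024, 6, 1)):
    print(f"{path}: {problem}")
```

Run on a schedule, a report like this turns "documents decay" from a vague worry into a short list of named files that someone can act on.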

5. Signals of Success

Incidents are handled faster, with less stress and fewer errors.
Operational tasks are performed consistently across shifts and teams.
Engineers trust documentation and use it by default, not as a last resort.
Playbooks evolve with the system and reflect real-world learnings.
Automation increases as confidence in documented processes grows.