Incidents Are Inevitable
Every complex system will fail. The question is not whether failures will occur but when, how severely, and how quickly you will detect and recover from them.
Engineering organisations that treat incidents as evidence of failure create environments where people hide problems, avoid deployment, and optimise for looking good rather than learning. Engineering organisations that treat incidents as inevitable events to be managed and learned from create environments where failures surface quickly, are resolved efficiently, and make the system stronger.
The incident management practices described here are built on that second foundation.
Incident Severity Classification
A clear severity classification scheme is the foundation of incident management. It answers the first question anyone needs to ask: how serious is this?
Severity classifications should map to concrete impact criteria, not vague descriptions. A common four-level scheme:
Severity 1 - Critical: Full service outage or severe degradation affecting all or most users. Payment processing failing. Authentication unavailable. Data loss occurring. Immediate response required regardless of time of day.
Severity 2 - Major: Significant feature or service degradation affecting a substantial proportion of users, or limited impact on a critical capability. Response should begin within minutes during business hours, and within an hour at any other time.
Severity 3 - Minor: Limited impact, workaround available, small proportion of users affected. Response within business hours is appropriate.
Severity 4 - Low: Cosmetic issues, minor errors, no user impact. Tracked and addressed through normal development process.
The classification should be based on observable impact, not on gut feel about how bad it might get. A partial outage that affects 5% of users is not a Sev 1 because it feels scary - it is classified by the actual impact criteria. This consistency reduces the escalation drama that occurs when severity is subjective.
Document the expected response for each severity: who is paged, what the initial response time expectation is, what communication is required.
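The severity-to-response mapping can be captured in a small, version-controlled structure so classification and paging rules stay consistent rather than living in people's heads. A minimal sketch; the field names and thresholds here are illustrative, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeverityPolicy:
    """Expected response for one severity level."""
    level: int
    description: str
    page_on_call: bool          # page immediately, regardless of hour
    response_time_minutes: int  # target time to first responder action
    public_status_update: bool  # does this severity trigger external comms

# Illustrative four-level scheme matching the criteria above.
SEVERITY_POLICIES = {
    1: SeverityPolicy(1, "Full outage or severe degradation", True, 5, True),
    2: SeverityPolicy(2, "Major degradation, substantial user impact", True, 60, True),
    3: SeverityPolicy(3, "Limited impact, workaround available", False, 480, False),
    4: SeverityPolicy(4, "Cosmetic, no user impact", False, 2880, False),
}

def expected_response(level: int) -> SeverityPolicy:
    """Look up the documented response expectations for a severity level."""
    return SEVERITY_POLICIES[level]
```

Keeping this alongside the runbooks means a change to response expectations is a reviewed change, not a verbal agreement.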
On-Call and Its Human Cost
On-call is the practice of engineers being available outside normal hours to respond to production incidents. It is necessary for services with real-time availability requirements. It also carries significant human cost that many engineering organisations manage poorly.
Alert Fatigue
The most common on-call failure mode is alert fatigue. Engineers receive so many alerts, many of which resolve without action or are not actionable, that they begin to tune them out. A genuine Sev 1 alert arriving in a sea of noise is missed or de-prioritised.
Alert quality matters more than alert quantity. Every alert should represent a condition that requires human attention. Alerts that fire for conditions that resolve automatically, that are informational rather than actionable, or that are duplicates of other alerts should be eliminated. The goal is an on-call experience where every alert requires and receives genuine attention.
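Alert quality can be made measurable by auditing alert history and flagging alerts that routinely resolve without anyone acting on them. A rough sketch, assuming you can export alert firings with a flag recording whether a human took action:

```python
from collections import defaultdict

def noisy_alerts(alert_events, threshold=0.8):
    """Flag alert names where most firings resolved without human action.

    alert_events: iterable of (alert_name, acted_on) tuples, where
    acted_on is True if a human took action on that firing.
    Returns alert names whose no-action rate exceeds the threshold -
    candidates for tuning or elimination.
    """
    fired = defaultdict(int)
    no_action = defaultdict(int)
    for name, acted_on in alert_events:
        fired[name] += 1
        if not acted_on:
            no_action[name] += 1
    return {name for name in fired
            if no_action[name] / fired[name] > threshold}
```

Reviewing this output in a regular on-call retrospective turns "the pager is noisy" from a complaint into a ranked work list.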
On-Call Load Distribution
On-call load should be distributed fairly across a team. A rotation that places the same one or two engineers on call repeatedly is both unsustainable and a single point of failure. Broader participation in on-call also distributes knowledge of production systems - engineers who have to respond to production incidents develop a different understanding of the system than engineers who only write code.
On-Call Compensation and Recovery Time
Engineers who carry on-call responsibility outside normal hours should be compensated for that responsibility and for time spent on incidents. The specific mechanism varies by organisation and jurisdiction, but the principle is clear: on-call is work, and work should be compensated.
Following a night of significant incident response, engineers need recovery time. An engineer who spent four hours resolving a 3am incident should not be expected to perform at full capacity the following day. Build recovery time into your on-call expectations.
The Incident Commander Role
For Sev 1 and Sev 2 incidents, establishing a clear incident command structure is essential. Without it, incidents become coordination failures: multiple people investigating independently, duplicate work, conflicting communications, and no clear ownership for decisions.
The incident commander (IC) is the person coordinating the response. They are not necessarily the person doing the investigation - they are the person who owns the process and ensures the response is organised effectively.
The IC's responsibilities during an incident:
- Maintain a shared understanding of current status among all responders
- Ensure the right people are engaged and doing the right things
- Own external communication - what goes to stakeholders, when, and what it says
- Make decisions when there is ambiguity or disagreement among responders
- Track actions and ensure nothing falls through the cracks
- Declare the incident resolved when service is restored and stabilised
The IC role can be filled by anyone with appropriate training and authority - it does not need to be the most senior technical person. In fact, keeping the most senior technical people free to investigate while a dedicated IC handles coordination usually produces faster resolution.
Communication During an Incident
Communication failures during incidents are at least as common as technical failures. The right people are not informed, responders receive conflicting information, and stakeholders learn about outages from customer complaints rather than from the engineering team.
Internal Communication
Establish a dedicated incident communication channel for each significant incident. A Slack channel named for the incident allows all communication to be in one place, searchable after the fact, and clear about the timeline of events.
The IC should post status updates in this channel at regular intervals - every fifteen minutes for Sev 1 incidents is a common standard. Each update should cover: current status, what is being investigated, what actions are in progress, and expected next update time.
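The four-part update can be templated so ICs do not have to remember the structure under pressure. A minimal sketch; the field names and rendered format are assumptions, not a standard:

```python
from dataclasses import dataclass

@dataclass
class StatusUpdate:
    """One IC status update, covering the four standard elements."""
    current_status: str
    investigating: str
    actions_in_progress: str
    next_update: str  # e.g. "15:45 UTC"

    def render(self) -> str:
        """Format the update for posting to the incident channel."""
        return (
            f"STATUS: {self.current_status}\n"
            f"INVESTIGATING: {self.investigating}\n"
            f"ACTIONS: {self.actions_in_progress}\n"
            f"NEXT UPDATE: {self.next_update}"
        )
```

Posting the rendered text on a fixed cadence also produces a clean, machine-readable record for the postmortem timeline.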
External Communication
For incidents affecting users, external communication is required. Define in advance who has authority to post public status updates, what level of incident triggers public communication, and what the communication template looks like.
The worst communications during incidents are vague, late, and defensive. The best are specific about impact, honest about uncertainty, proactive rather than reactive, and focused on what is being done rather than on minimising apparent severity.
"We are investigating reports of slow loading times" - vague and passive.
"We are experiencing elevated error rates on our checkout service. Approximately 15% of checkout attempts are failing. We have identified a likely cause and are working on a fix. We will update in 30 minutes." - specific, honest, and informative.
Blameless Postmortems
A postmortem is a structured review of a significant incident, conducted after service is restored. The purpose is to understand what happened and why, and to identify changes that will prevent recurrence or reduce impact in future.
The blameless postmortem - sometimes called a blame-free postmortem or learning review - is built on the principle that people generally make the best decisions they can with the information and tools available to them. When things go wrong, the cause is almost always systemic - the information was wrong, the tools were inadequate, the process was unclear - rather than individual incompetence or malice.
How to Run One
A good postmortem follows a structured format:
Timeline reconstruction. Build a shared timeline of events from the first symptom to full restoration. This is done collaboratively, with input from everyone involved in the response. The timeline often reveals surprising things about when symptoms first appeared, how long detection took, and where response time was lost.
Contributing factors. For each element of the timeline that contributed to the incident or extended its duration, ask "what conditions made this possible?" Focus on systemic factors: monitoring gaps, unclear runbooks, insufficient testing, deployment practices, access controls.
What went well. Explicitly document what worked during the incident. Detection was fast. The runbook was accurate. The IC kept communication clear. Understanding what works preserves it.
Action items. Concrete actions with owners and due dates. Not aspirational goals - specific changes to be made. Each action should address a contributing factor identified in the analysis.
What the postmortem should not include. Do not name individuals as the cause of the incident. Do not use language that assigns blame. "The engineer deployed broken code" is blameful. "The deployment pipeline did not catch the regression before production" is analytical.
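Mechanically, the timeline-reconstruction step above is often just merging timestamped events from several sources - chat logs, alert history, deploy records - into one chronological sequence. A sketch, assuming each source can be exported as sorted (timestamp, source, description) tuples:

```python
import heapq
from datetime import datetime

def merge_timeline(*sources):
    """Merge timestamped event streams into one chronological timeline.

    Each source is a list of (datetime, source_name, description) tuples,
    assumed already sorted within itself. heapq.merge keeps the overall
    order without re-sorting the combined list.
    """
    return list(heapq.merge(*sources, key=lambda event: event[0]))
```

The merged view is where the surprises tend to appear: the gap between the first symptom and the first alert, or between the alert and the first human response.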
The 24-48 Hour Rule
Postmortems should be conducted within 24-48 hours of incident resolution for Sev 1 incidents, while the events are fresh and the timeline can be accurately reconstructed. Postmortems conducted a week later rely on imperfect memory and produce less accurate analyses.
Action Item Tracking
Postmortems produce action items. Action items that are not tracked and closed are waste: the postmortem produces the feeling of improvement without the substance.
Track postmortem action items in the same place as other engineering work - the team backlog. Assign them to specific people. Set due dates. Review open postmortem actions as a standing item in retrospectives.
Measure the close rate for postmortem actions over time. A team with a high close rate is learning from incidents. A team with a low close rate is completing the ceremony of postmortems without getting the benefit.
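The close rate itself is a simple ratio, but guarding the empty case and computing it per period keeps the trend honest. A sketch, assuming each action item records whether it has been closed:

```python
def action_close_rate(items):
    """Fraction of postmortem action items that have been closed.

    items: list of dicts, each with a 'closed' boolean (the field name
    is an assumption - adapt to your tracker's export format).
    Returns 0.0 for an empty list rather than dividing by zero.
    """
    if not items:
        return 0.0
    closed = sum(1 for item in items if item["closed"])
    return closed / len(items)
```

Plotting this per quarter distinguishes a team that is learning from incidents from one that is only performing the ceremony.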
Building a Learning Culture
The difference between an incident response culture that improves the organisation and one that just exhausts people is learning.
Incidents contain information. Information about where monitoring is insufficient. About where the runbooks are wrong. About where the architecture is fragile. About where the on-call process breaks down.
When that information is extracted through good postmortem practice and turned into system improvements, incidents strengthen the organisation. When the information is discarded because the incident was painful and everyone wants to move on, the same incidents repeat.
Engineering leaders set the culture through their behaviour during and after incidents. Leaders who ask "who caused this?" after an incident create blame cultures. Leaders who ask "what conditions made this possible and what would change them?" create learning cultures. The difference in outcomes over a year of incidents is substantial.