*Engineering team collaborating at a whiteboard during an incident review*

It’s 2:47 AM on a Tuesday. Your phone buzzes. A critical service is down. Real users, real impact, real consequences. What you do in the next 15 minutes determines whether this is a 20-minute incident or a 4-hour disaster. Those first 15 minutes matter not because that’s when you fix things, but because that’s when you set the trajectory. Good first 15 minutes: clear ownership, fast triage, calm communication. Bad first 15 minutes: confusion, panic, three engineers debugging the same thing while nobody talks to the customer-facing team. I’ve been on both sides, and the difference is always the process, not the people.
“The problem is never how to get new, innovative thoughts into your mind, but how to get the old ones out.” — Dee Hock (Visa founder, on why clear process beats heroics)

The First 15 Minutes: A Timeline

Here’s what should happen, minute by minute. Practice this before you need it — the first time you follow the playbook should never be during a real incident.
| Time | Action | Owner |
|---|---|---|
| T+0 | Alert fires, on-call acknowledges | On-call engineer |
| T+2 min | Severity assessment (SEV 1-4) | On-call engineer |
| T+3 min | Incident channel created, roles assigned | On-call engineer |
| T+5 min | First status update to stakeholders | Communicator |
| T+10 min | Initial hypothesis formed, first mitigation attempted | Resolver(s) |
| T+15 min | Escalation decision: do we need more people? | Incident commander |
| T+30 min | Second status update, refined diagnosis | Communicator |
| T+60 min | Hourly updates until resolution | Communicator |
| Resolution | Final update, channel archived | Commander |
| T+48 hrs | Postmortem written and reviewed | All participants |
When in doubt about severity, escalate up. The cost is asymmetric: a SEV 2 that gets upgraded to SEV 1 after 30 minutes has already lost 30 minutes of under-resourced response, while a SEV 1 that gets downgraded to SEV 2 after 5 minutes has cost only 5 minutes of extra attention. Err toward higher severity.

Severity Levels

Getting severity wrong has real costs. Too low, and a serious outage simmers while one engineer casually investigates. Too high, and you wake up leadership for a broken tooltip. Use a clear rubric:
| Severity | What It Means | Expected Response | Example |
|---|---|---|---|
| SEV 1 | Customer-facing outage, data loss risk, or financial impact | All hands, war room, executive communication | Payment processing down |
| SEV 2 | Major feature degraded, affecting many users | On-call + team lead, regular status updates | Search returning stale results for 30+ minutes |
| SEV 3 | Minor feature broken, workaround exists | On-call investigates during business hours | CSV export timing out for large datasets |
| SEV 4 | Cosmetic issue, no functional impact | Ticket created, fixed in normal sprint | Button misaligned on one page |
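The rubric is a decision tree, which makes it easy to turn into a function the on-call engineer (or an alerting pipeline) can answer in seconds. A sketch under the assumption that four yes/no questions capture the table above; real rubrics usually need more inputs:

```python
def assess_severity(customer_facing_outage: bool,
                    data_loss_or_financial_impact: bool,
                    major_feature_degraded: bool,
                    minor_feature_broken: bool) -> int:
    """Map the severity rubric to a SEV level (1 = worst).
    Checks the most severe conditions first, mirroring the table."""
    if customer_facing_outage or data_loss_or_financial_impact:
        return 1  # e.g. payment processing down
    if major_feature_degraded:
        return 2  # e.g. search returning stale results
    if minor_feature_broken:
        return 3  # workaround exists; business-hours investigation
    return 4      # cosmetic only; normal sprint work
```

Ordering matters: checking the SEV 1 conditions first encodes the "err toward higher severity" rule directly in the control flow.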

The Three Roles

Every incident needs exactly three roles. Not five, not “everyone jump in”: three. When roles are unclear, people either duplicate work or assume someone else is handling it.
  • Incident Commander (IC). Owns the incident. Makes decisions about escalation, mitigation strategy, and when to declare resolution. The IC doesn’t debug; they coordinate. Think of them as air traffic control, not the pilot.
  • Communicator. Writes status updates, handles Slack messages to stakeholders, updates the status page, and fields questions from non-engineering teams. This role exists so the resolver can focus entirely on fixing things.
  • Resolver. The engineer(s) actually debugging and fixing the issue. They share findings in the incident channel but don’t write external communications or answer “what’s the ETA?” questions.
The communicator role is the most undervalued and most impactful. I’ve seen incidents with a dedicated communicator resolve 40% faster — not because communication fixes bugs, but because it frees the resolver from constant context-switching.
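The "exactly three roles, distinct people" rule can itself be enforced at incident start, for instance by the bot that creates the incident channel. A hypothetical sketch; the `IncidentRoles` type is illustrative, not a real library:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IncidentRoles:
    commander: str          # coordinates; does not debug
    communicator: str       # writes all external updates
    resolvers: tuple        # the only people running queries and fixes

    def validate(self) -> None:
        """Reject role assignments where one person wears two hats,
        or where no one is actually fixing the problem."""
        people = {self.commander, self.communicator, *self.resolvers}
        if len(people) != 2 + len(self.resolvers):
            raise ValueError("each role needs a distinct person")
        if not self.resolvers:
            raise ValueError("at least one resolver is required")

# Names are placeholders for whoever is on call.
roles = IncidentRoles("alice", "bob", ("carol",))
roles.validate()
```

Making the check fail loudly at channel-creation time is cheaper than discovering mid-incident that the commander is also the only resolver.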

Communication Templates

Under stress, people write terrible updates. “We’re looking into it” tells stakeholders nothing. Templates solve this by removing the need to think about format when you should be thinking about the problem. First status update (T+5 min):
```
INCIDENT: [Brief description]
Severity: SEV [1-4]
Impact: [Who is affected and how]
Status: Investigating
Commander: @[name]

We detected [what happened] at [time]. Currently investigating root cause.
[Number] customers are affected. Next update in 25 minutes.
```
Progress update (T+30 min, then hourly):
```
UPDATE: [Brief description]
Status: Investigating / Identified / Mitigating / Resolved
What we know: [Current understanding]
What we’re doing: [Specific mitigation steps]
Impact: [Updated scope]
ETA: [Honest estimate, or “unknown — next update in 30 min”]
```
Resolution update:
```
RESOLVED: [Brief description]
Duration: [Total time from detection to resolution]
Root cause: [One sentence]
Customer impact: [Who was affected, for how long]

A full postmortem will be published within 48 hours.
```
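Because the templates are fill-in-the-blank, they drop straight into code, which is how you get a bot to post the first update instead of trusting a stressed human to format it. A sketch using Python's standard `string.Template`; the field values are invented examples:

```python
from string import Template

# The first status update (T+5 min) as a reusable template.
FIRST_UPDATE = Template(
    "INCIDENT: $description\n"
    "Severity: SEV $sev\n"
    "Impact: $impact\n"
    "Status: Investigating\n"
    "Commander: @$commander\n\n"
    "We detected $description at $time. Currently investigating root cause. "
    "$affected customers are affected. Next update in 25 minutes."
)

# substitute() raises KeyError if a field is missing, so an
# incomplete update never gets posted silently.
update = FIRST_UPDATE.substitute(
    description="checkout errors",
    sev=1,
    impact="EU checkout requests failing",
    commander="alice",
    time="02:47 UTC",
    affected="~1,200",
)
```

The same approach works for the progress and resolution templates; the point is that the communicator fills in six blanks rather than composing prose at 3 AM.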

War Room Rules

When multiple people are in an incident channel, chaos is the default. These rules keep things focused:
  • Prefix messages with your role. [RESOLVER] Checking database metrics vs [COMMS] Posting status page update. Prevents crosstalk.
  • No drive-by debugging. If you’re not an assigned resolver, don’t start running queries. Coordinate with the IC first.
  • Thread long investigations. The main channel should be a clean timeline. Deep dives go in threads.
  • Announce before you act. “I’m going to restart the API pods” gives the IC a chance to say “wait — let me capture metrics first.”
  • Update even when nothing has changed. Silence during an incident is terrifying. “Still investigating, no new findings” is better than 20 minutes of nothing.

The Postmortem: Structure That Actually Works

Most postmortems are useless — they describe what happened, assign vague action items like “improve monitoring,” and are never read again. Use this structure instead:
| Section | What to Include | Why It Matters |
|---|---|---|
| Summary | 2-3 sentences: what happened, who was affected, how it was resolved | Quick context for anyone reading later |
| Impact | Users affected, revenue impact, duration of customer-facing degradation | Quantifies the severity honestly |
| Timeline (UTC) | Minute-by-minute log from detection to resolution | Reveals where time was lost |
| Root Cause | Detailed technical explanation: not “the DB was slow” but the specific query, PR, and missing index | Prevents vague understanding |
| What Went Well | Fast detection, clear runbooks, quick mitigation | Reinforces good practices |
| What Didn’t Go Well | Missed in code review, no load test, delayed status update | Identifies systemic gaps |
| Action Items | Each with an owner, due date, and ticket number | The only part that prevents repeats |
“Improve monitoring” is not an action item. “Add alert for connection pool utilization above 80%, owned by @engineer, due March 15, tracked in ENG-1234” is an action item. Every action item needs an owner, a due date, and a ticket — or it won’t happen.
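The owner/date/ticket rule is mechanical enough to lint automatically, for example in the tool that publishes postmortems. A sketch assuming ISO dates and `ENG-1234`-style ticket IDs as in the example above; any other conventions would need different patterns:

```python
import re
from dataclasses import dataclass

@dataclass
class ActionItem:
    description: str
    owner: str      # e.g. "@engineer"
    due_date: str   # e.g. "2025-03-15"
    ticket: str     # e.g. "ENG-1234"

def is_trackable(item: ActionItem) -> bool:
    """True only if the item has a named owner, an ISO due date,
    and a PROJECT-123-style ticket. 'Improve monitoring' fails all three."""
    return (item.owner.startswith("@")
            and bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", item.due_date))
            and bool(re.fullmatch(r"[A-Z]+-\d+", item.ticket)))
```

Rejecting untrackable items at publish time is what turns "every action item needs an owner, a due date, and a ticket" from a guideline into a guarantee.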

Building an Incident Culture

The hardest part isn’t the process — it’s the culture. Blameless postmortems sound good in theory; in practice, they require active enforcement.
  • Never ask “who broke this?” Ask “what made this breakage possible?” The human who wrote the bug is the least interesting factor. The system that allowed it to reach production is what you fix.
  • Celebrate fast detection and clean response. Highlight incidents that were handled well. The message isn’t “we had fewer incidents” — it’s “we responded in 4 minutes and zero customers noticed.”
  • Make incident participation career-positive. If someone runs a great incident, mention it in their performance review. If on-call work is invisible, your best people will avoid it.
  • Review postmortem action items monthly. Incomplete action items are the number one way incidents repeat.
The goal isn’t zero incidents — that’s impossible at scale. The goal is that every incident makes the system more resilient, and no incident happens the same way twice.