It’s 2:47 AM on a Tuesday. Your phone buzzes. A critical service is down. Real users, real impact, real consequences. What you do in the next 15 minutes determines whether this is a 20-minute incident or a 4-hour disaster.
Those first 15 minutes matter not because that’s when you fix things — it’s when you set the trajectory. Good first 15 minutes: clear ownership, fast triage, calm communication. Bad first 15 minutes: confusion, panic, three engineers debugging the same thing while nobody talks to the customer-facing team. I’ve been on both sides of this, and the difference is always the process, not the people.
“The problem is never how to get new, innovative thoughts into your mind, but how to get the old ones out.”
— Dee Hock (Visa founder, on why clear process beats heroics)
The First 15 Minutes: A Timeline
Here’s what should happen, minute by minute. Practice this before you need it — the first time you follow the playbook should never be during a real incident.
| Time | Action | Owner |
|---|---|---|
| T+0 | Alert fires, on-call acknowledges | On-call engineer |
| T+2 min | Severity assessment (SEV 1-4) | On-call engineer |
| T+3 min | Incident channel created, roles assigned | On-call engineer |
| T+5 min | First status update to stakeholders | Communicator |
| T+10 min | Initial hypothesis formed, first mitigation attempted | Resolver(s) |
| T+15 min | Escalation decision — do we need more people? | Incident commander |
| T+30 min | Second status update, refined diagnosis | Communicator |
| T+60 min | Hourly updates until resolution | Communicator |
| Resolution | Final update, channel archived | Commander |
| T+48 hrs | Postmortem written and reviewed | All participants |
When in doubt about severity, escalate up. A SEV 2 that gets upgraded to SEV 1 after 30 minutes wastes 30 minutes. A SEV 1 that gets downgraded to SEV 2 after 5 minutes wastes 5 minutes. Always err toward higher severity.
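The update cadence in the timeline (first update at T+5, second at T+30, then hourly) is worth encoding so the communicator never has to remember it mid-incident. A minimal sketch in Python; the cadence values mirror the table above, and the function name is illustrative:

```python
from datetime import datetime, timedelta


def update_schedule(start: datetime, horizon_hours: int = 4) -> list[datetime]:
    """Due times for status updates: T+5 min, T+30 min, then hourly
    until the horizon (mirrors the incident timeline table)."""
    due = [start + timedelta(minutes=5), start + timedelta(minutes=30)]
    offset = timedelta(hours=1)
    while offset <= timedelta(hours=horizon_hours):
        due.append(start + offset)
        offset += timedelta(hours=1)
    return due
```

For a 2:47 AM page with a 2-hour horizon, this yields updates due at 2:52, 3:17, 3:47, and 4:47 — nobody has to do clock arithmetic while a service is down.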
Severity Levels
Getting severity wrong has real costs. Too low, and a serious outage simmers while one engineer casually investigates. Too high, and you wake up leadership for a broken tooltip. Use a clear rubric:
| Severity | What It Means | Expected Response | Example |
|---|---|---|---|
| SEV 1 | Customer-facing outage, data loss risk, or financial impact | All hands, war room, executive communication | Payment processing down |
| SEV 2 | Major feature degraded, affecting many users | On-call + team lead, regular status updates | Search returning stale results for 30+ minutes |
| SEV 3 | Minor feature broken, workaround exists | On-call investigates during business hours | CSV export timing out for large datasets |
| SEV 4 | Cosmetic issue, no functional impact | Ticket created, fixed in normal sprint | Button misaligned on one page |
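The rubric is mechanical enough to walk through as a checklist at T+2. A sketch of that triage logic, assuming hypothetical yes/no inputs; the thresholds come straight from the table, and the "err toward higher severity" rule shows up as the no-workaround case:

```python
def assess_severity(customer_facing_outage: bool,
                    data_loss_or_financial_risk: bool,
                    major_feature_degraded: bool,
                    workaround_exists: bool,
                    functional_impact: bool) -> int:
    """Map the severity rubric's questions to a SEV level (1 = worst).
    When in doubt between two answers, pick the higher severity."""
    if customer_facing_outage or data_loss_or_financial_risk:
        return 1
    if major_feature_degraded:
        return 2
    if functional_impact:
        # A broken feature with no workaround is treated as SEV 2,
        # per the "err toward higher severity" rule.
        return 3 if workaround_exists else 2
    return 4  # cosmetic only
```

A checklist like this is not a substitute for judgment, but it stops a sleep-deprived engineer from rationalizing a payment outage into a SEV 3.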
The Three Roles
Every incident needs exactly three roles. Not five, not “everyone jump in” — three. When roles are unclear, people either duplicate work or assume someone else is handling it.
Incident Commander (IC) — Owns the incident. Makes decisions about escalation, mitigation strategy, and when to declare resolution. The IC doesn’t debug — they coordinate. Think of them as air traffic control, not the pilot.
Communicator — Writes status updates, handles Slack messages to stakeholders, updates the status page, and fields questions from non-engineering teams. This role exists so the resolver can focus entirely on fixing things.
Resolver — The engineer(s) actually debugging and fixing the issue. They share findings in the incident channel but don’t write external communications or answer “what’s the ETA?” questions.
The communicator role is the most undervalued and most impactful. I’ve seen incidents with a dedicated communicator resolve 40% faster — not because communication fixes bugs, but because it frees the resolver from constant context-switching.
Communication Templates
Under stress, people write terrible updates. “We’re looking into it” tells stakeholders nothing. Templates solve this by removing the need to think about format when you should be thinking about the problem.
First status update (T+5 min):

```
INCIDENT: [Brief description]
Severity: SEV [1-4]
Impact: [Who is affected and how]
Status: Investigating
Commander: @[name]

We detected [what happened] at [time]. Currently investigating root cause. [Number] customers are affected. Next update in 25 minutes.
```
Progress update (T+30 min, then hourly):

```
UPDATE: [Brief description]
Status: Investigating / Identified / Mitigating / Resolved
What we know: [Current understanding]
What we're doing: [Specific mitigation steps]
Impact: [Updated scope]
ETA: [Honest estimate, or "unknown — next update in 30 min"]
```
Resolution update:

```
RESOLVED: [Brief description]
Duration: [Total time from detection to resolution]
Root cause: [One sentence]
Customer impact: [Who was affected, for how long]

A full postmortem will be published within 48 hours.
```
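Templates are most useful when they are one function call away, so the communicator fills in facts instead of formatting. A sketch that renders the T+5 first update from a small incident record; the `Incident` fields and function name are illustrative, not a real tool:

```python
from dataclasses import dataclass


@dataclass
class Incident:
    description: str        # e.g. "Payment processing down"
    severity: int           # 1-4
    impact: str             # who is affected and how
    commander: str          # Slack handle, without the @
    detected_at: str        # e.g. "02:47 UTC"
    affected_customers: int


def first_update(inc: Incident) -> str:
    """Render the T+5 first status update from the template above."""
    return (
        f"INCIDENT: {inc.description}\n"
        f"Severity: SEV {inc.severity}\n"
        f"Impact: {inc.impact}\n"
        f"Status: Investigating\n"
        f"Commander: @{inc.commander}\n\n"
        f"We detected {inc.description.lower()} at {inc.detected_at}. "
        f"Currently investigating root cause. "
        f"{inc.affected_customers} customers are affected. "
        f"Next update in 25 minutes."
    )
```

Wiring this into a slash command or bot means the first update goes out on time even when the communicator is still getting oriented.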
War Room Rules
When multiple people are in an incident channel, chaos is the default. These rules keep things focused:

- One conversation. Findings, hypotheses, and actions go in the incident channel, never in DMs or side threads.
- Timestamp what you do. "Restarted the API pods at 03:12" writes the postmortem timeline for free.
- The IC decides. Propose options freely, but once the commander picks one, execute it.
- Diagnose, don't blame. "What changed in the last hour?" belongs here; "who deployed this?" waits for the postmortem.
- Joining late? Read the channel history before asking questions that have already been answered.
The Postmortem: Structure That Actually Works
Most postmortems are useless — they describe what happened, assign vague action items like “improve monitoring,” and are never read again. Use this structure instead:
| Section | What to Include | Why It Matters |
|---|---|---|
| Summary | 2-3 sentences: what happened, who was affected, how it was resolved | Quick context for anyone reading later |
| Impact | Users affected, revenue impact, duration of customer-facing degradation | Quantifies the severity honestly |
| Timeline (UTC) | Minute-by-minute log from detection to resolution | Reveals where time was lost |
| Root Cause | Detailed technical explanation — not “the DB was slow” but the specific query, PR, and missing index | Prevents vague understanding |
| What Went Well | Fast detection, clear runbooks, quick mitigation | Reinforces good practices |
| What Didn’t Go Well | Missed in code review, no load test, delayed status update | Identifies systemic gaps |
| Action Items | Each with an owner, due date, and ticket number | The only part that prevents repeats |
“Improve monitoring” is not an action item. “Add alert for connection pool utilization above 80%, owned by @engineer, due March 15, tracked in ENG-1234” is an action item. Every action item needs an owner, a due date, and a ticket — or it won’t happen.
Building an Incident Culture
The hardest part isn’t the process — it’s the culture. Blameless postmortems sound good in theory; in practice, they require active enforcement.
- Never ask “who broke this?” Ask “what made this breakage possible?” The human who wrote the bug is the least interesting factor. The system that allowed it to reach production is what you fix.
- Celebrate fast detection and clean response. Highlight incidents that were handled well. The message isn’t “we had fewer incidents” — it’s “we responded in 4 minutes and zero customers noticed.”
- Make incident participation career-positive. If someone runs a great incident, mention it in their performance review. If on-call work is invisible, your best people will avoid it.
- Review postmortem action items monthly. Incomplete action items are the number one way incidents repeat.
The goal isn’t zero incidents — that’s impossible at scale. The goal is that every incident makes the system more resilient, and no incident happens the same way twice.