*Engineering team collaborating at a whiteboard during an incident review*

It’s 2:47 AM on a Tuesday. Your phone buzzes. A critical service is down. Real users, real impact, real consequences. What you do in the next 15 minutes determines whether this is a 20-minute incident or a 4-hour disaster. Those first 15 minutes matter not because that’s when you fix things, but because that’s when you set the trajectory. Good first 15 minutes: clear ownership, fast triage, calm communication. Bad first 15 minutes: confusion, panic, three engineers debugging the same thing while nobody talks to the customer-facing team. I’ve been on both sides, and the difference is always the process, not the people.
“The problem is never how to get new, innovative thoughts into your mind, but how to get the old ones out.” — Dee Hock (Visa founder, on why clear process beats heroics)

The First 15 Minutes: A Timeline

Here’s what should happen, minute by minute. Practice this before you need it — the first time you follow the playbook should never be during a real incident.
| Time | Action | Owner |
|---|---|---|
| T+0 | Alert fires, on-call acknowledges | On-call engineer |
| T+2 min | Severity assessment (SEV 1-4) | On-call engineer |
| T+3 min | Incident channel created, roles assigned | On-call engineer |
| T+5 min | First status update to stakeholders | Communicator |
| T+10 min | Initial hypothesis formed, first mitigation attempted | Resolver(s) |
| T+15 min | Escalation decision: do we need more people? | Incident commander |
| T+30 min | Second status update, refined diagnosis | Communicator |
| T+60 min | Hourly updates until resolution | Communicator |
| Resolution | Final update, channel archived | Commander |
| T+48 hrs | Postmortem written and reviewed | All participants |
When in doubt about severity, escalate up. The cost is asymmetric: a SEV 2 that gets upgraded to SEV 1 after 30 minutes has already lost 30 minutes of under-resourced response, while a SEV 1 that gets downgraded to SEV 2 after 5 minutes has cost only 5 minutes of extra attention. Err toward higher severity.

Severity Levels

Getting severity wrong has real costs. Too low, and a serious outage simmers while one engineer casually investigates. Too high, and you wake up leadership for a broken tooltip. Use a clear rubric:
| Severity | What It Means | Expected Response | Example |
|---|---|---|---|
| SEV 1 | Customer-facing outage, data loss risk, or financial impact | All hands, war room, executive communication | Payment processing down |
| SEV 2 | Major feature degraded, affecting many users | On-call + team lead, regular status updates | Search returning stale results for 30+ minutes |
| SEV 3 | Minor feature broken, workaround exists | On-call investigates during business hours | CSV export timing out for large datasets |
| SEV 4 | Cosmetic issue, no functional impact | Ticket created, fixed in normal sprint | Button misaligned on one page |
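The rubric is a decision tree, which makes it easy to turn into a function the on-call engineer (or an alerting pipeline) can answer in seconds. A sketch under the assumption that four yes/no questions capture the table above; real rubrics usually need more inputs:

```python
def assess_severity(customer_facing_outage: bool,
                    data_loss_or_financial_impact: bool,
                    major_feature_degraded: bool,
                    minor_feature_broken: bool) -> int:
    """Map the severity rubric to a SEV level (1 = worst).
    Checks the most severe conditions first, mirroring the table."""
    if customer_facing_outage or data_loss_or_financial_impact:
        return 1  # e.g. payment processing down
    if major_feature_degraded:
        return 2  # e.g. search returning stale results
    if minor_feature_broken:
        return 3  # workaround exists; business-hours investigation
    return 4      # cosmetic only; normal sprint work
```

Ordering matters: checking the SEV 1 conditions first encodes the "err toward higher severity" rule directly in the control flow.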

The Three Roles

Every incident needs exactly three roles. Not five, not “everyone jump in”: three. When roles are unclear, people either duplicate work or assume someone else is handling it.
  • Incident Commander (IC). Owns the incident. Makes decisions about escalation, mitigation strategy, and when to declare resolution. The IC doesn’t debug; they coordinate. Think of them as air traffic control, not the pilot.
  • Communicator. Writes status updates, handles Slack messages to stakeholders, updates the status page, and fields questions from non-engineering teams. This role exists so the resolver can focus entirely on fixing things.
  • Resolver. The engineer(s) actually debugging and fixing the issue. They share findings in the incident channel but don’t write external communications or answer “what’s the ETA?” questions.
The communicator role is the most undervalued and most impactful. I’ve seen incidents with a dedicated communicator resolve 40% faster — not because communication fixes bugs, but because it frees the resolver from constant context-switching.
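The "exactly three roles, distinct people" rule can itself be enforced at incident start, for instance by the bot that creates the incident channel. A hypothetical sketch; the `IncidentRoles` type is illustrative, not a real library:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IncidentRoles:
    commander: str          # coordinates; does not debug
    communicator: str       # writes all external updates
    resolvers: tuple        # the only people running queries and fixes

    def validate(self) -> None:
        """Reject role assignments where one person wears two hats,
        or where no one is actually fixing the problem."""
        people = {self.commander, self.communicator, *self.resolvers}
        if len(people) != 2 + len(self.resolvers):
            raise ValueError("each role needs a distinct person")
        if not self.resolvers:
            raise ValueError("at least one resolver is required")

# Names are placeholders for whoever is on call.
roles = IncidentRoles("alice", "bob", ("carol",))
roles.validate()
```

Making the check fail loudly at channel-creation time is cheaper than discovering mid-incident that the commander is also the only resolver.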

Communication Templates

Under stress, people write terrible updates. “We’re looking into it” tells stakeholders nothing. Templates solve this by removing the need to think about format when you should be thinking about the problem. First status update (T+5 min):
```
INCIDENT: [Brief description]
Severity: SEV [1-4]
Impact: [Who is affected and how]
Status: Investigating
Commander: @[name]

We detected [what happened] at [time]. Currently investigating root cause.
[Number] customers are affected. Next update in 25 minutes.
```
Progress update (T+30 min, then hourly):
```
UPDATE: [Brief description]
Status: Investigating / Identified / Mitigating / Resolved
What we know: [Current understanding]
What we’re doing: [Specific mitigation steps]
Impact: [Updated scope]
ETA: [Honest estimate, or “unknown — next update in 30 min”]
```
Resolution update:
```
RESOLVED: [Brief description]
Duration: [Total time from detection to resolution]
Root cause: [One sentence]
Customer impact: [Who was affected, for how long]

A full postmortem will be published within 48 hours.
```
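Because the templates are fill-in-the-blank, they drop straight into code, which is how you get a bot to post the first update instead of trusting a stressed human to format it. A sketch using Python's standard `string.Template`; the field values are invented examples:

```python
from string import Template

# The first status update (T+5 min) as a reusable template.
FIRST_UPDATE = Template(
    "INCIDENT: $description\n"
    "Severity: SEV $sev\n"
    "Impact: $impact\n"
    "Status: Investigating\n"
    "Commander: @$commander\n\n"
    "We detected $description at $time. Currently investigating root cause. "
    "$affected customers are affected. Next update in 25 minutes."
)

# substitute() raises KeyError if a field is missing, so an
# incomplete update never gets posted silently.
update = FIRST_UPDATE.substitute(
    description="checkout errors",
    sev=1,
    impact="EU checkout requests failing",
    commander="alice",
    time="02:47 UTC",
    affected="~1,200",
)
```

The same approach works for the progress and resolution templates; the point is that the communicator fills in six blanks rather than composing prose at 3 AM.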

War Room Rules

When multiple people are in an incident channel, chaos is the default. These rules keep things focused:
  • Prefix messages with your role. [RESOLVER] Checking database metrics vs [COMMS] Posting status page update. Prevents crosstalk.
  • No drive-by debugging. If you’re not an assigned resolver, don’t start running queries. Coordinate with the IC first.
  • Thread long investigations. The main channel should be a clean timeline. Deep dives go in threads.
  • Announce before you act. “I’m going to restart the API pods” gives the IC a chance to say “wait — let me capture metrics first.”
  • Update even when nothing has changed. Silence during an incident is terrifying. “Still investigating, no new findings” is better than 20 minutes of nothing.

The Postmortem: Structure That Actually Works

Most postmortems are useless — they describe what happened, assign vague action items like “improve monitoring,” and are never read again. Use this structure instead:
| Section | What to Include | Why It Matters |
|---|---|---|
| Summary | 2-3 sentences: what happened, who was affected, how it was resolved | Quick context for anyone reading later |
| Impact | Users affected, revenue impact, duration of customer-facing degradation | Quantifies the severity honestly |
| Timeline (UTC) | Minute-by-minute log from detection to resolution | Reveals where time was lost |
| Root Cause | Detailed technical explanation: not “the DB was slow” but the specific query, PR, and missing index | Prevents vague understanding |
| What Went Well | Fast detection, clear runbooks, quick mitigation | Reinforces good practices |
| What Didn’t Go Well | Missed in code review, no load test, delayed status update | Identifies systemic gaps |
| Action Items | Each with an owner, due date, and ticket number | The only part that prevents repeats |
“Improve monitoring” is not an action item. “Add alert for connection pool utilization above 80%, owned by @engineer, due March 15, tracked in ENG-1234” is an action item. Every action item needs an owner, a due date, and a ticket — or it won’t happen.
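The owner/date/ticket rule is mechanical enough to lint automatically, for example in the tool that publishes postmortems. A sketch assuming ISO dates and `ENG-1234`-style ticket IDs as in the example above; any other conventions would need different patterns:

```python
import re
from dataclasses import dataclass

@dataclass
class ActionItem:
    description: str
    owner: str      # e.g. "@engineer"
    due_date: str   # e.g. "2025-03-15"
    ticket: str     # e.g. "ENG-1234"

def is_trackable(item: ActionItem) -> bool:
    """True only if the item has a named owner, an ISO due date,
    and a PROJECT-123-style ticket. 'Improve monitoring' fails all three."""
    return (item.owner.startswith("@")
            and bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", item.due_date))
            and bool(re.fullmatch(r"[A-Z]+-\d+", item.ticket)))
```

Rejecting untrackable items at publish time is what turns "every action item needs an owner, a due date, and a ticket" from a guideline into a guarantee.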

Building an Incident Culture

The hardest part isn’t the process — it’s the culture. Blameless postmortems sound good in theory; in practice, they require active enforcement.
  • Never ask “who broke this?” Ask “what made this breakage possible?” The human who wrote the bug is the least interesting factor. The system that allowed it to reach production is what you fix.
  • Celebrate fast detection and clean response. Highlight incidents that were handled well. The message isn’t “we had fewer incidents” — it’s “we responded in 4 minutes and zero customers noticed.”
  • Make incident participation career-positive. If someone runs a great incident, mention it in their performance review. If on-call work is invisible, your best people will avoid it.
  • Review postmortem action items monthly. Incomplete action items are the number one way incidents repeat.
The goal isn’t zero incidents — that’s impossible at scale. The goal is that every incident makes the system more resilient, and no incident happens the same way twice.