Every incident is a system failure, not a person failure. The team that understands this builds better software than the team that doesn’t.
I’ve run incident reviews at Atlassian, at Weel, and across my own products. The ones that improved our systems shared a structure. The ones that felt good but changed nothing usually lacked it.
Why Incidents Are Valuable (If You Use Them Right)
An incident is the most information-dense event your system can produce. In a few hours, you discover:
- Where your monitoring has blind spots
- Which assumptions in your architecture were wrong
- Which runbooks are incomplete
- How your communication patterns break under pressure
A blameless postmortem is the mechanism for extracting that value. Without a deliberate process, incidents either produce blame (which stops people from being honest about what happened) or they produce nothing (a brief moment of “that was bad” before everyone moves on).
The Incident Narrative Template
Every significant incident gets a written narrative within 48 hours. Here’s the template I use:
## Incident: [Service] [Brief description]
**Date:** YYYY-MM-DD
**Severity:** P1 / P2 / P3
**Duration:** X hours Y minutes
**Author:** [Name]
**Status:** Resolved / Monitoring
---
### Summary
One paragraph. What happened, who was affected, when it was resolved.
### Customer Impact
- **Who felt it:** Which users/segments were affected
- **What they experienced:** Error messages, degraded performance, data loss
- **Scale:** Estimated affected users and duration
### Timeline
| Time (UTC) | Event |
|---|---|
| HH:MM | First alert fired / First user report |
| HH:MM | On-call engineer paged |
| HH:MM | Incident declared, war room opened |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | System restored to normal |
| HH:MM | Incident closed |
### Root Cause
Technical explanation of what failed and why. Be specific.
"The database connection pool exhausted because..." not "there was a database issue."
### Contributing Factors
What made this possible? Common ones:
- Lack of monitoring / alerting gaps
- Runbook was out of date
- Deploy process didn't have adequate safeguards
- Insufficient load testing
- Technical debt in the affected component
### What Went Well
Things that worked: automation that contained the damage, observability that surfaced the cause quickly, communication that stayed clear. This matters — reinforce what worked.
### Action Items
| Item | Owner | Due | Priority |
|---|---|---|---|
| Add alert for connection pool usage | @engineer | 2 weeks | High |
| Update runbook for DB connectivity issues | @engineer | 1 week | High |
| Load test with 2x traffic before next major deploy | @lead | Next release | Medium |
The action items table is the most important part. Without it, the postmortem is an interesting story that changes nothing. With it, it’s a forcing function for improvement.
Running the Blameless Review
The review meeting is not a debrief session — it’s a design session. The question isn’t “what went wrong?” but “what does our system need to make this impossible or detectable earlier?”
The Blameless Principle in Practice
Blameless doesn’t mean no accountability. It means accountability for systems, not people.
Blame framing: “The engineer on-call took 40 minutes to respond.”
Blameless framing: “Our paging policy doesn’t have escalation if the primary on-call doesn’t acknowledge within 15 minutes. This incident revealed that gap.”
The difference: the first shuts down honest conversation and creates fear. The second produces an actionable improvement that would have helped regardless of who was on-call.
Review Meeting Structure (60 minutes)
0-5 min: Facilitator reads the summary
5-20 min: Walk the timeline together — add detail, correct errors
20-35 min: Root cause discussion — "why was this possible?"
35-50 min: Action item review — challenge each item: specific? owned? testable?
50-60 min: What went well? Explicit acknowledgment of good work
Rules:
- No assigning blame to individuals in the meeting
- Every “we should have known” becomes “what monitoring would have caught this?”
- Every “someone should have…” becomes “what process makes this automatic?”
Cost Observability: Treating Spend Like an SLO
The most underused observability practice: treating infrastructure cost as a first-class reliability concern.
I set Cost SLOs for every product:
- MetaLabs infrastructure: under $4,500/month
- PromptLib API costs: under $200/month
- Weel AI features: under $0.002 per request
When cost exceeds the SLO, I treat it like a reliability incident:
- Alert fires
- Investigation into what changed
- Mitigation applied (caching, model downgrade, query optimization)
- Postmortem written
This reframes cost from “finance problem” to “engineering problem.” Engineers respond to alerts. They don’t respond to monthly billing reports.
Setting Up Cost Alerts
# AWS Cost Anomaly Detection with SNS notification
# (Terraform; resource names are examples)
resource "aws_ce_anomaly_monitor" "product_monitor" {
  name              = "product-cost-monitor"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

resource "aws_ce_anomaly_subscription" "alerts" {
  name             = "cost-spike-alert"
  frequency        = "IMMEDIATE"
  monitor_arn_list = [aws_ce_anomaly_monitor.product_monitor.arn]

  subscriber {
    type    = "SNS"
    address = aws_sns_topic.alerts.arn
  }

  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_PERCENTAGE"
      values        = ["20"] # Alert when the anomaly's impact is >= 20% of expected spend
      match_options = ["GREATER_THAN_OR_EQUAL"]
    }
  }
}
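The anomaly monitor catches account-level spikes. For per-request cost SLOs like the $0.002/request target above, the tracking has to live in the application. A minimal sketch — the class, threshold constant, and cost figures are illustrative, not an existing library:

```typescript
// Per-request cost guard: accumulate estimated cost per request and flag
// when the rolling average breaches the SLO. All names are illustrative.
const COST_SLO_PER_REQUEST = 0.002; // dollars, matching the SLO above

class CostTracker {
  private totalCost = 0;
  private requestCount = 0;

  // Call once per request with the estimated cost of that request.
  record(estimatedCostDollars: number): void {
    this.totalCost += estimatedCostDollars;
    this.requestCount += 1;
  }

  averageCostPerRequest(): number {
    return this.requestCount === 0 ? 0 : this.totalCost / this.requestCount;
  }

  breachesSlo(): boolean {
    return this.averageCostPerRequest() > COST_SLO_PER_REQUEST;
  }
}

// Usage: record each call's estimated cost, check after each batch.
const tracker = new CostTracker();
tracker.record(0.0015);
tracker.record(0.0031); // one expensive request
console.log(tracker.averageCostPerRequest().toFixed(4)); // "0.0023"
console.log(tracker.breachesSlo()); // true
```

When this breaches, it feeds the same incident loop as any other alert: investigate what changed, mitigate, write the postmortem.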
Error Budget: The Decision Framework
The error budget isn’t just a metric — it’s a decision-making tool. The budget is the unreliability your SLO permits: a 99.9% availability SLO over 30 days allows roughly 43 minutes of downtime. How much of that remains tells you how much risk you can afford to take.
This gives engineers a clear framework: “We have 35% of our error budget left this month. We can ship the new payment flow but should skip the database migration until next month.”
Without error budgets, reliability is a constant negotiation. With them, it’s a number.
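The arithmetic behind that framework is short. A sketch — the SLO target, window, and downtime figure are example values:

```typescript
// Error budget from an availability SLO over a 30-day window.
const SLO_TARGET = 0.999;            // 99.9% availability
const WINDOW_MINUTES = 30 * 24 * 60; // 43,200 minutes in 30 days

// Total budget: the downtime the SLO permits in the window.
const budgetMinutes = (1 - SLO_TARGET) * WINDOW_MINUTES;

// Remaining budget after incidents have consumed some of it.
const downtimeSoFarMinutes = 28; // example: incidents so far this month
const remainingFraction = 1 - downtimeSoFarMinutes / budgetMinutes;

console.log(budgetMinutes.toFixed(1));                   // "43.2"
console.log((remainingFraction * 100).toFixed(0) + "%"); // "35%"
```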
Burn Rate Alerts
Error budget burn rate tells you how fast you’re consuming your monthly budget. High burn rate now means outage later.
| Burn rate | What it means | Response |
|---|---|---|
| 1x | Normal — on pace to consume 100% by month end | No action |
| 2x | Elevated — will exhaust budget halfway through month | Investigate |
| 5x | High — will exhaust budget in 6 days | Page on-call |
| 14x+ | Critical — will exhaust budget in ~2 days | Emergency response |
// Burn rate = fraction of error budget consumed / fraction of window elapsed
const burnRate = errorBudgetConsumed / elapsedTimeRatio;

// Alert thresholds — check highest first so only one alert fires
if (burnRate > 14) {
  page({ severity: 'critical', message: `Burn rate: ${burnRate}x` });
} else if (burnRate > 5) {
  page({ severity: 'high', message: `Burn rate: ${burnRate}x` });
} else if (burnRate > 2) {
  notify({ channel: '#reliability', message: `Burn rate elevated: ${burnRate}x` });
}
The Reliability Ritual
A reliable system isn’t built from heroics — it’s built from consistent practice.
Daily: Check error budget burn rate (automated Slack digest). Takes 30 seconds.
Weekly: Review incidents from the past week. Are action items on track? Any recurring patterns?
Monthly: Review SLO targets. Are they still meaningful? Has user expectation shifted? Are targets too tight (draining budget constantly) or too loose (not measuring real user pain)?
Per incident: Write the narrative, run the review, create action items, and follow up within 2 weeks.
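The daily digest above is a few lines of automation. A sketch assuming a Slack incoming webhook — the webhook URL, `fetchBurnRate`, and the threshold wording are placeholders for your own setup, not a prescribed implementation:

```typescript
// Daily burn-rate digest posted to Slack via an incoming webhook.
// SLACK_WEBHOOK_URL and fetchBurnRate() are placeholders.
const SLACK_WEBHOOK_URL = process.env.SLACK_WEBHOOK_URL ?? "";

async function fetchBurnRate(): Promise<number> {
  // Placeholder: query your metrics backend (e.g. Prometheus) here.
  return 1.3;
}

function formatDigest(burnRate: number): string {
  const status =
    burnRate <= 1 ? "on pace" : burnRate <= 2 ? "elevated" : "high";
  return `Error budget burn rate: ${burnRate.toFixed(1)}x (${status})`;
}

async function postDailyDigest(): Promise<void> {
  const message = formatDigest(await fetchBurnRate());
  // Slack incoming webhooks accept a JSON body with a "text" field.
  await fetch(SLACK_WEBHOOK_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: message }),
  });
}
```

Schedule it with whatever you already have — cron, a scheduled Lambda, a CI job — so the 30-second check never depends on someone remembering.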
The teams I’ve seen do this consistently ship faster than the teams that don’t, because steady practice builds a system that stops interrupting them with incidents.