Every incident is a system failure, not a person failure. The team that understands this builds better software than the team that doesn’t.
I’ve run incident reviews at Atlassian, at Weel, and across my own products. The ones that improved our systems shared a structure. The ones that felt good but changed nothing usually lacked it.
Why Incidents Are Valuable (If You Use Them Right)
An incident is the most information-dense event your system can produce. In a few hours, you discover:
- Where your monitoring has blind spots
- Which assumptions in your architecture were wrong
- Which runbooks are incomplete
- How your communication patterns break under pressure
A blameless postmortem is the mechanism for extracting that value. Without a deliberate process, incidents either produce blame (which stops people from being honest about what happened) or they produce nothing (a brief moment of “that was bad” before everyone moves on).
The Incident Narrative Template
Every significant incident gets a written narrative within 48 hours. Here’s the template I use:
## Incident: [Service] [Brief description]
**Date:** YYYY-MM-DD
**Severity:** P1 / P2 / P3
**Duration:** X hours Y minutes
**Author:** [Name]
**Status:** Resolved / Monitoring
---
### Summary
One paragraph. What happened, who was affected, when it was resolved.
### Customer Impact
- **Who felt it:** Which users/segments were affected
- **What they experienced:** Error messages, degraded performance, data loss
- **Scale:** Estimated affected users and duration
### Timeline
| Time (UTC) | Event |
|---|---|
| HH:MM | First alert fired / First user report |
| HH:MM | On-call engineer paged |
| HH:MM | Incident declared, war room opened |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | System restored to normal |
| HH:MM | Incident closed |
### Root Cause
Technical explanation of what failed and why. Be specific.
"The database connection pool exhausted because..." not "there was a database issue."
### Contributing Factors
What made this possible? Common ones:
- Lack of monitoring / alerting gaps
- Runbook was out of date
- Deploy process didn't have adequate safeguards
- Insufficient load testing
- Technical debt in the affected component
### What Went Well
Things that worked: automation that contained the damage, observability that surfaced the cause quickly, communication that stayed clear. This matters — reinforce what worked.
### Action Items
| Item | Owner | Due | Priority |
|---|---|---|---|
| Add alert for connection pool usage | @engineer | 2 weeks | High |
| Update runbook for DB connectivity issues | @engineer | 1 week | High |
| Load test with 2x traffic before next major deploy | @lead | Next release | Medium |
The action items table is the most important part. Without it, the postmortem is an interesting story that changes nothing. With it, it’s a forcing function for improvement.
Running the Blameless Review
The review meeting is not a debrief session — it’s a design session. The question isn’t “what went wrong?” but “what does our system need to make this impossible or detectable earlier?”
The Blameless Principle in Practice
Blameless doesn’t mean no accountability. It means accountability for systems, not people.
Blame framing: “The engineer on-call took 40 minutes to respond.”
Blameless framing: “Our paging policy doesn’t have escalation if the primary on-call doesn’t acknowledge within 15 minutes. This incident revealed that gap.”
The difference: the first shuts down honest conversation and creates fear. The second produces an actionable improvement that would have helped regardless of who was on-call.
Review Meeting Structure (60 minutes)
0-5 min: Facilitator reads the summary
5-20 min: Walk the timeline together — add detail, correct errors
20-35 min: Root cause discussion — "why was this possible?"
35-50 min: Action item review — challenge each item: specific? owned? testable?
50-60 min: What went well? Explicit acknowledgment of good work
Rules:
- No assigning blame to individuals in the meeting
- Every “we should have known” becomes “what monitoring would have caught this?”
- Every “someone should have…” becomes “what process makes this automatic?”
Cost Observability: Treating Spend Like an SLO
The most underused observability practice: treating infrastructure cost as a first-class reliability concern.
I set Cost SLOs for every product:
- MetaLabs infrastructure: under $4,500/month
- PromptLib API costs: under $200/month
- Weel AI features: under $0.002 per request
When cost exceeds the SLO, I treat it like a reliability incident:
- Alert fires
- Investigation into what changed
- Mitigation applied (caching, model downgrade, query optimization)
- Postmortem written
This reframes cost from “finance problem” to “engineering problem.” Engineers respond to alerts. They don’t respond to monthly billing reports.
Setting Up Cost Alerts
# AWS Cost Anomaly Detection with SNS notification
# (Terraform; resource names are examples)
resource "aws_ce_anomaly_monitor" "product_monitor" {
  name              = "product-cost-monitor"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

resource "aws_ce_anomaly_subscription" "alerts" {
  name             = "cost-spike-alert"
  frequency        = "IMMEDIATE"
  monitor_arn_list = [aws_ce_anomaly_monitor.product_monitor.arn]

  subscriber {
    type    = "SNS"
    address = aws_sns_topic.alerts.arn
  }

  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_PERCENTAGE"
      values        = ["20"] # Alert when the anomaly's impact is >= 20% of expected spend
      match_options = ["GREATER_THAN_OR_EQUAL"]
    }
  }
}
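The anomaly monitor catches account-level spikes. For per-request cost SLOs like the $0.002/request target above, the tracking has to live in the application. A minimal sketch — the class, threshold constant, and cost figures are illustrative, not an existing library:

```typescript
// Per-request cost guard: accumulate estimated cost per request and flag
// when the rolling average breaches the SLO. All names are illustrative.
const COST_SLO_PER_REQUEST = 0.002; // dollars, matching the SLO above

class CostTracker {
  private totalCost = 0;
  private requestCount = 0;

  // Call once per request with the estimated cost of that request.
  record(estimatedCostDollars: number): void {
    this.totalCost += estimatedCostDollars;
    this.requestCount += 1;
  }

  averageCostPerRequest(): number {
    return this.requestCount === 0 ? 0 : this.totalCost / this.requestCount;
  }

  breachesSlo(): boolean {
    return this.averageCostPerRequest() > COST_SLO_PER_REQUEST;
  }
}

// Usage: record each call's estimated cost, check after each batch.
const tracker = new CostTracker();
tracker.record(0.0015);
tracker.record(0.0031); // one expensive request
console.log(tracker.averageCostPerRequest().toFixed(4)); // "0.0023"
console.log(tracker.breachesSlo()); // true
```

When this breaches, it feeds the same incident loop as any other alert: investigate what changed, mitigate, write the postmortem.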
Error Budget: The Decision Framework
The error budget isn’t just a metric — it’s a decision-making tool. The budget is the unreliability your SLO permits: a 99.9% availability SLO over 30 days allows roughly 43 minutes of downtime. How much of that remains tells you how much risk you can afford to take.
This gives engineers a clear framework: “We have 35% of our error budget left this month. We can ship the new payment flow but should skip the database migration until next month.”
Without error budgets, reliability is a constant negotiation. With them, it’s a number.
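The arithmetic behind that framework is short. A sketch — the SLO target, window, and downtime figure are example values:

```typescript
// Error budget from an availability SLO over a 30-day window.
const SLO_TARGET = 0.999;            // 99.9% availability
const WINDOW_MINUTES = 30 * 24 * 60; // 43,200 minutes in 30 days

// Total budget: the downtime the SLO permits in the window.
const budgetMinutes = (1 - SLO_TARGET) * WINDOW_MINUTES;

// Remaining budget after incidents have consumed some of it.
const downtimeSoFarMinutes = 28; // example: incidents so far this month
const remainingFraction = 1 - downtimeSoFarMinutes / budgetMinutes;

console.log(budgetMinutes.toFixed(1));                   // "43.2"
console.log((remainingFraction * 100).toFixed(0) + "%"); // "35%"
```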
Burn Rate Alerts
Error budget burn rate tells you how fast you’re consuming your monthly budget. High burn rate now means outage later.
| Burn rate | What it means | Response |
|---|---|---|
| 1x | Normal — on pace to consume 100% by month end | No action |
| 2x | Elevated — will exhaust budget halfway through month | Investigate |
| 5x | High — will exhaust budget in 6 days | Page on-call |
| 14x+ | Critical — will exhaust budget in ~2 days | Emergency response |
// Burn rate = fraction of error budget consumed / fraction of window elapsed
const burnRate = errorBudgetConsumed / elapsedTimeRatio;

// Alert thresholds — check highest first so only one alert fires
if (burnRate > 14) {
  page({ severity: 'critical', message: `Burn rate: ${burnRate}x` });
} else if (burnRate > 5) {
  page({ severity: 'high', message: `Burn rate: ${burnRate}x` });
} else if (burnRate > 2) {
  notify({ channel: '#reliability', message: `Burn rate elevated: ${burnRate}x` });
}
The Reliability Ritual
A reliable system isn’t built from heroics — it’s built from consistent practice.
Daily: Check error budget burn rate (automated Slack digest). Takes 30 seconds.
Weekly: Review incidents from the past week. Are action items on track? Any recurring patterns?
Monthly: Review SLO targets. Are they still meaningful? Has user expectation shifted? Are targets too tight (draining budget constantly) or too loose (not measuring real user pain)?
Per incident: Write the narrative, run the review, create action items, and follow up within 2 weeks.
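The daily digest above is a few lines of automation. A sketch assuming a Slack incoming webhook — the webhook URL, `fetchBurnRate`, and the threshold wording are placeholders for your own setup, not a prescribed implementation:

```typescript
// Daily burn-rate digest posted to Slack via an incoming webhook.
// SLACK_WEBHOOK_URL and fetchBurnRate() are placeholders.
const SLACK_WEBHOOK_URL = process.env.SLACK_WEBHOOK_URL ?? "";

async function fetchBurnRate(): Promise<number> {
  // Placeholder: query your metrics backend (e.g. Prometheus) here.
  return 1.3;
}

function formatDigest(burnRate: number): string {
  const status =
    burnRate <= 1 ? "on pace" : burnRate <= 2 ? "elevated" : "high";
  return `Error budget burn rate: ${burnRate.toFixed(1)}x (${status})`;
}

async function postDailyDigest(): Promise<void> {
  const message = formatDigest(await fetchBurnRate());
  // Slack incoming webhooks accept a JSON body with a "text" field.
  await fetch(SLACK_WEBHOOK_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: message }),
  });
}
```

Schedule it with whatever you already have — cron, a scheduled Lambda, a CI job — so the 30-second check never depends on someone remembering.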
The teams I’ve seen do this consistently ship faster than the teams that don’t, because steady practice builds a system that stops interrupting them with incidents.