
Incident Narratives & Cost Observability

Observability is more than dashboards; it’s storytelling plus cost awareness.

1. Incident Narrative Template

### Context
- Date/time, impacted services, trigger.

### Customer Impact
- Who felt it? What symptoms?

### Timeline
- Detection → Mitigation → Resolution.

### Contributing Factors
- Technical, process, people.

### What Went Well
- Automation, observability, comms wins.

### Improvement Actions
- Owners + due dates + follow-up links.

  • Narratives live in /runbooks/incidents/[id].mdx with tags (SLO, service, severity).
  • Share summaries in the weekly ops review and in the Slack #observability channel.
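
As a minimal sketch of how a new narrative file could be scaffolded from the template above: the section titles and the `/runbooks/incidents/[id].mdx` path come from this doc, while the frontmatter layout, the function names, and the example incident ID are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Section titles and prompts copied from the narrative template.
SECTIONS = [
    ("Context", "Date/time, impacted services, trigger."),
    ("Customer Impact", "Who felt it? What symptoms?"),
    ("Timeline", "Detection → Mitigation → Resolution."),
    ("Contributing Factors", "Technical, process, people."),
    ("What Went Well", "Automation, observability, comms wins."),
    ("Improvement Actions", "Owners + due dates + follow-up links."),
]


def narrative_path(incident_id: str) -> PurePosixPath:
    """Location of the narrative file for one incident."""
    return PurePosixPath("/runbooks/incidents") / f"{incident_id}.mdx"


def render_narrative(incident_id: str, slo: str, service: str, severity: str) -> str:
    """Build the file text: frontmatter tags followed by empty template sections.

    The frontmatter keys are hypothetical; adapt them to whatever the
    runbook site actually parses.
    """
    lines = [
        "---",
        f"id: {incident_id}",
        "tags:",
        f"  slo: {slo}",
        f"  service: {service}",
        f"  severity: {severity}",
        "---",
        "",
    ]
    for title, prompt in SECTIONS:
        lines += [f"### {title}", f"- {prompt}", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(narrative_path("INC-123"))
    print(render_narrative("INC-123", "checkout-latency", "checkout", "sev2"))
```

Keeping the tags in one structured header makes it easier to filter narratives by SLO, service, or severity later on.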

2. Blameless Culture Checklist

  • Focus on systems, not individuals.
  • Assign improvement actions within 48h.
  • Review old actions monthly; close or re-scope.
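
The two time-based items in this checklist can be checked mechanically. The sketch below is one possible approach, assuming the 48h clock starts when the incident is resolved and that "monthly" means 30 days; the `Action` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

ASSIGNMENT_WINDOW = timedelta(hours=48)  # from the checklist
REVIEW_AGE = timedelta(days=30)          # assumption: "monthly" = 30 days


@dataclass
class Action:
    """Hypothetical record for one improvement action."""
    title: str
    incident_resolved_at: datetime
    assigned_at: Optional[datetime] = None
    closed_at: Optional[datetime] = None


def missed_assignment_window(action: Action, now: datetime) -> bool:
    """True if the action was not assigned within 48h of resolution."""
    deadline = action.incident_resolved_at + ASSIGNMENT_WINDOW
    if action.assigned_at is None:
        return now > deadline
    return action.assigned_at > deadline


def due_for_monthly_review(action: Action, now: datetime) -> bool:
    """True if the action is still open and at least 30 days old."""
    return action.closed_at is None and now - action.incident_resolved_at >= REVIEW_AGE
```

A weekly job could run both checks over the open actions and post the results to the ops review.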

3. Cost Observability

  • Pipe AWS/GCP billing data into Metabase + Looker Studio.
  • Tag resources per product (MetaLabs, productivity suite, AI lab) + environment.
  • Create a cost SLO, e.g., “MetaLabs infra under $4.5k/month”, and treat overruns like incidents.
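
A cost SLO like the one above can be evaluated from month-to-date spend. The following is a rough sketch, not a definitive implementation: the $4,500 budget comes from the example above, while the linear month-end projection, the status labels, and the function names are assumptions.

```python
import calendar
from datetime import date

MONTHLY_BUDGET_USD = 4500.0  # from the example cost SLO


def projected_month_spend(spend_to_date: float, today: date) -> float:
    """Project month-end spend by extending the average daily spend so far.

    Assumes spend accrues roughly evenly across the month.
    """
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def cost_slo_status(spend_to_date: float, today: date,
                    monthly_budget: float = MONTHLY_BUDGET_USD) -> str:
    """Return 'breached', 'at_risk', or 'ok' (labels are illustrative)."""
    if spend_to_date > monthly_budget:
        return "breached"
    if projected_month_spend(spend_to_date, today) > monthly_budget:
        return "at_risk"
    return "ok"
```

An "at_risk" result gives the team time to react before the overrun is real, which fits the idea of treating overruns like incidents.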

4. Burn Rate Dashboard

  • 1h, 6h, 24h burn charts with thresholds.
  • Slack alerts when the burn rate exceeds 2x the budgeted rate.
  • Drill down by customer cohort to see who is impacted.
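
The burn thresholds above can be expressed as a ratio of what was consumed in a window to what the budget allows for that window. This sketch assumes a 30-day budget period and treats the budget as a generic quantity (error-budget events or dollars); the function names and the input shape are hypothetical.

```python
from typing import List, Mapping

WINDOWS_HOURS = (1, 6, 24)  # chart windows from the list above
ALERT_MULTIPLE = 2.0        # alert when burn exceeds 2x budget
PERIOD_HOURS = 30 * 24      # assumption: budget is defined over 30 days


def burn_rate(consumed_in_window: float, period_budget: float,
              window_hours: int, period_hours: int = PERIOD_HOURS) -> float:
    """Ratio of actual consumption to the budget's share for this window.

    1.0 means the budget is being spent exactly on pace.
    """
    budget_for_window = period_budget * window_hours / period_hours
    return consumed_in_window / budget_for_window


def windows_over_threshold(consumed_by_window: Mapping[int, float],
                           period_budget: float) -> List[int]:
    """Return the windows (in hours) whose burn rate exceeds the alert multiple."""
    return [
        hours for hours in WINDOWS_HOURS
        if burn_rate(consumed_by_window[hours], period_budget, hours) > ALERT_MULTIPLE
    ]
```

For example, with a period budget of 720 units (one unit per hour), consuming 3 units in the last hour is a burn rate of 3.0 and would trigger the 1h alert, while 9 units over 6h is 1.5 and would not.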

5. Toolchain

  • OpenTelemetry Collector → Honeycomb traces.
  • Prometheus + Grafana for metrics; exported to Datadog for exec view.
  • CloudZero (or custom scripts) for FinOps automation.

Keep this doc handy during retro meetings or when you need to justify reliability investments.