Incident Narratives & Cost Observability
Observability is more than dashboards; it’s storytelling plus cost awareness.
1. Incident Narrative Template
- Narratives live in /runbooks/incidents/[id].mdx with tags (SLO, service, severity).
- Share summaries in weekly ops review + Slack #observability channel.
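A minimal sketch of how a narrative file could be stubbed out with its tags. The helper name, the front-matter layout, and the example id and tag values are all assumptions for illustration; only the path pattern and the three tag names come from the notes above.

```python
from pathlib import PurePosixPath

# Hypothetical helper: the keys mirror the tags named above (SLO, service,
# severity); the YAML-style front-matter layout is an assumption.
def narrative_stub(incident_id: str, slo: str, service: str, severity: str) -> tuple[str, str]:
    """Return (path, front matter) for a new incident narrative."""
    path = PurePosixPath("/runbooks/incidents") / f"{incident_id}.mdx"
    front_matter = "\n".join([
        "---",
        f"id: {incident_id}",
        f"slo: {slo}",
        f"service: {service}",
        f"severity: {severity}",
        "---",
        "",
    ])
    return str(path), front_matter

# Made-up example values, not taken from a real incident.
path, stub = narrative_stub("2024-001", "checkout-latency", "checkout", "sev2")
print(path)
print(stub)
```

Keeping the tags in a machine-readable header would make it straightforward to filter narratives by SLO, service, or severity when preparing the weekly summary.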
2. Blameless Culture Checklist
- Focus on systems, not individuals.
- Assign improvement actions within 48h.
- Review old actions monthly; close or re-scope.
3. Cost Observability
- Pipe AWS/GCP billing data into Metabase + Looker Studio.
- Tag resources per product (MetaLabs, productivity suite, AI lab) + environment.
- Create a Cost SLO, e.g., “MetaLabs infra under $4.5k/month”; treat overruns like incidents.
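The Cost SLO check can be sketched as a roll-up of tagged billing rows compared against a monthly budget. This is an illustration under assumptions: the row shape, field names, and cost figures are invented, and real AWS/GCP billing exports use different column names. Only the MetaLabs budget of $4.5k/month comes from the notes.

```python
from collections import defaultdict

# Assumed shape of one month of tagged billing rows (invented figures).
rows = [
    {"product": "MetaLabs", "environment": "prod", "cost_usd": 3100.0},
    {"product": "MetaLabs", "environment": "staging", "cost_usd": 1650.0},
    {"product": "AI lab", "environment": "prod", "cost_usd": 900.0},
]

# Monthly budgets per product; only the MetaLabs figure appears in the notes.
COST_SLOS = {"MetaLabs": 4500.0}

def monthly_cost_by_product(rows):
    """Sum cost per product tag across all environments."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["product"]] += row["cost_usd"]
    return dict(totals)

def cost_slo_breaches(rows, slos):
    """Return products whose spend exceeds their budget, with the overrun."""
    totals = monthly_cost_by_product(rows)
    breaches = {}
    for product, budget in slos.items():
        spend = totals.get(product, 0.0)
        if spend > budget:
            breaches[product] = {
                "spend": spend,
                "budget": budget,
                "overrun": spend - budget,
            }
    return breaches

breaches = cost_slo_breaches(rows, COST_SLOS)
print(breaches)
```

A non-empty result is the signal to open an incident narrative for the overrun, in line with treating overruns like incidents.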
4. Burn Rate Dashboard
- 1h, 6h, 24h burn charts with thresholds.
- Slack alerts when burn rate exceeds 2x the budgeted rate.
- Drill down by customer cohort to see who is impacted.
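A rough sketch of the burn-rate calculation behind such a dashboard. It reads the 2x threshold as "consuming budget at more than twice the rate that would spend it exactly over the budget period", which is an interpretation rather than something the notes spell out. The function names, the 30-day period, and the sample numbers are assumptions; the same ratio works whether the budget is counted in errors or in dollars.

```python
WINDOWS_HOURS = (1, 6, 24)   # burn windows named in the notes
ALERT_MULTIPLIER = 2.0       # assumed reading of the 2x threshold

def burn_rate(consumed, window_hours, budget, period_hours):
    """Observed consumption rate divided by the sustainable rate.

    A value of 1.0 means the budget would be spent exactly at the end
    of the period; 2.0 means it would be gone in half the period.
    """
    sustainable_per_hour = budget / period_hours
    return (consumed / window_hours) / sustainable_per_hour

def windows_over_threshold(consumed_by_window, budget, period_hours,
                           multiplier=ALERT_MULTIPLIER):
    """Return {window_hours: burn_rate} for windows above the threshold."""
    alerts = {}
    for window in WINDOWS_HOURS:
        rate = burn_rate(consumed_by_window[window], window, budget, period_hours)
        if rate > multiplier:
            alerts[window] = rate
    return alerts

# Invented example: a budget of 720 units over a 30-day (720 h) period,
# so the sustainable rate is 1 unit per hour.
consumed = {1: 3.0, 6: 9.0, 24: 24.0}
alerts = windows_over_threshold(consumed, budget=720.0, period_hours=720.0)
print(alerts)
```

Posting to Slack is left out here; the returned mapping is what an alerting hook would consume. Running the same calculation per customer cohort would give the drill-down view.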
5. Toolchain
- OpenTelemetry Collector → Honeycomb traces.
- Prometheus + Grafana for metrics; exported to Datadog for exec view.
- CloudZero (or custom scripts) for FinOps automation.
