Observability Galaxy · Primer
Before diving into playbooks and tooling, this primer explains how I treat observability inside the Dev and Productivity stacks.

1. Vocabulary
- SLA (Service Level Agreement) – External promise; contractual.
- SLO (Service Level Objective) – Internal target; drives engineering focus.
- SLI (Service Level Indicator) – Measurable metric (latency, errors, freshness, etc.).
- Error Budget – 1 − SLO, the share of unreliability the SLO permits (a 99.9% SLO leaves a 0.1% budget). Spend it intentionally on launches, migrations, experiments.
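
To make the error budget arithmetic concrete, here is a minimal sketch. The function names and the 30-day window are illustrative assumptions, not part of any tool named in this primer.

```python
def error_budget(slo: float) -> float:
    """Return the error budget as a fraction: 1 - SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime the budget permits over the window.

    The 30-day default window is an assumption for illustration.
    """
    return error_budget(slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_events: int, bad_events: int) -> float:
    """Fraction of the budget still unspent (negative once overspent)."""
    allowed_bad = error_budget(slo) * total_events
    return 1.0 - bad_events / allowed_bad


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43 minutes of downtime.
    print(round(allowed_downtime_minutes(0.999), 1))
    # 250 failed requests out of 1,000,000 spends about a quarter of the budget.
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

Returning a negative value from `budget_remaining` once the budget is overspent is a deliberate choice here: it keeps the size of the overrun visible instead of clamping to zero.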
2. Telemetry Pillars
- Logs – Structured JSON shipped via OpenTelemetry → Vector → Loki.
- Metrics – Prometheus + CloudWatch, aggregated into Honeycomb and Grafana dashboards.
- Traces – OpenTelemetry tracing, correlated with session replay for frontend contexts.
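
As a rough illustration of the "structured JSON" logs pillar, the sketch below emits one JSON object per log line using only the Python standard library. The field names (`ts`, `level`, `logger`, `message`, `trace_id`) and the `build_logger` helper are assumptions for illustration; the actual schema and the OpenTelemetry/Vector wiring are not specified in this primer.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical field: a trace id passed via `extra=` so a log
            # line can be correlated with a trace. None when not supplied.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("primer.example")
    log.info("cache miss", extra={"trace_id": "abc123"})
```

Writing one JSON object per line keeps the output easy for a downstream shipper to parse without multi-line handling.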
3. Dashboard Anatomy
- Executive view: SLA compliance, uptime, cost overlay.
- Pod view: SLO burn-down, top incidents, regression alerts.
- Feature view: specific productivity workflow or AI lab agent performance.
4. Rituals
- Daily: Check burn rates via Slack digest.
- Weekly: Incident narrative review (see dev/observability.mdx).
- Monthly: Cost observability sync, comparing spend vs value.
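
The daily burn-rate check can be sketched as follows. Burn rate here is the observed error rate divided by the error budget, so 1.0 means the budget would be exactly used up by the end of the SLO window. The service name, the status wording, and the digest format are hypothetical; the actual Slack digest is not described in this primer.

```python
def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """Observed error rate divided by the error budget (1 - SLO)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0  # no traffic, nothing burned
    return (bad_events / total_events) / (1.0 - slo)


def digest_line(service: str, bad_events: int, total_events: int, slo: float) -> str:
    """Format one line of a (hypothetical) daily digest message."""
    rate = burn_rate(bad_events, total_events, slo)
    status = "over budget pace" if rate > 1.0 else "within budget pace"
    return f"{service}: burn rate {rate:.2f}x ({status})"


if __name__ == "__main__":
    # 20 failures in 10,000 requests against a 99.9% SLO burns budget
    # at about twice the sustainable pace.
    print(digest_line("checkout", 20, 10_000, 0.999))
```

A single threshold at 1.0 is the simplest possible rule; multi-window alerting with several thresholds is a common refinement but is outside what this primer states.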
