Observability Galaxy · Primer

Before diving into playbooks and tooling, this primer explains how I treat observability inside the Dev and Productivity stacks.

1. Vocabulary

  • SLA (Service Level Agreement) – External promise; contractual.
  • SLO (Service Level Objective) – Internal target; drives engineering focus.
  • SLI (Service Level Indicator) – Measurable metric (latency, errors, freshness, etc.).
  • Error Budget – The allowed unreliability over the SLO window: 1 - SLO (a 99.9% SLO leaves a 0.1% budget). Spend it intentionally on launches, migrations, experiments.
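
To make the error budget arithmetic concrete, here is a minimal sketch that turns an SLO into a budget and into allowed downtime. The function names and the 30-day window are assumptions for illustration; the primer itself does not fix a window.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget(slo: float) -> float:
    """Return the error budget as a fraction: 1 - SLO."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1], e.g. 0.999")
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Translate the budget into minutes of full downtime over the window.

    The 30-day default is an assumed window, not one taken from the primer.
    """
    return error_budget(slo) * window_days * MINUTES_PER_DAY


if __name__ == "__main__":
    # A 99.9% availability SLO over 30 days.
    print(f"{allowed_downtime_minutes(0.999):.1f} minutes")  # 43.2 minutes
```

Expressing the budget in minutes (or in failed requests) tends to make the "spend it intentionally" conversation easier than quoting a percentage.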

2. Telemetry Pillars

  1. Logs – Structured JSON shipped via OpenTelemetry → Vector → Loki.
  2. Metrics – Prometheus + CloudWatch, aggregated into Honeycomb and Grafana dashboards.
  3. Traces – OpenTelemetry tracing, correlated with session replay for frontend contexts.
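
As a rough illustration of what "structured JSON" logs can look like before they enter the pipeline, the sketch below uses only Python's standard `logging` module. The field names (`timestamp`, `level`, `logger`, `message`, `trace_id`) and the `checkout` logger are assumptions, not the schema this stack uses, and the shipping step (OpenTelemetry → Vector → Loki) is out of scope here.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Emit timestamps in UTC rather than local time.
    converter = time.gmtime

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical field: lets a log line be joined to a trace.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("checkout")  # hypothetical service name
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    # `extra` attaches the trace id to the record as an attribute.
    log.info("order placed", extra={"trace_id": "abc123"})
```

Carrying a trace identifier on every log line is what makes it possible to jump from a log entry to the matching trace; the exact key name would follow whatever convention the collector expects.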

3. Dashboard Anatomy

  • Executive view: SLA compliance, uptime, cost overlay.
  • Pod view: SLO burn-down, top incidents, regression alerts.
  • Feature view: specific productivity workflow or AI lab agent performance.

4. Rituals

  • Daily: Check burn rates via Slack digest.
  • Weekly: Incident narrative review (see dev/observability.mdx).
  • Monthly: Cost observability sync: compare spend vs. value.
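
The daily burn-rate check can be sketched as follows. This assumes the common definition of burn rate (observed error rate divided by the error budget, so 1.0 means the budget is being spent exactly on pace for the window); the primer does not spell out its own formula. The digest wording, function names, and service name are hypothetical.

```python
def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """Observed error rate divided by the error budget (1 - SLO)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    budget = 1.0 - slo
    if budget <= 0.0:
        raise ValueError("an SLO of 100% leaves no budget to burn")
    observed_error_rate = bad_events / total_events
    return observed_error_rate / budget


def digest_line(service: str, rate: float) -> str:
    """Format one line of a (hypothetical) daily digest message."""
    status = "over budget pace" if rate > 1.0 else "within budget pace"
    return f"{service}: burn rate {rate:.1f}x ({status})"


if __name__ == "__main__":
    # 50 failed requests out of 10,000 against a 99.9% SLO.
    rate = burn_rate(bad_events=50, total_events=10_000, slo=0.999)
    print(digest_line("checkout", rate))
```

A rate well above 1.0 sustained over the day means the budget would run out before the window ends, which is usually the signal to pause risky launches.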

Use the remaining pages in this section for deeper dives into SLO/SLI craft and tooling.