Observability Galaxy · Primer

Before diving into playbooks and tooling, this primer explains how I treat observability inside the Dev and Productivity stacks.

1. Vocabulary

SLA (Service Level Agreement) – External promise; contractual.
SLO (Service Level Objective) – Internal target; drives engineering focus.
SLI (Service Level Indicator) – Measurable metric (latency, errors, freshness, etc.).
Error Budget – The allowed unreliability, 1 - SLO (a 99.9% SLO leaves a 0.1% budget). Spend it intentionally on launches, migrations, and experiments.
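
The error budget arithmetic above can be sketched in a few lines of Python. This is an illustration only: the function names, the 30-day window, and the event counts are assumptions, not values taken from any real dashboard.

```python
def error_budget(slo: float) -> float:
    """Fraction of requests (or time) allowed to fail under the given SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime the budget permits over the window, in minutes.

    The 30-day default is an assumed window, not one the primer specifies.
    """
    return error_budget(slo) * window_days * 24 * 60


def budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Share of the error budget still unspent (negative means overspent)."""
    if total_events == 0:
        return 1.0  # nothing measured yet, so nothing spent
    allowed_bad = error_budget(slo) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43 minutes of downtime.
    print(f"{allowed_downtime_minutes(0.999):.1f} minutes")
    # 500 failures out of 1,000,000 requests spends about half the budget.
    print(f"{budget_remaining(0.999, 999_500, 1_000_000):.0%} remaining")
```

Tracking the remaining share rather than the raw failure count makes it easier to decide whether a launch or migration can still be afforded in the current window.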

2. Telemetry Pillars

Logs – Structured JSON shipped via OpenTelemetry → Vector → Loki.
Metrics – Prometheus + CloudWatch, aggregated into Honeycomb and Grafana dashboards.
Traces – OpenTelemetry tracing, correlated with session replay for frontend contexts.
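
As a rough sketch of the "structured JSON" part of the logs pillar, the snippet below emits one JSON object per log line using only the Python standard library. It does not cover the OpenTelemetry → Vector → Loki shipping path, and the field names (`trace_id`, `service`) and helper names are assumptions chosen for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical correlation fields; pass them via `extra=` when logging.
            "trace_id": getattr(record, "trace_id", None),
            "service": getattr(record, "service", None),
        }
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    log.info("order placed", extra={"trace_id": "abc123", "service": "checkout"})
```

Carrying a trace identifier on every log line is what lets a log entry be joined to the matching trace later; how that identifier is obtained depends on the tracing setup and is left out here.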

3. Dashboard Anatomy

Executive view: SLA compliance, uptime, cost overlay.
Pod view: SLO burn-down, top incidents, regression alerts.
Feature view: specific productivity workflow or AI lab agent performance.

4. Rituals

Daily: Check burn rates via Slack digest.
Weekly: Incident narrative review (see dev/observability.mdx).
Monthly: Cost observability sync; compare spend vs. value.
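
For the daily burn-rate check, a minimal sketch of the underlying calculation might look like this. Burn rate here is the observed error rate divided by the error rate the SLO allows. The `digest_line` helper, its output format, and the `page_at` threshold are hypothetical, and posting to Slack is deliberately left out.

```python
def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    1.0 means the budget runs out exactly at the end of the SLO window;
    anything above 1.0 means it runs out early.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_events == 0:
        return 0.0  # no traffic, no burn
    return (bad_events / total_events) / (1.0 - slo)


def digest_line(service: str, rate: float, page_at: float = 10.0) -> str:
    """Format one line of a digest; `page_at` is an illustrative threshold."""
    if rate < 1.0:
        status = "ok"
    elif rate < page_at:
        status = "watch"
    else:
        status = "page"
    return f"{service}: burn rate {rate:.2f}x ({status})"


if __name__ == "__main__":
    # 20 failures in 10,000 requests against a 99.9% SLO burns at about 2x.
    rate = burn_rate(bad_events=20, total_events=10_000, slo=0.999)
    print(digest_line("checkout", rate))
```

A digest built from lines like these keeps the daily check short: services under 1.0x need no attention, and only the ones burning faster than the budget allows need a closer look.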

Use the remaining pages in this section for deeper dives into SLO/SLI craft and tooling.
