The Three Pillars
Observability is built on three types of telemetry. Understanding each, and how they complement each other, is the foundation.

Logs: What Happened
Logs are the narrative of your system — timestamped records of individual events. Structured logs beat unstructured logs in every way: they are machine-parseable, queryable, and consistent across services.

Metrics: How Much / How Fast
Metrics are numerical measurements collected over time. Unlike logs, they’re aggregated — you don’t store every individual value; you store counts, sums, and distributions. The metrics worth tracking for most services:

| Metric | What it tells you | How to measure |
|---|---|---|
| Request rate | Traffic volume and patterns | Counter: increment per request |
| Error rate | System health | Counter: increment per 4xx/5xx |
| Latency (p50, p95, p99) | User experience | Histogram: record duration per request |
| Saturation | How close to limit | Gauge: CPU %, queue depth, connection pool usage |
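The three instrument types in the table can be sketched in a few lines of plain Python. This is a minimal, illustrative sketch: a real service would use a metrics library, and the class and method names here are invented for the example.

```python
import bisect

class Counter:
    """Monotonically increasing count (request rate, error rate)."""
    def __init__(self):
        self.value = 0
    def inc(self, n=1):
        self.value += n

class Gauge:
    """Point-in-time value that can go up or down (saturation)."""
    def __init__(self):
        self.value = 0.0
    def set(self, v):
        self.value = v

class Histogram:
    """Keeps observations sorted so percentiles can be read off (latency)."""
    def __init__(self):
        self.samples = []
    def observe(self, v):
        bisect.insort(self.samples, v)
    def percentile(self, p):
        # Nearest-rank percentile over the sorted samples.
        idx = max(0, round(p / 100 * len(self.samples)) - 1)
        return self.samples[idx]

requests = Counter()
errors = Counter()
latency = Histogram()
saturation = Gauge()

for ms, status in [(120, 200), (95, 200), (310, 500), (140, 200)]:
    requests.inc()
    if status >= 400:
        errors.inc()
    latency.observe(ms)
saturation.set(0.72)  # e.g. connection-pool usage
```

After the loop, `requests.value` is 4, `errors.value` is 1, and `latency.percentile(95)` returns the slow 310ms outlier — exactly the signals the table asks for.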
Traces: Why It Happened
Traces follow a request through every service and function call it touches — a causal chain from the user’s action to the database query and back. Without traces, you know the request took 267ms. With traces, you see the payment service took 145ms and has high variance — that’s where to look.

The Vocabulary That Matters
These terms get confused constantly. Here’s the precise definition of each:

| Term | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | A measurable signal about service quality | “p95 API latency = 342ms” |
| SLO (Service Level Objective) | Your internal target for an SLI | “p95 latency < 500ms for 99% of requests” |
| SLA (Service Level Agreement) | An external contractual promise, often with financial consequences | “99.9% uptime guarantee in customer contract” |
| Error Budget | How much you’re allowed to fail: 1 - SLO | If SLO is 99.9%, error budget is 0.1% downtime per month (~43 minutes) |
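The error-budget arithmetic in the last row is worth working through once. A short sketch (the 30-day window is an assumption; pick whatever window your SLO uses):

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime per window for an availability SLO."""
    budget_fraction = 1.0 - slo  # e.g. 1 - 0.999 = 0.001
    return budget_fraction * days * 24 * 60

# A 99.9% SLO over 30 days leaves about 43 minutes of downtime.
print(round(error_budget_minutes(0.999)))
```

Dropping one “nine” multiplies the budget by ten: a 99% SLO allows roughly 432 minutes, which is why each extra nine is dramatically more expensive to operate.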
The Four Golden Signals
Google’s SRE book introduced the four golden signals — the minimum you need to tell if a service is healthy:

- Latency: how long requests take
- Traffic: how much demand the service is handling
- Errors: the rate of requests that fail
- Saturation: how close the service is to its capacity limit

If you can only instrument four things, instrument these four. If a service has healthy golden signals, it’s almost certainly working. If the golden signals look bad, you know where to start.

OpenTelemetry: The Standard Worth Adopting
OpenTelemetry (OTel) is the vendor-neutral standard for observability instrumentation. It matters because it lets you:

- Instrument your code once
- Send telemetry to any backend (Honeycomb, Datadog, Grafana, etc.)
- Switch backends without changing application code
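The “instrument once, switch backends freely” idea is easiest to see in a deliberately simplified sketch. This is not the real OpenTelemetry API — the `Tracer` and exporter names below are invented to show the shape of the pattern:

```python
import time
from contextlib import contextmanager

class ConsoleExporter:
    """Stand-in backend: prints finished spans. Swapping backends means
    swapping this class; instrumented code never sees it directly."""
    def export(self, span):
        print(f"{span['name']}: {span['duration_ms']:.1f}ms")

class Tracer:
    """Application code talks only to this interface, never to a backend."""
    def __init__(self, exporter):
        self.exporter = exporter

    @contextmanager
    def span(self, name):
        start = time.monotonic()
        record = {"name": name}
        try:
            yield record
        finally:
            record["duration_ms"] = (time.monotonic() - start) * 1000
            self.exporter.export(record)

# Instrument once: nested spans form the causal chain a trace captures.
tracer = Tracer(ConsoleExporter())
with tracer.span("checkout"):
    with tracer.span("payment-service"):
        time.sleep(0.01)
```

To change backends you construct `Tracer` with a different exporter; none of the `with tracer.span(...)` call sites change. Real OTel works the same way, with span processors and exporters configured once at startup.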
The Dashboard Hierarchy
Not everyone needs the same view. Build for different audiences:

Executive / Product view:
- Overall system health (SLO compliance)
- User-facing error rates
- Business metric health (orders processed, signups, etc.)

Engineering / On-call view:
- Golden signals per service
- Error budget burn rate
- Recent deployments and their impact
- Active alerts with runbook links

Debugging view:
- Request traces with full context
- Log correlation by request ID
- Infrastructure metrics (CPU, memory, database connections)
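One item above, error budget burn rate, deserves a concrete definition: it is how fast you are consuming the budget relative to a steady spend across the whole SLO window. A sketch of the calculation (the 1.0 threshold means “on pace to exactly exhaust the budget”):

```python
def burn_rate(errors: int, requests: int, slo: float) -> float:
    """Observed error fraction divided by the budgeted error fraction.
    1.0 = on pace to exactly spend the budget over the window;
    above 1.0 = burning too fast and worth alerting on."""
    observed_error_fraction = errors / requests
    budget_fraction = 1.0 - slo
    return observed_error_fraction / budget_fraction

# 50 errors in 10,000 requests against a 99.9% SLO:
print(round(burn_rate(50, 10_000, 0.999), 2))  # 5.0 — burning 5x too fast
```

Dashboards typically plot burn rate over short and long windows so a brief spike and a slow leak are both visible.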
Starting From Zero: A Practical Sequence
If you’re building observability from scratch, do it in this order:

1. Structured logging first — Add structured JSON logging everywhere. This gives you the most immediate value and requires no infrastructure.
2. Correlation IDs — Generate a request ID at the entry point and pass it through every log statement and service call. This is the foundation of tracing.
3. Four golden signals — Instrument latency, error rate, traffic, and saturation for your most critical service.
4. Alerts on error rate — Before building dashboards, get paged when errors spike. This is the most urgent safety net.
5. SLO definitions — Define what “healthy” means for each service before building more dashboards.
6. Distributed traces — Once you have SLOs and golden signals, add distributed tracing to make debugging fast.
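The first two steps need nothing but the standard library. A minimal sketch — field names like `request_id` are a common convention, not a standard:

```python
import json
import time
import uuid

def make_request_id() -> str:
    """Generated once at the entry point, then passed everywhere."""
    return uuid.uuid4().hex

def log_event(event: str, request_id: str, **fields):
    """One structured JSON line per event: machine-parseable, queryable,
    and carrying the correlation ID so every line for a request links up."""
    record = {
        "ts": time.time(),
        "event": event,
        "request_id": request_id,
        **fields,
    }
    print(json.dumps(record))

request_id = make_request_id()
log_event("request.received", request_id, path="/checkout", method="POST")
log_event("payment.charged", request_id, amount_cents=1299)
log_event("request.completed", request_id, status=200, duration_ms=145)
```

Because every line shares the same `request_id`, a single filter in your log backend reconstructs the whole request — which is exactly the groundwork distributed tracing builds on in step 6.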
