Monitoring tells you something is wrong. Observability tells you why. Most teams have monitoring. They have uptime checks, error rate alerts, and dashboards that go red when things break. What they don’t have is the ability to answer a question they’ve never asked before using data they’re already collecting. That’s the goal of observability: make your system understandable from the outside, without having to redeploy or add instrumentation every time you hit a new kind of failure.

The Three Pillars

Observability is built on three types of telemetry. Understanding each, and how they complement each other, is the foundation.

Logs: What Happened

Logs are the narrative of your system — timestamped records of individual events. Structured logs beat unstructured logs in every way:
// Bad: unstructured log
console.log(`User ${userId} uploaded file ${filename} at ${new Date()}`);

// Good: structured JSON log
logger.info({
  event: "file_uploaded",
  userId,
  filename,
  fileSize: file.size,
  mimeType: file.type,
  durationMs: Date.now() - startTime,
  requestId,  // Correlation ID to trace the request
});
Structured logs are queryable. “Show me all file uploads over 10MB in the last hour” takes 2 seconds. With unstructured logs, it’s a regex nightmare.
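That queryability can be sketched as a plain filter over parsed entries. This assumes the log lines have been parsed into objects matching the example above, plus a hypothetical `timestamp` field that the logger would normally add:

```typescript
// Hypothetical entry shape; field names mirror the structured log example above.
type LogEntry = { event: string; fileSize?: number; timestamp?: number };

// "All file uploads over 10MB in the last hour" as a simple filter.
function largeRecentUploads(entries: LogEntry[], nowMs: number): LogEntry[] {
  const hourAgoMs = nowMs - 60 * 60 * 1000;
  return entries.filter(e =>
    e.event === 'file_uploaded' &&
    (e.fileSize ?? 0) > 10 * 1024 * 1024 &&
    (e.timestamp ?? 0) >= hourAgoMs
  );
}
```

In practice your log backend runs this kind of query for you; the point is that with structured fields it is a lookup, not a regex.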

Metrics: How Much / How Fast

Metrics are numerical measurements collected over time. Unlike logs, they’re aggregated — you don’t store every individual value, you store counts, sums, and distributions. The metrics worth tracking for most services:
| Metric | What it tells you | How to measure |
|---|---|---|
| Request rate | Traffic volume and patterns | Counter: increment per request |
| Error rate | System health | Counter: increment per 4xx/5xx |
| Latency (p50, p95, p99) | User experience | Histogram: record duration per request |
| Saturation | How close you are to capacity | Gauge: CPU %, queue depth, connection pool usage |
// Prometheus-style metrics with prom-client
import client from 'prom-client';

const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_ms',
  help: 'Duration of HTTP requests in ms',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000]
});

// In your request handler middleware (Express-style):
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    // req.route is only populated after routing, so fall back to the raw path
    end({ method: req.method, route: req.route?.path ?? req.path, status_code: res.statusCode });
  });
  next();
});

Traces: Why It Happened

Traces follow a request through every service and function call it touches — a causal chain from the user’s action to the database query and back. Without traces, you know the request took 267ms. With traces, you see the payment service took 145ms and has high variance — that’s where to look.
import { trace, context, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function createOrder(cart: Cart): Promise<Order> {
  return tracer.startActiveSpan('order.create', async (span) => {
    span.setAttributes({
      'order.cart_id': cart.id,
      'order.item_count': cart.items.length,
      'order.total': cart.total,
    });

    try {
      const inventory = await checkInventory(cart); // Creates child span
      const order = await persistOrder(cart, inventory); // Creates child span
      span.setStatus({ code: SpanStatusCode.OK });
      return order;
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}

The Vocabulary That Matters

These terms get confused constantly. Here’s the precise definition of each:
| Term | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | A measurable signal about service quality | "p95 API latency = 342ms" |
| SLO (Service Level Objective) | Your internal target for an SLI | "p95 latency < 500ms for 99% of requests" |
| SLA (Service Level Agreement) | An external contractual promise, often with financial consequences | "99.9% uptime guarantee in customer contract" |
| Error budget | How much you're allowed to fail: 1 − SLO | If SLO is 99.9%, error budget is 0.1% downtime per month (~43 minutes) |
The relationship matters: SLAs are promises to customers. SLOs are your internal targets, tighter than the SLA to give you buffer. SLIs are the measurements you use to know whether you’re meeting your SLOs.

Error budgets are the most important concept to internalize. If your monthly error budget is 43 minutes and you’ve used 38 minutes in week 2, you need to slow down risky changes. If you have 40 minutes left on the last day of the month, you can move fast.
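The error budget arithmetic is simple enough to sketch. The helper names here are illustrative, and the numbers assume a 30-day month:

```typescript
// Minutes of allowed downtime for an availability SLO over a 30-day month.
function errorBudgetMinutes(slo: number, daysInMonth = 30): number {
  return (1 - slo) * daysInMonth * 24 * 60;
}

// Fraction of the budget already consumed; > 1 means the budget is blown.
function budgetConsumed(downtimeMinutes: number, slo: number, daysInMonth = 30): number {
  return downtimeMinutes / errorBudgetMinutes(slo, daysInMonth);
}

errorBudgetMinutes(0.999);  // ~43.2 minutes
budgetConsumed(38, 0.999);  // ~0.88 — most of the month's budget gone
```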

The Four Golden Signals

Google’s SRE book introduced the four golden signals — the minimum you need to tell if a service is healthy:
  • Latency: how long it takes to serve a request
  • Traffic: how much demand the system is handling
  • Errors: the rate of requests that fail
  • Saturation: how “full” the service is
If you can only instrument four things, instrument these four. If a service has healthy golden signals, it’s almost certainly working. If they look bad, you know where to start.

OpenTelemetry: The Standard Worth Adopting

OpenTelemetry (OTel) is the vendor-neutral standard for observability instrumentation. It matters because it lets you:
  • Instrument your code once
  • Send telemetry to any backend (Honeycomb, Datadog, Grafana, etc.)
  • Switch backends without changing application code
Setup in a Node.js service:
// instrumentation.ts — must be loaded before everything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';

const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter(),
    exportIntervalMillis: 60000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations(), // Auto-instruments HTTP, DB, gRPC etc.
  ],
});

sdk.start();
With auto-instrumentation, you get traces for HTTP requests, database queries, and outbound calls without writing any trace code. Add custom spans for your business logic on top.

The Dashboard Hierarchy

Not everyone needs the same view. Build for different audiences.

Executive / Product view:
  • Overall system health (SLO compliance)
  • User-facing error rates
  • Business metric health (orders processed, signups, etc.)
On-call / Engineering view:
  • Golden signals per service
  • Error budget burn rate
  • Recent deployments and their impact
  • Active alerts with runbook links
Debugging view (per incident):
  • Request traces with full context
  • Log correlation by request ID
  • Infrastructure metrics (CPU, memory, database connections)

Starting From Zero: A Practical Sequence

If you’re building observability from scratch, do it in this order:
  1. Structured logging first — Add structured JSON logging everywhere. This gives you the most immediate value and requires no infrastructure.
  2. Correlation IDs — Generate a request ID at the entry point and pass it through every log statement and service call. This is the foundation of tracing.
  3. Four golden signals — Instrument latency, error rate, traffic, and saturation for your most critical service.
  4. Alerts on error rate — Before building dashboards, get paged when errors spike. This is the most urgent safety net.
  5. SLO definitions — Define what “healthy” means for each service before building more dashboards.
  6. Distributed traces — Once you have SLOs and golden signals, add distributed tracing to make debugging fast.
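Step 2 above (correlation IDs) can be sketched with Node’s AsyncLocalStorage, so every log line in a request’s lifetime picks up the same ID without threading it through every function signature. The header name and helper functions are illustrative:

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

const requestContext = new AsyncLocalStorage<{ requestId: string }>();

// Express-style middleware: reuse an inbound x-request-id or mint a new one.
function correlationMiddleware(req: any, res: any, next: () => void) {
  const requestId = (req.headers['x-request-id'] as string) ?? randomUUID();
  res.setHeader('x-request-id', requestId);
  requestContext.run({ requestId }, next);
}

// Logging helper: every entry automatically carries the current request ID.
function logInfo(fields: Record<string, unknown>) {
  const requestId = requestContext.getStore()?.requestId;
  console.log(JSON.stringify({ ...fields, requestId }));
}
```

Forwarding the same ID on outbound calls (e.g. as an `x-request-id` header) is what lets you correlate logs across services before you have full distributed tracing.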
The temptation is to build comprehensive dashboards first. Resist it. The most valuable observability investment in the early stages is getting paged on real problems and having enough information to debug them.