Monitoring tells you something is wrong. Observability tells you why. Most teams have monitoring. They have uptime checks, error rate alerts, and dashboards that go red when things break. What they don’t have is the ability to answer a question they’ve never asked before using data they’re already collecting. That’s the goal of observability: make your system understandable from the outside, without having to redeploy or add instrumentation every time you hit a new kind of failure.

The Three Pillars

Observability is built on three types of telemetry. Understanding each, and how they complement each other, is the foundation.

Logs: What Happened

Logs are the narrative of your system — timestamped records of individual events. Structured logs beat unstructured logs in every way:
// Bad: unstructured log
console.log(`User ${userId} uploaded file ${filename} at ${new Date()}`);

// Good: structured JSON log
logger.info({
  event: "file_uploaded",
  userId,
  filename,
  fileSize: file.size,
  mimeType: file.type,
  durationMs: Date.now() - startTime,
  requestId,  // Correlation ID to trace the request
});
Structured logs are queryable. “Show me all file uploads over 10MB in the last hour” takes 2 seconds. With unstructured logs, it’s a regex nightmare.
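That queryability can be sketched as a plain filter over parsed entries. This assumes the log lines have been parsed into objects matching the example above, plus a hypothetical `timestamp` field that the logger would normally add:

```typescript
// Hypothetical entry shape; field names mirror the structured log example above.
type LogEntry = { event: string; fileSize?: number; timestamp?: number };

// "All file uploads over 10MB in the last hour" as a simple filter.
function largeRecentUploads(entries: LogEntry[], nowMs: number): LogEntry[] {
  const hourAgoMs = nowMs - 60 * 60 * 1000;
  return entries.filter(e =>
    e.event === 'file_uploaded' &&
    (e.fileSize ?? 0) > 10 * 1024 * 1024 &&
    (e.timestamp ?? 0) >= hourAgoMs
  );
}
```

In practice your log backend runs this kind of query for you; the point is that with structured fields it is a lookup, not a regex.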

Metrics: How Much / How Fast

Metrics are numerical measurements collected over time. Unlike logs, they’re aggregated — you don’t store every individual value, you store counts, sums, and distributions. The metrics worth tracking for most services:
| Metric | What it tells you | How to measure |
|---|---|---|
| Request rate | Traffic volume and patterns | Counter: increment per request |
| Error rate | System health | Counter: increment per 4xx/5xx |
| Latency (p50, p95, p99) | User experience | Histogram: record duration per request |
| Saturation | How close you are to capacity | Gauge: CPU %, queue depth, connection pool usage |
// Prometheus-style metrics with prom-client
import client from 'prom-client';

const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_ms',
  help: 'Duration of HTTP requests in ms',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000]
});

// In your request handler middleware (Express-style):
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    // req.route is only populated after routing, so fall back to the raw path
    end({ method: req.method, route: req.route?.path ?? req.path, status_code: res.statusCode });
  });
  next();
});

Traces: Why It Happened

Traces follow a request through every service and function call it touches — a causal chain from the user’s action to the database query and back. Without traces, you know the request took 267ms. With traces, you see the payment service took 145ms and has high variance — that’s where to look.
import { trace, context, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function createOrder(cart: Cart): Promise<Order> {
  return tracer.startActiveSpan('order.create', async (span) => {
    span.setAttributes({
      'order.cart_id': cart.id,
      'order.item_count': cart.items.length,
      'order.total': cart.total,
    });

    try {
      const inventory = await checkInventory(cart); // Creates child span
      const order = await persistOrder(cart, inventory); // Creates child span
      span.setStatus({ code: SpanStatusCode.OK });
      return order;
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}

The Vocabulary That Matters

These terms get confused constantly. Here’s the precise definition of each:
| Term | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | A measurable signal about service quality | "p95 API latency = 342ms" |
| SLO (Service Level Objective) | Your internal target for an SLI | "p95 latency < 500ms for 99% of requests" |
| SLA (Service Level Agreement) | An external contractual promise, often with financial consequences | "99.9% uptime guarantee in customer contract" |
| Error budget | How much you're allowed to fail: 1 − SLO | If SLO is 99.9%, error budget is 0.1% downtime per month (~43 minutes) |
The relationship matters: SLAs are promises to customers. SLOs are your internal targets, tighter than the SLA to give you buffer. SLIs are the measurements you use to know whether you’re meeting your SLOs.

Error budgets are the most important concept to internalize. If your monthly error budget is 43 minutes and you’ve used 38 minutes in week 2, you need to slow down risky changes. If you have 40 minutes left on the last day of the month, you can move fast.
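The error budget arithmetic is simple enough to sketch. The helper names here are illustrative, and the numbers assume a 30-day month:

```typescript
// Minutes of allowed downtime for an availability SLO over a 30-day month.
function errorBudgetMinutes(slo: number, daysInMonth = 30): number {
  return (1 - slo) * daysInMonth * 24 * 60;
}

// Fraction of the budget already consumed; > 1 means the budget is blown.
function budgetConsumed(downtimeMinutes: number, slo: number, daysInMonth = 30): number {
  return downtimeMinutes / errorBudgetMinutes(slo, daysInMonth);
}

errorBudgetMinutes(0.999);  // ~43.2 minutes
budgetConsumed(38, 0.999);  // ~0.88 — most of the month's budget gone
```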

The Four Golden Signals

Google’s SRE book introduced the four golden signals — the minimum you need to tell if a service is healthy:
  • Latency: how long it takes to serve a request
  • Traffic: how much demand the system is handling
  • Errors: the rate of requests that fail
  • Saturation: how “full” the service is
If you can only instrument four things, instrument these four. If a service has healthy golden signals, it’s almost certainly working. If they look bad, you know where to start.

OpenTelemetry: The Standard Worth Adopting

OpenTelemetry (OTel) is the vendor-neutral standard for observability instrumentation. It matters because it lets you:
  • Instrument your code once
  • Send telemetry to any backend (Honeycomb, Datadog, Grafana, etc.)
  • Switch backends without changing application code
Setup in a Node.js service:
// instrumentation.ts — must be loaded before everything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';

const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter(),
    exportIntervalMillis: 60000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations(), // Auto-instruments HTTP, DB, gRPC etc.
  ],
});

sdk.start();
With auto-instrumentation, you get traces for HTTP requests, database queries, and outbound calls without writing any trace code. Add custom spans for your business logic on top.

The Dashboard Hierarchy

Not everyone needs the same view. Build for different audiences.

Executive / Product view:
  • Overall system health (SLO compliance)
  • User-facing error rates
  • Business metric health (orders processed, signups, etc.)
On-call / Engineering view:
  • Golden signals per service
  • Error budget burn rate
  • Recent deployments and their impact
  • Active alerts with runbook links
Debugging view (per incident):
  • Request traces with full context
  • Log correlation by request ID
  • Infrastructure metrics (CPU, memory, database connections)

Starting From Zero: A Practical Sequence

If you’re building observability from scratch, do it in this order:
  1. Structured logging first — Add structured JSON logging everywhere. This gives you the most immediate value and requires no infrastructure.
  2. Correlation IDs — Generate a request ID at the entry point and pass it through every log statement and service call. This is the foundation of tracing.
  3. Four golden signals — Instrument latency, error rate, traffic, and saturation for your most critical service.
  4. Alerts on error rate — Before building dashboards, get paged when errors spike. This is the most urgent safety net.
  5. SLO definitions — Define what “healthy” means for each service before building more dashboards.
  6. Distributed traces — Once you have SLOs and golden signals, add distributed tracing to make debugging fast.
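Step 2 above (correlation IDs) can be sketched with Node’s AsyncLocalStorage, so every log line in a request’s lifetime picks up the same ID without threading it through every function signature. The header name and helper functions are illustrative:

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

const requestContext = new AsyncLocalStorage<{ requestId: string }>();

// Express-style middleware: reuse an inbound x-request-id or mint a new one.
function correlationMiddleware(req: any, res: any, next: () => void) {
  const requestId = (req.headers['x-request-id'] as string) ?? randomUUID();
  res.setHeader('x-request-id', requestId);
  requestContext.run({ requestId }, next);
}

// Logging helper: every entry automatically carries the current request ID.
function logInfo(fields: Record<string, unknown>) {
  const requestId = requestContext.getStore()?.requestId;
  console.log(JSON.stringify({ ...fields, requestId }));
}
```

Forwarding the same ID on outbound calls (e.g. as an `x-request-id` header) is what lets you correlate logs across services before you have full distributed tracing.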
The temptation is to build comprehensive dashboards first. Resist it. The most valuable observability investment in the early stages is getting paged on real problems and having enough information to debug them.