The Core Principle
Before picking tools, establish the principle: buy observability, don’t build it. Observability infrastructure is not a differentiator. Your logging format isn’t a competitive advantage. The time spent building custom dashboards is time not spent on the product. Pick a managed stack that gives you:
- Structured log ingestion
- Metrics with alerting
- Distributed traces
- Cost you can afford as you scale
The Stack
OpenTelemetry: The Foundation
What it is: The vendor-neutral standard for collecting traces, metrics, and logs from your application.

Why it matters: It’s the exit ramp from vendor lock-in. Instrument your code once with OTel, then point the collector at any backend. When you want to switch from Datadog to Honeycomb (or vice versa), you change collector config, not application code.

Setup (Node.js): a small bootstrap file loaded before your app starts.
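A minimal sketch of that bootstrap, assuming the OTLP HTTP exporter and auto-instrumentations packages; the service name and collector endpoint are placeholders:

```javascript
// tracing.js — load before the app, e.g. node -r ./tracing.js server.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  serviceName: 'my-api', // placeholder
  traceExporter: new OTLPTraceExporter({
    // Point at an OTel Collector; swap the collector's backend
    // (Honeycomb, Datadog, ...) without touching application code.
    url: 'http://localhost:4318/v1/traces',
  }),
  // Auto-instruments http, express, pg, redis, and friends.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
process.on('SIGTERM', () => sdk.shutdown());
```

Because all backend routing lives in the collector config, this file rarely changes after day one.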
Traces: Honeycomb
What it is: Distributed tracing backend optimized for high-cardinality queries and exploratory debugging.

Why I use it over alternatives: Honeycomb’s data model treats every span as a structured event you can query on any field. That means I can ask “show me traces where user_id=123 AND response_time > 2000ms AND the error occurred in the payment service” in real time, without pre-aggregating. Datadog can do this too, but Honeycomb is roughly 3-5x cheaper for the same query capability.

When it earns its place: The first time you debug a production issue by tracing a specific user’s broken request through 8 services in 30 seconds instead of 3 hours of log grepping, you’ll understand why it’s worth paying for.

Honest limitation: Honeycomb’s SLO and alerting UI is less mature than Grafana’s. I use Grafana for SLO dashboards and Honeycomb for trace-based investigation.

Metrics: Prometheus + Grafana
What it is: Prometheus is the time-series metrics database; Grafana is the visualization layer.

Why this combination: It’s the industry standard for a reason: a deep ecosystem (every tool has a Prometheus exporter), a powerful query language (PromQL), and Grafana can visualize any data source: Prometheus, Loki, Tempo, InfluxDB, Postgres.

Setup for a Node.js service: expose a /metrics endpoint for Prometheus to scrape, then build one dashboard with these panels:
- Request rate (req/s) over time
- Error rate (%) over time — alert when > 1%
- Latency p50/p95/p99 over time
- Error budget burn rate gauge
- Infrastructure metrics (CPU, memory, active connections)
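The instrumentation side of those panels might look like this sketch, assuming the prom-client package; the metric name, labels, buckets, and port are illustrative:

```javascript
const http = require('http');
const client = require('prom-client');

// CPU, memory, event-loop lag, GC — the infrastructure panels for free.
client.collectDefaultMetrics();

// One histogram drives the rate, error-rate, and latency panels:
// rate() of its _count gives req/s, histogram_quantile() gives p50/p95/p99.
const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2, 5],
});

http.createServer(async (req, res) => {
  if (req.url === '/metrics') {
    // Prometheus scrapes this endpoint.
    res.setHeader('Content-Type', client.register.contentType);
    res.end(await client.register.metrics());
    return;
  }
  const end = httpDuration.startTimer({ method: req.method, route: req.url });
  res.end('ok');
  end({ status: res.statusCode });
}).listen(9100);
```

In a real service you would normalize the route label (e.g. /users/:id, not /users/123) to keep cardinality bounded.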
Logs: Loki + S3
What it is: Loki is a log aggregation system from Grafana Labs. Unlike Elasticsearch, it indexes only metadata (labels), not the full log content, which makes it dramatically cheaper at scale.

Why Loki over Elasticsearch/OpenSearch: Cost. Elasticsearch indexes everything, which means high storage and memory costs. Loki stores raw logs in S3 and indexes only labels. For equivalent log volume, Loki costs roughly 1/10th of Elasticsearch.

The tradeoff: Full-text search is slower on Loki. If you’re searching for a specific string inside a log message, Loki has to scan the raw storage rather than query an index. For most debugging workflows this is fine. For compliance or security log analysis with frequent full-text searches, Elasticsearch is worth the cost.

One rule, whatever the backend: log structured JSON. If your logs look like console.log("User logged in"), change it to logger.info({ event: "user.login", userId, ip }). Structured logs are queryable; unstructured logs are grep-able at best.

Log shipping: Vector tails your log files and pushes them to Loki.
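A hypothetical Vector config for that pipeline; the file paths, labels, and Loki endpoint are placeholders:

```yaml
# vector.yaml — tail JSON log files, ship to Loki
sources:
  app:
    type: file
    include:
      - /var/log/app/*.log

transforms:
  parse:
    type: remap
    inputs: [app]
    source: |
      # Lines are JSON; promote them to structured events
      . = parse_json!(.message)

sinks:
  loki:
    type: loki
    inputs: [parse]
    endpoint: http://loki:3100
    encoding:
      codec: json
    labels:
      # Keep labels low-cardinality — they are Loki's only index
      service: my-api
      env: production
```

Note the labels: a handful of static, low-cardinality values. High-cardinality fields (user IDs, request IDs) belong in the log body, not labels, or Loki’s index explodes.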
Alerting: PagerDuty
What it is: On-call management and alert routing.

Why not just use Slack: Slack alerts get buried. PagerDuty wakes people up. When a P1 fires at 3 AM, you need a system that actually pages someone, escalates if nobody responds, and links straight to a runbook. PagerDuty does this well.

Alert routing setup:

Page immediately (P1):
- Error rate > 5% for 5+ minutes
- p95 latency > 2x SLO for 10+ minutes
- Error budget burn rate > 14x (will exhaust the budget in <2 days)
- Complete service unavailability

Ticket, no page (P2):
- Error rate 1-5% (investigate next business day)
- Latency elevated but below SLO breach
- Cost anomaly (important but not urgent)
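As a sketch, the first page-immediately condition could be a Prometheus rule that Alertmanager routes to PagerDuty; the metric name, severity label, and runbook URL are illustrative:

```yaml
groups:
  - name: p1-alerts
    rules:
      - alert: HighErrorRate
        # 5xx responses as a share of all requests, over 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: page   # Alertmanager routes this label to PagerDuty
        annotations:
          summary: Error rate above 5% for 5 minutes
          runbook_url: https://runbooks.example.com/high-error-rate
```

The `for: 5m` clause is what keeps a 30-second blip from waking anyone up.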
Frontend Observability: Replay.io
What it is: Session replay with DevTools-level debugging. Unlike FullStory or LogRocket, Replay.io records a deterministic replay: you can open DevTools on the recording and step through the JavaScript execution.

When I use it: When users report frontend bugs that I can’t reproduce. I send affected users a link to record their session, they trigger the bug, and I get a full replay with console logs, network requests, and the ability to add console.logs retroactively.

Honest take: This is a premium debugging tool, not basic observability. Skip it if budget is tight. Use it if hard-to-reproduce frontend bugs are a recurring pain point.

Cost Observability
Infrastructure cost is an SLO, not a finance problem. Here’s how I track it.

AWS Cost Anomaly Detection (free with AWS): it learns your normal spend patterns and flags deviations automatically, so you hear about the runaway NAT gateway before the monthly bill does.
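Setup via the AWS CLI might look like this sketch; the monitor name, ARN, threshold, and email are placeholders:

```shell
# Monitor spend anomalies per AWS service
aws ce create-anomaly-monitor \
  --anomaly-monitor '{"MonitorName":"per-service-spend","MonitorType":"DIMENSIONAL","MonitorDimension":"SERVICE"}'

# Subscribe the team; the ARN comes from the command above
aws ce create-anomaly-subscription \
  --anomaly-subscription '{
    "SubscriptionName": "daily-spend-alerts",
    "MonitorArnList": ["arn:aws:ce::123456789012:anomalymonitor/EXAMPLE"],
    "Subscribers": [{"Type": "EMAIL", "Address": "oncall@example.com"}],
    "Frequency": "DAILY",
    "Threshold": 100
  }'
```

Per the P2 list above, this goes to email or a ticket queue, not PagerDuty.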
The Minimum Viable Observability Stack
If you’re starting from zero and have limited time:

| Priority | Tool | Cost | Time to set up |
|---|---|---|---|
| 1 | Structured logging (Pino/Winston → stdout) | Free | 1 hour |
| 2 | Error tracking (Sentry) | Free tier available | 30 minutes |
| 3 | Uptime monitoring (Better Uptime) | Free tier | 15 minutes |
| 4 | Prometheus metrics + Grafana Cloud | Free tier | 2 hours |
| 5 | Honeycomb tracing | Free tier (20GB/month) | 2 hours |
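Priority 1 needs no infrastructure at all. A dependency-free sketch of the idea (in practice, reach for Pino or Winston; field names here are illustrative):

```javascript
// Minimal structured logger to stdout — a sketch of what Pino
// gives you out of the box.
function makeLogger(base = {}) {
  const emit = (level) => (fields, msg) =>
    console.log(JSON.stringify({
      level,
      time: new Date().toISOString(),
      msg,
      ...base,      // fields shared by every line (service, env, ...)
      ...fields,    // per-event fields: this is what makes logs queryable
    }));
  return { info: emit('info'), warn: emit('warn'), error: emit('error') };
}

const logger = makeLogger({ service: 'api' });
logger.info({ event: 'user.login', userId: 'u_42', ip: '203.0.113.9' }, 'User logged in');
```

Every line is one JSON object on stdout, which is exactly what a shipper like Vector expects to parse.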
