Observability Tooling & Stack

Here’s the toolbox supporting the Observability Galaxy.

1. Telemetry Ingest

  • OpenTelemetry Collector – standardizes traces/logs/metrics.
  • Vector – lightweight agent pushing logs to Loki/S3.
  • Amazon Data Firehose (formerly Kinesis Data Firehose) – delivers telemetry to archival storage for compliance workloads.

2. Storage & Query

Need     | Tool                    | Why
Metrics  | Prometheus + CloudWatch | Open-source control + managed retention.
Logs     | Loki + S3 Glacier       | Cost-effective, query via Grafana + Athena.
Traces   | Honeycomb               | High-cardinality powerhouse for debugging.
Replay   | Replay.io               | Visual context for frontend incidents.

3. Visualization & Alerting

  • Grafana dashboards with SLO panels, error-budget burn charts, and cost overlays.
  • Datadog for exec-friendly dashboards and anomaly detection.
  • Slackbot (custom) posting daily SLO + cost digests.
  • PagerDuty integration with SLO alerts and on-call schedules.
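
To make the SLO alerts and burn charts above concrete, here is a minimal sketch of an error-budget burn-rate calculation. The function names and the two-window paging rule are illustrative assumptions, not the team's actual alert definitions; the 14.4 default is the fast-burn threshold commonly quoted for a 30-day SLO window, not a value taken from this stack.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    A burn rate of 1.0 means the budget would be used up exactly at the
    end of the SLO window; higher values mean it runs out sooner.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        return 0.0  # no traffic, so nothing was burned
    error_rate = bad_events / total_events
    error_budget = 1.0 - slo_target
    return error_rate / error_budget


def should_page(short_window_rate: float, long_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Page only when both windows exceed the threshold (hypothetical rule).

    Requiring both windows filters out short spikes that have already
    recovered by the time the alert would fire.
    """
    return short_window_rate >= threshold and long_window_rate >= threshold


if __name__ == "__main__":
    rate = burn_rate(bad_events=200, total_events=1000, slo_target=0.99)
    print(f"burn rate: {rate:.1f}")
```

The same ratio can be plotted over time as a burn chart or evaluated as an alert rule whose result is routed to on-call.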

4. Developer Experience

  • otel-cli baked into template repos for quick instrumentation.
  • VS Code snippets for adding SLIs to services.
  • Runbook generator script scaffolds MDX for incidents.
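
As a rough idea of what a runbook scaffold might look like, the sketch below writes an MDX stub under the runbooks/observability/tools/ path used in Getting Started. The template headings, the `scaffold_runbook` name, and the `root` parameter are hypothetical; the real generator script may differ.

```python
from pathlib import Path

# Hypothetical template; the section headings are placeholders.
TEMPLATE = """---
title: {service} observability runbook
---

# {service}

## Dashboards

## Alerts

## Escalation
"""


def scaffold_runbook(root: Path, service: str) -> Path:
    """Create runbooks/observability/tools/<service>.mdx under root.

    An existing file is left untouched so hand-written notes are not
    overwritten when the script is re-run.
    """
    path = root / "runbooks" / "observability" / "tools" / f"{service}.mdx"
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():
        path.write_text(TEMPLATE.format(service=service), encoding="utf-8")
    return path


if __name__ == "__main__":
    print(scaffold_runbook(Path("."), "checkout"))
```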

5. Cost Observability Tooling

  • CloudZero or Finch for spend allocation.
  • Custom Metabase dashboards fed by CUR (Cost & Usage Reports).
  • Alerts when a product's spend deviates by more than 10% week-over-week.
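
The week-over-week check can be sketched as follows. This assumes weekly spend per product has already been aggregated (for example from CUR data) and that "deviates" covers both increases and decreases; the function names and input shape are illustrative only.

```python
from typing import Dict, Optional, Tuple


def week_over_week_deviation(previous: float, current: float) -> Optional[float]:
    """Return the fractional change from previous to current spend.

    Returns None when there is no previous spend to compare against.
    """
    if previous <= 0:
        return None
    return (current - previous) / previous


def flag_products(spend: Dict[str, Tuple[float, float]],
                  threshold: float = 0.10) -> Dict[str, float]:
    """Map each product whose spend moved more than threshold to its deviation.

    spend maps product name -> (previous_week, current_week).
    """
    flagged = {}
    for product, (previous, current) in spend.items():
        deviation = week_over_week_deviation(previous, current)
        if deviation is not None and abs(deviation) > threshold:
            flagged[product] = deviation
    return flagged


if __name__ == "__main__":
    weekly = {
        "search": (1000.0, 1150.0),
        "checkout": (2000.0, 2100.0),
        "ads": (500.0, 400.0),
    }
    for product, deviation in flag_products(weekly).items():
        print(f"{product}: {deviation:+.0%}")
```

Products with no spend in the previous week are skipped here instead of being flagged; whether new products should alert is a policy choice this sketch leaves open.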

6. Getting Started

  1. Add the OpenTelemetry SDK to your service (see the template repo).
  2. Configure the Collector via the Helm chart (infra/observability folder).
  3. Create dashboards + alerts via Terraform modules.
  4. Document in /runbooks/observability/tools/[service].mdx.

This stack evolves frequently; check the repo issues for upcoming experiments (e.g., Grafana Faro, SigNoz).