Observability Tooling & Stack
Here’s the toolbox supporting the Observability Galaxy.
1. Telemetry Ingest
- OpenTelemetry Collector – standardizes traces/logs/metrics.
- Vector – lightweight agent pushing logs to Loki/S3.
- AWS Kinesis Firehose – archival for compliance workloads.
2. Storage & Query
| Need | Tool | Why |
|---|---|---|
| Metrics | Prometheus + CloudWatch | Open-source control + managed retention. |
| Logs | Loki + S3 Glacier | Cost-effective, query via Grafana + Athena. |
| Traces | Honeycomb | High-cardinality powerhouse for debugging. |
| Replay | Replay.io | Visual context for frontend incidents. |
3. Visualization & Alerting
- Grafana dashboards with SLO panels, burn charts, cost overlays.
- Datadog for exec-friendly dashboards and anomaly detection.
- Slackbot (custom) posting daily SLO + cost digests.
- PagerDuty integration with SLO alerts and on-call schedules.
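SLO alerts like the ones routed to PagerDuty are commonly driven by error-budget burn rate. The following is a minimal sketch of that calculation, not this stack's actual alert rules: the function names, the 99.9% target in the usage lines, and the 14.4 paging threshold (a widely used value for a one-hour window against a 30-day SLO) are all assumptions for illustration.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    1.0 means the budget would be spent exactly at the end of the SLO window;
    values above 1.0 mean it would run out early.
    """
    if total == 0:
        return 0.0  # no traffic, so nothing was burned
    error_budget = 1.0 - slo_target  # allowed failure ratio, e.g. 0.001
    return (errors / total) / error_budget


def should_page(short_rate: float, long_rate: float, threshold: float = 14.4) -> bool:
    """Page only when both the short and the long window exceed the threshold.

    The threshold default is an assumed value, not taken from this stack.
    """
    return short_rate >= threshold and long_rate >= threshold


# Hypothetical usage: 5-minute and 1-hour windows for a 99.9% SLO.
short = burn_rate(errors=30, total=1_000, slo_target=0.999)
long_ = burn_rate(errors=200, total=12_000, slo_target=0.999)
print(should_page(short, long_))
```

Requiring both windows to breach keeps short spikes from paging while still reacting quickly to sustained burn.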
4. Developer Experience
- `otel-cli` baked into template repos for quick instrumentation.
- VS Code snippets for adding SLIs to services.
- Runbook generator script scaffolds MDX for incidents.
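A runbook generator of this kind can be quite small. The sketch below is hypothetical: the function name, the frontmatter fields, and the section headings are assumptions chosen for illustration, not the actual script.

```python
from datetime import date


def scaffold_runbook(service: str, owner: str, created: date) -> str:
    """Return an MDX runbook skeleton for one service (hypothetical layout)."""
    # Assumed section names; adjust to whatever the real template uses.
    sections = ["Symptoms", "Dashboards", "Mitigation", "Escalation"]
    lines = [
        "---",
        f"title: {service} runbook",
        f"owner: {owner}",
        f"created: {created.isoformat()}",
        "---",
        "",
    ]
    for heading in sections:
        # Each section starts as a TODO placeholder for the service owner.
        lines += [f"## {heading}", "", "TODO", ""]
    return "\n".join(lines)


# Hypothetical usage: print a skeleton for a service called "checkout".
print(scaffold_runbook("checkout", "payments-team", date(2024, 1, 15)))
```

Returning a string rather than writing a file keeps the function easy to test; the caller decides where the MDX lands.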
5. Cost Observability Tooling
- CloudZero or Finch for spend allocation.
- Custom Metabase dashboards fed by CUR (Cost & Usage Reports).
- Alerts when spend deviates >10% week-over-week per product.
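The week-over-week spend check can be sketched as follows. This assumes weekly totals per product are already aggregated (for example from CUR data) and that "deviates" means a move in either direction; the function names and the sample figures are hypothetical.

```python
def spend_deviation(previous_week: float, current_week: float) -> float:
    """Fractional week-over-week change; 0.10 means spend rose by 10%."""
    if previous_week <= 0:
        raise ValueError("previous_week must be positive to compute a ratio")
    return (current_week - previous_week) / previous_week


def products_to_alert(
    spend: dict[str, tuple[float, float]], threshold: float = 0.10
) -> list[str]:
    """Return products whose spend moved more than `threshold` either way.

    `spend` maps product name to (previous_week, current_week) totals.
    """
    return sorted(
        name
        for name, (prev, cur) in spend.items()
        if abs(spend_deviation(prev, cur)) > threshold
    )


# Hypothetical weekly totals in USD.
weekly = {
    "search": (1000.0, 1150.0),    # +15%
    "checkout": (2000.0, 2100.0),  # +5%
    "ads": (500.0, 400.0),         # -20%
}
print(products_to_alert(weekly))
```

Flagging drops as well as spikes is a design choice: a sudden fall in spend often signals a broken pipeline or missing tags rather than real savings.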
6. Getting Started
- Add OpenTelemetry SDK to service (see template repo).
- Configure Collector via Helm chart (`infra/observability` folder).
- Create dashboards + alerts via Terraform modules.
- Document in `/runbooks/observability/tools/[service].mdx`.
Next up: Productivity →
